Has DeepSeek’s OpenAI copying been exposed?

OpenAI boss Sam Altman will attend the Paris summit and an appearance by DeepSeek’s Liang Wenfeng is under discussion – Copyright AFP Lionel BONAVENTURE

Did DeepSeek-R1 train on OpenAI’s model? The answer is ‘yes’, according to new research from Copyleaks, a company that works on AI detection and governance. DeepSeek is the name of a free AI-powered chatbot, which looks, feels and works very much like ChatGPT.

Researchers built a text-fingerprinting tool that can determine which AI model wrote a given text. After training it on thousands of AI-generated samples, the team tested it against known models, and the results were clear:

74.2 percent of DeepSeek-R1’s texts matched OpenAI’s stylistic fingerprint, which strongly suggests DeepSeek trained on OpenAI outputs.

For comparison, Microsoft’s Phi-4 model produced 99.3 percent “disagreement,” meaning its texts resembled no known model’s fingerprint, consistent with independent training. DeepSeek-R1’s overwhelming similarity to OpenAI, by contrast, points strongly to replication.
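Copyleaks has not published the internals of its fingerprinting tool, so the following is only a rough sketch of how a stylistic attribution classifier of this general kind can be built, here using character n-grams and a linear model from scikit-learn; the sample texts, labels, and model choices are placeholders, not the actual Copyleaks setup:

# Illustrative sketch only: a character n-gram stylometric classifier.
# The training texts and labels below are placeholders, not Copyleaks' data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training corpus: passages generated by known models.
samples = ["a passage generated by model A ...", "a passage generated by model B ..."]
labels = ["openai", "claude"]

# Character n-grams capture phrasing and punctuation habits rather than topic.
fingerprinter = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
fingerprinter.fit(samples, labels)

# Attribute an unseen passage to the closest stylistic fingerprint.
print(fingerprinter.predict(["an unseen passage to attribute"])[0])

With enough labelled samples per model, a classifier along these lines learns the kind of stylistic signal the researchers describe, although the production system is certainly more sophisticated.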

OpenAI has made Europe a priority in its expansion of physical offices around the world, with sites in Paris, Brussels and Dublin – Copyright AFP/File Lionel BONAVENTURE

This finding raises concerns about how DeepSeek-R1 was trained, particularly regarding data sourcing, intellectual property rights, and transparency.

The Copyleaks Data Science Team conducted the research, led by Yehonatan Bitton, Shai Nisan, and Elad Bitton. The methodology involved a “unanimous jury” approach, relying on three distinct detection systems to classify AI-generated texts, with a judgment made only when all systems agreed.

There are also operational concerns, since an undisclosed reliance on existing models can reinforce their biases, limit output diversity, and pose legal and ethical risks. Beyond the technical issues, DeepSeek’s claims of a groundbreaking, low-cost training method, if in fact based on unauthorized distillation of OpenAI models, may have misled the market, contributing to NVIDIA’s $593 billion single-day loss in market value and giving DeepSeek an unfair advantage.

The research took a rigorous approach, combining three advanced AI classifiers, each trained on texts from four major models: Claude, Gemini, Llama, and OpenAI. These classifiers identified subtle stylistic features such as sentence structure, vocabulary, and phrasing. What made the method particularly effective was its “unanimous jury” system, under which all three classifiers had to agree before a classification was made.

This ensured a robust check against false positives, resulting in an impressive 99.88 percent precision rate and just a 0.04 percent false-positive rate, accurately identifying texts from both known and unknown AI models.
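Copyleaks has not released the jury code itself, but the voting rule described above is straightforward to sketch. In the hypothetical fragment below, each juror stands in for any classifier exposing a predict method, and an attribution is returned only when all of them agree; otherwise the text is left unattributed:

# Illustrative sketch of a "unanimous jury": three independent classifiers
# must agree before a text is attributed to a specific model.
# The juror objects are placeholders for whatever classifiers are actually used.

def unanimous_attribution(text, jurors):
    """Return the agreed-upon model label, or None if the jury is split."""
    votes = [juror.predict([text])[0] for juror in jurors]
    return votes[0] if len(set(votes)) == 1 else None

# Hypothetical usage, assuming three trained classifiers exist:
# verdict = unanimous_attribution(passage, [classifier_a, classifier_b, classifier_c])
# if verdict is None:
#     print("No unanimous match: text treated as unattributed")
# else:
#     print(f"Attributed to {verdict}")

Requiring unanimity trades coverage for certainty, which is consistent with the very low false-positive rate the team reports.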

“With this research, we have moved beyond general AI detection as we knew it and into model-specific attribution, a breakthrough that fundamentally changes how we approach AI content,” says Shai Nisan, Chief Data Scientist at Copyleaks, in a statement provided to Digital Journal.

Nisan adds: “This capability is crucial for multiple reasons, including improving overall transparency, ensuring ethical AI training practices, and, most importantly, protecting the intellectual property rights of AI technologies and, hopefully, preventing their potential misuse.”
