Ban warnings fly as users dare to probe the “thoughts” of OpenAI’s latest model

OpenAI truly does not want you to know what its latest AI model is “thinking.” Since the company launched its “Strawberry” AI model family last week, touting so-called reasoning abilities with o1-preview and o1-mini, OpenAI has been sending out warning emails and threats of bans to any user who tries to probe how the model works.

Unlike previous AI models from OpenAI, such as GPT-4o, the company trained o1 specifically to work through a step-by-step problem-solving process before generating an answer. When users ask an “o1” model a question in ChatGPT, users have the option of seeing this chain-of-thought process written out in the ChatGPT interface. However, by design, OpenAI hides the raw chain of thought from users, instead presenting a filtered interpretation created by a second AI model.

Nothing is more enticing to enthusiasts than information obscured, so the race has been on among hackers and red-teamers to try to uncover o1’s raw chain of thought using jailbreaking or prompt injection techniques that attempt to trick the model into spilling its secrets. There have been early reports of some successes, but nothing has yet been strongly confirmed.

Along the way, OpenAI is watching through the ChatGPT interface, and the company is reportedly coming down hard on any attempts to probe o1’s reasoning, even among the merely curious.

A screenshot of an “o1-preview” output in ChatGPT with the filtered chain-of-thought section shown just under the “Thinking” subheader.

Credit:

Benj Edwards

One X user reported (confirmed by others, including Scale AI prompt engineer Riley Goodside) that they received a warning email if they used the term “reasoning trace” in conversation with o1. Others say the warning is triggered simply by asking ChatGPT about the model’s “reasoning” at all.

The warning email from OpenAI states that specific user requests have been flagged for violating policies against circumventing safeguards or safety measures. “Please halt this activity and ensure you are using ChatGPT in accordance with our Terms of Use and our Usage Policies,” it reads. “Additional violations of this policy may result in loss of access to GPT-4o with Reasoning,” referring to an internal name for the o1 model.

OpenAI email: "Hello, We are reading out to you as a user of OpenAI's ChatGPT because some of the requests associated with the email [name] have been flagged by our systems to be in violation of our policy against attempting to circumvent safeguards or safety mitigations in our services. Please halt this activity and ensure you are using ChatGPT in accordance with our Terms of Use and our Usage Policies. Additional violations of this policy may result in loss of access to GPT-4o with Reasoning. If you believe this is in error and would like to appeal, please contact us thruogh our help center. - The OpenAI team" — An OpenAI warning email received by a user after asking o1-preview about its reasoning processes.

Credit:

Marco Figueroa via X

Marco Figueroa, who manages Mozilla’s GenAI bug bounty programs, was one of the first to post about the OpenAI warning email on X last Friday, complaining that it hinders his ability to do positive red-teaming safety research on the model. “I was too lost focusing on #AIRedTeaming to realized that I received this email from @OpenAI yesterday after all my jailbreaks,” he wrote. “I’m now on the get banned list!!!“