A Harsh Lesson About the World of Tech

There was a sad moment last week during the collapse of Frontier Math. Epoch AI had created this new benchmark to assess claims that the latest crop of AI models reason just like mathematicians with Ph.D.s who work on the research frontier.

Elliot Glazer wrote on Reddit:

Epoch’s lead mathematician here. Yes, OAI funded this and has the dataset, which allowed them to evaluate o3 in-house. We haven’t yet independently verified their 25% claim.

https://www.reddit.com/r/singularity/comments/1i4n0r5/comment/m7x1vnx/?&rdt=60522

Glazer was writing to acknowledge news that had been leaking out for several days. OpenAI secretly paid Epoch to develop the Frontier Math benchmark and in return, got privileged access to its problems and solutions.

OpenAI used its copy of the benchmark to do a self-assessment of its o3 model. The best previous score for any model was success in solving 2% of the problems on this benchmark. OpenAI claimed that o3 could solve 25%. For many people, this was literally unbelievable. Concerns about the integrity of the effort started to be voiced. Whistleblowers started to disclose what they knew.

I’ve omitted the last sentence from Glazer’s first paragraph. I’ll come back to it in the final section of this post.

In a new paragraph, Glazer continues:

My personal opinion is that OAI’s score is legit (i.e., they didn’t train on the dataset), and that they have no incentive to lie about internal benchmarking performances. However, we can’t vouch for them until our independent evaluation is complete.

What’s sad is Glazer’s insistence that OpenAI had no incentive to lie about the performance of o3. It is equally sad to think that he really believed that he would be allowed to complete an independent evaluation of o3, but this is derivative, a second consequence of a single tragic flaw.

Glazer does not understand the world in which he now works.

The Incentive to Lie

Glazer understands that people would have an incentive to lie if there were no risk of being caught. His point was that people who lie tend eventually to get caught, and when they are, the reputational damage that they suffer from public knowledge that they are willing to lie is so severe that it overwhelms any temporary advantage that they might get from the lie.

This line of reasoning makes perfect sense in the world of scholarship and science. (At least it used to.) If someone was caught lying about a research result, their career was over. No exceptions. No excuses. Glazer was trained in this world. He received a Ph.D. in mathematics from Harvard. The norms that his mentors instilled in him must have shaped his character and given him a default mental model about how others make decisions, a model in which people do not lie. In the terminology that I introduced in my last post, he suffered from base-rate-blindness to the possibility of intentional deception.

This is the sad part. When he wrote his post seven days ago, Glazer still seemed oblivious to the ways in which the predatory tech world that he now works in differs from the world of science. In tech, firms often have an incentive to lie and clearly do lie. It can provide crucial help for a firm that wants to emerge as the victor in a winner-take-all contest to dominate a market. To be helpful, the lie does not have to remain undetected forever. It is sufficient for it to mislead for a brief period of time, just long enough for the firm to achieve an unbeatable lead. When the dishonesty comes out later, the firm suffers no harm. Its dishonesty might leave many people with a desire to shun it, but it is the only viable firm left in the market. There is no alternative firm that principled consumers can switch to.

The problem with monopoly is not high prices. The much more serious concern is that monopolistic impunity is corroding the social norms of integrity and honesty that were helpful to the market system and essential to science and scholarship. In the last century, a carefully cultivated reputation for honesty was a prerequisite for interaction with the elite members of society. If you were revealed to be dishonest, everyone would shun you out of pure self-interest. Each person wanted to make sure that their own reputation for integrity was not tainted by interaction with someone known to be dishonest.

This equilibrium made people better than they are. Most were not saints. Many only pretended to be honest and would have been willing to lie if there were no chance of being caught. But most of the time, the easiest way to pretend to be honest is to tell the truth, so the equilibrium worked tolerably well and delivered a net social benefit.

After the norms of the Enlightenment and science spread throughout the Anglo-offshoot societies, in most areas of business and economic life, people could pretend to be honest and could associate with people and organizations that also pretended to be honest. The system was not perfect. When I was a child, the cigarette companies had already opted out and were fully committed to dishonesty. Still, the market worked reasonably well and science delivered.

Now, we pretend not to notice the flagrant dishonesty of the firms we transact with. We pretend not to care about the banality of corporate disinterest in individual integrity. We pretend not to worry about how careless everyone has become about associating with individuals and organizations known to be dishonest. No one will think ill of us for associating with them. Everyone does it. We have no choice.

Even those of us who retreat into the world of scholarship and science, where the traditional norms about integrity have always been strongest, ignore the mounting evidence that these norms are eroding. Once individual faculty members lose the automatic impulse to shun colleagues who are known to have lied or cheated, the equilibrium that supports integrity unwinds.

https://www.theatlantic.com/magazine/archive/2025/01/business-school-fraud-research/680669/

Because of his interaction with OpenAI, Glazer is being forced to come to terms with the reality of the new social system that tech firms have spawned. He cannot rely on reputational concerns to constrain the behavior of the tech leaders he works with. They can deceive with impunity.

And they certainly are not going to let him do an independent assessment of o3.

The Hold-Out Dataset

Here is the sentence by Glazer that I omitted up front.

To do so, we’re currently developing a hold-out dataset and will be able to test their model without them having any prior exposure to these problems.

Seven days ago, creating a set of problems that OpenAI had not seen before was still his plan for implementing an honest assessment of o3’s performance on the type of problems included in the Frontier Math benchmark: see how it did when asked to solve problems that no one at OpenAI had seen in advance.

The messaging about this hold-out problem set has been muddled. Muddle is the way to obfuscate. I want to be clear about the facts.

When word leaked out that OpenAI secretly controlled the Frontier Math benchmark, one official from Epoch suggested that a hold-out dataset already existed. In his Reddit post seven days ago, Glazer said that the Epoch AI team was still working on this dataset.

But in a blog post published two days ago, Epoch’s Director and Associate Director admit that Glazer will not be allowed to follow through. OpenAI owns the statements for all problems that Epoch writes. The organizations have agreed that Epoch will go through the motions of writing a set of 50 additional problems and withhold from OpenAI the solutions to these problems, but that OpenAI will own and have access to the problem statements for this set, just as it does for all other problems.

We are finalizing a 50-problem set for which OpenAI will only receive the problem statements and not the solutions.

https://epoch.ai/blog/openai-and-frontiermath

https://web.archive.org/web/20250124104114/https://epoch.ai/blog/openai-and-frontiermath

To be of any use, these problems must be solvable. If Epoch can get mathematicians to solve them, OpenAI can pay other mathematicians to solve them; or perhaps it could simply pay the same mathematicians who write them for Epoch to pass the solutions on to OpenAI.

What this means is that OpenAI and Epoch will create the illusion that there is a hold-out data set. Throughout the most recent blog post, the authors still refer to it as a “holdout” set. But for all practical purposes, OpenAI will have prior access to all the information it needs to make sure that its models can ace any test based on problems that belong to this so-called “holdout” set.

Moreover, no other firm will have the advantage that OpenAI will have. Epoch can use any problems it develops to assess models developed by other firms, but Epoch cannot disclose the problem statements to the other firms without written permission from OpenAI.

If you have ever taken a math test, you get what is going on here. Most kids just show up for the test and learn then what problems they have to solve. But rich-kid Sam gets advance copies of all the problems that could be on the test and arranges for lots of help preparing answers long before he has to take the exam. It is an obvious sham to claim that this is ok because a few of the problems that Sam learns about in advance did not come with ready-to-use solutions.

It does not surprise me that Glazer will not be allowed to do the objective assessment of o3 that, just seven days ago, he said he was still planning to do. I bet it came as a shock to him.

A Personal Note

Elliot (if I may), I think I can imagine how bad it feels to have entered into a venture in good faith, one where you think you can make a real contribution–a contribution to science–by providing an honest assessment of AI’s progress; and then to discover that you have been duped.

You are not the only one to experience this. As I suggested in my previous post, most academics are not prepared for life in a social system where tech predators set the norms.

https://paulromer.net/base-rate-blindness/

Send me an email and I’ll introduce you to Apollo Robbins. He might have some suggestions about habits and mindsets that will protect you going forward.
