OPTIMIZE YOUR LIFE AND SUBSCRIBE — NO BENCHMARK CHEATING REQUIRED

Is Meta's brand‑new Llama 4 only "state‑of‑the‑art" because it *trained on the test*? 🤔 In this episode of They Might Be Self‑Aware, Hunter Powers and Daniel Bishop dig into the evidence that Llama 4 was benchmark‑tuned, why top Meta engineers are distancing themselves from the release, and what it means for the future of AI evaluation. We also unpack OpenAI's whirlwind month—GPT‑4.1, the death of GPT‑4.5 (the model that *beat the Turing Test*), the rumored $3 billion Windsurf buyout, and Sam Altman's dream of the "10× developer."

🔔 Subscribe for two no‑fluff AI & tech breakdowns every week: https://www.youtube.com/@tmbsa

---

KEY TAKEAWAYS
* Meta's Llama 4 likely over‑fit to eval suites—benchmark scores ≠ real‑world quality.
* Massive resignations around release hint at internal disputes on ethics & transparency.
* AI benchmarks need a revamp; otherwise, every lab will "teach to the test."
* OpenAI's consolidation strategy (Windsurf, o‑series) mirrors Salesforce/Microsoft Office.
* GPT‑4.5's sudden shutdown sparks debate: are "too‑human" models being shelved?
* Expect 10× productivity tools, not mass layoffs—history shows workload expands.

---

LISTEN ON THE GO
• Apple Podcasts: https://podcasts.apple.com/us/podcast/they-might-be-self-aware/id1730993297
• Spotify: https://open.spotify.com/show/3EcvzkWDRFwnmIXoh7S4Mb
• Full transcript & links: https://www.tmbsa.tech/episodes/llama-4-caught-cheating-benchmarks-meta-under-fire

For more info, visit our website at https://www.tmbsa.tech/

#AI #Llama4 #OpenAI #GPT4 #BenchmarkCheating #TuringTest #Meta #TechPodcast #MachineLearning #Productivity #10xDeveloper
⏱️ CHAPTERS
00:00:00 – Metaverse banter
00:01:28 - Meta drops Llama 4: size, MoE architecture & first‑day hype
00:03:03 - “Cheating the test?” How Llama 4 climbed then fell on leaderboards
00:07:15 - Broken benchmarks, GPU tricks & lessons from 2000‑era graphics cards
00:11:16 - Should we trust today’s AI leaderboards? Transparency + corporate ties
00:16:15 - AB testing 101 and why secret “mystery models” exist
00:18:13 - Model chaos at OpenAI: GPT‑4.1, o‑series, mini models & naming mess
00:24:28 - OpenAI = Salesforce of AI? Windsurf acquisition & product sprawl
00:26:33 - Sam Altman’s “10× productivity” promise—what it really means
00:27:15 - Will coders vanish or just do more? History of tech‑driven expectations
00:30:55 - Conspiracy corner: GPT‑4.5 passed the Turing Test… then got axed
00:34:45 - Wrap Up
Hunter [00:00:39]:
Broadcasting live from They Might Be Self-Aware Studios, somewhere in the Metaverse. I am Hunter Powers, joined by Daniel Bishop.
Daniel [00:00:49]:
Also somewhere, also in the Metaverse.
Hunter [00:00:51]:
The Metaverse.
Daniel [00:00:51]:
It's the real world part of the Metaverse. Or does it have to be like, you know, virtual?
Hunter [00:00:59]:
What is it, the meat verse? Is that the real one?
Daniel [00:01:02]:
Oh, just gotta swap a couple of letters around the meat verse and the Metaverse. You heard it here first, folks. Brought to you live from the Meat Averse.
Hunter [00:01:14]:
Oh, I don't know. I mean we can't trust anything these days, can we?
Daniel [00:01:19]:
I don't know if I'm real. I don't know if I'm self aware. I don't know if the Metaverse can be trusted or the people behind the Metaverse can be trusted.
Hunter [00:01:28]:
Yeah, it's funny, we focus on the Metaverse because there is one company that just tries to put themselves at the absolute forefront of the Metaverse. You and I have spent a moderate amount of time in that Metaverse, which is not unavoidable footage. Oh yeah, we're not supposed to talk about that. Oh my gosh. But we are referring to Meta. And Meta released a few new models, and they were phenomenal on the benchmarks, to great, great fanfare. And these are more open-weight models.
Daniel [00:01:58]:
Llama, we're getting close to filling up an entire hand's worth of fingers of what Llama we're on. And shouldn't we all be super duper excited, because Llama 3 and 3.1 were big leaps, really, really, really good models. Now we've got Llama 4. That's one more, or 0.9 more, depending. Or 0.7, because I think there's a Llama 3.3. So surely it's, like, way better, and the benchmark results show that it is.
Hunter [00:02:22]:
Right. Well, the file size is certainly way bigger. It's a big.
Daniel [00:02:25]:
It's a chunky model.
Hunter [00:02:26]:
They're huge. While the previous three models were something you're.
Daniel [00:02:29]:
Going to run yourself. Yeah, you're hosting it yourself. It's open source, or open weights, open models. But, like, the average person can run this.
Hunter [00:02:36]:
Yeah, way too big. I mean, now way too big to run locally for an average user. But that's, that's okay.
Daniel [00:02:43]:
They're extremely good.
Hunter [00:02:45]:
Yeah, extremely good. A mixture of experts. I looked at the benchmarks, Daniel. Okay, I looked at the benchmarks.
Daniel [00:02:52]:
Really, really good top.
Hunter [00:02:53]:
They were number one for a very brief period of time and then they released and something was wrong.
Daniel [00:03:03]:
Wait, what?
Hunter [00:03:04]:
Well, the first signs were several people tendering their resignation as part of the release. And then I also saw LinkedIn updates. There were LinkedIn updates where people would say, I did not work on Llama 4. I just want to make that really clear. Right.
Daniel [00:03:20]:
Right around the time that this model's released, a bunch of like fairly well known names and like people farther up the chain that were associated with this said we're out and I had nothing to do with this thing. Which is a great sign when you have this big flagship model coming up, right?
Hunter [00:03:38]:
Yes. And really great sign. Using it and trying it out, the results are pretty, pretty bad and just. And well, bad is a relative term.
Daniel [00:03:49]:
Yes, absolutely incredible compared to three years ago. But, like, given its size, not actually that good. So how did it do so well on all of these benchmarks that everyone relies on and puts so much stock in?
Hunter [00:04:05]:
Well, it reminds me of the, the good old days of GPUs when they were used for video and the way to sell more of your video game processor was to put in code to optimize it for various GPU benchmarks. And this was something.
Daniel [00:04:24]:
That's disingenuous, Hunter. That almost sounds like cheating.
Hunter [00:04:27]:
I think it literally is.
Daniel [00:04:29]:
It's cheating at benchmarks. Why? Why would you be reminded of that? Well, in this instance, there are some.
Hunter [00:04:36]:
Very strong suggestions and evidence that that is exactly what Meta did and, gasp, trained specifically on these benchmarks and the results that they need. So we could say that they fine-tuned their model specifically for the test that they knew it was going to take, to give out answers that they knew would perform well on that. But those answers don't generalize into the real world. So yes, you can answer that specific question when asked that specific way, but once we start applying it to average things, it doesn't, it doesn't work. And they figured out a method of cheating, or overly optimizing, for the benchmarks. Now, I don't know, flashback sidebar in my head: I feel like I did this when I took the SATs.
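One way to make the "trained on the test" worry concrete: a common sanity check for this kind of leakage is to look for long n-gram overlap between training documents and benchmark questions. A toy Python sketch follows; the data, function names, and the 8-gram threshold are illustrative assumptions, not anything Meta or the benchmark maintainers have published.

```python
# Toy contamination check: flag a training document if it shares any long
# word n-gram with a benchmark question. Real decontamination pipelines are
# far more involved; this only illustrates the idea.

def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word-level n-grams in a piece of text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(train_doc: str, benchmark_questions: list, n: int = 8) -> bool:
    """A training document is suspect if it shares an n-gram with any benchmark item."""
    doc_grams = ngrams(train_doc, n)
    return any(doc_grams & ngrams(question, n) for question in benchmark_questions)

# Hypothetical usage with made-up data:
benchmark = ["What is the capital of France and when was it founded as a city?"]
suspect_doc = ("Q: What is the capital of France and when was it founded as a city? "
               "A: Paris, founded ...")
print(is_contaminated(suspect_doc, benchmark))  # True -> exclude or rescore
```

If a document like that slips into pretraining or fine-tuning, the model can ace the benchmark item without learning anything general, which is exactly the gap between leaderboard score and real-world quality the hosts are describing.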
Hunter [00:05:28]:
I took the SATs, the first time.
Daniel [00:05:30]:
Sorry, you're saying you cheated on the SATs?
Hunter [00:05:33]:
I am saying that. When I just took the SATs and applied my, all right, here's what I know thus far, I got a really bad score. Then I found out exactly what they test on. What do the questions look like? What are the subjects? You know, let me find the past tests. I crammed on those.
Daniel [00:05:50]:
Oh, sure.
Hunter [00:05:51]:
Increased my score by like 500 points. I mean, monstrous increase, which they say is not possible, but it absolutely is. I didn't have. I didn't have the generic knowledge. I just figured out how to do well on this one specific test because apparently it was important and is. That's kind of what Llama did? It doesn't. But, yeah, the models aren't very good.
Daniel [00:06:15]:
I did the same thing for the SATs, where you get the SAT prep book and you go through and everything. And I shifted 10 points up on math and 10 points down on reading, or the other way around. I got almost exactly the same score. It just, like, was a tiny move in either direction. Apparently, I was destined to have that score on the SATs. I was already optimized, shall we say. But so we've got Llama 4.
Daniel [00:06:42]:
It's too big to be useful to the average person, and it cheated at benchmarks. And you said earlier that it was at the top of the benchmarks, at the top of the leaderboards for a short period of time. It did tumble a smidge after, like, the actual publicly released version.
Hunter [00:07:00]:
Someone went back. Yeah, yeah. Just a few spots down to 32.
Daniel [00:07:04]:
Yeah. It's not even in the top quarter of 100 at this point. Quarter of a hundred being a normal number that people definitely use all the time.
Hunter [00:07:15]:
The rumor of how this came about is they were obviously, as they were developing this model, they were running it against benchmarks, and it just wasn't performing well in comparison to models that were already released. And so there was a lot of internal pressure to get those benchmark numbers.
Daniel [00:07:31]:
Up and without having to retrain an entire new model that would cost a bazillion dollars.
Hunter [00:07:38]:
What's. There's like. There's a famous quote about, like, metrics in business where, like, once. Once you establish a metric around it, it's. It's basically worthless because people will just.
Daniel [00:07:47]:
Over optimize for the. Exactly. And that's what people have been doing when it comes to these various benchmarks that are out there. There's just huge test suites of, like, here's a question, here's the right kind of answer, over and over and over, times many thousands, that these models run against. And you're supposed to only train on the training part and then test on the testing part. Or you can split it into, you know, windows, and I'll train on seven of them and then test on three, and you can, like, break it up.
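A minimal sketch of the split Daniel is describing: tune on one slice of the benchmark and report scores only on an untouched held-out slice. Toy Python with made-up items; the 70/30 split and the exact-match scoring are assumptions for illustration.

```python
import random

# Hypothetical benchmark of (question, answer) pairs.
items = [(f"question {i}", f"answer {i}") for i in range(1000)]

random.seed(42)
random.shuffle(items)
split = int(0.7 * len(items))
tune_items, test_items = items[:split], items[split:]  # tune on 70%, score on 30%

def evaluate(model, eval_items):
    """Fraction of held-out questions the model answers exactly right."""
    return sum(model(question) == answer for question, answer in eval_items) / len(eval_items)

# The model may be tuned against tune_items as aggressively as you like,
# but the number you report should come from test_items it has never seen.
```

Training on the test slice, which is the accusation here, makes that held-out score meaningless as a measure of generalization.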
Daniel [00:08:18]:
But the point behind that is as everyone becomes more familiar with what these data sets look like, just like you with the SATs, you can get better at doing that specific kind of thing. I think this is a double sided coin because on the one side those are designed to be the sorts of tasks that we want these models to be useful at. And so if you score well at those ostensibly you're going to have a useful model, a thing that people actually want to use. That's, that's what you're looking for. And then the flip side of that coin is by making everyone pretty samey and going toward this one set of benchmarks and you can add one or remove one here and there, but basically by tuning it in the direction of one specific, well, direction we're hampering I think the, these models from being useful in other ways. If you're going to really say you have to act this way and respond in this kind of way, you are then conversely shutting the door toward other kinds of outputs. Some of that is for very good reason and I think that it is yet to be explored. Like is this the right kind of thing? Because we have to tune towards something.
Daniel [00:09:32]:
These things have to have a score that they are getting the highest or lowest, depending on what score you're looking at. Like, they're trying to optimize a number. They, the computers that are making these models, have to go towards some metric. And I've seen some of these other newer benchmarks come out of, like, this is wacky, super hard, incredibly difficult math problems, or this is, yeah, real life challenges that aren't really just, like, answer text, you have to have more of, like, world knowledge. There's a variety of newer approaches that are coming out that are increasingly difficult. And again, if you look at this compared to, like, three years ago, it is absolutely insane that these can do any of it. But still, especially on the cheating side, I think, of we looked at the test data and used that to help train the model.
Daniel [00:10:28]:
That's not really going to help anybody. So the Llama folks, I think, have lost a little bit of their Llama luster in the public's eyes with this last release.
Hunter [00:10:39]:
Absolutely. And perhaps the benchmarks a little bit too. Right. You're going to have to put a more cautious eye towards these benchmarks. It's interesting.
Daniel [00:10:47]:
People were calling for more transparency and maybe there was a specific optimized version of the model that wasn't the real one that they released that was like making the benchmark in the first place. We have to. I mean, do we need, like the SATs? Frankly, that's what the benchmarks are like. We need something that we can rely on that has an external, like, observability component to. How good is this model, other than just I prefer this one over the other?
Hunter [00:11:16]:
I think that we do need to demand more transparency from the benchmarks before we can put the amount of weight in them that we have historically. Because there are increasingly, I don't know, a lot of corporate partnerships that appear to be present in these, these benchmarks. And how this works is the benchmarks announce there's a new secret model that's being tested. And it's usually one from.
Hunter [00:11:43]:
From X or from OpenAI or Google. But it has a code name. And over the next period of time, test it all you want, do all your experiments. So they're being paid to do this. And if the model doesn't perform well, they're never going to say what it was. And if it does perform well, then the magic reveal: all along you were using the new 4.1.
Daniel [00:12:11]:
I don't find a problem with that, not even a smidge. Because every site over a certain size does what's called A/B testing. Now, you've done some A/B testing in the past, Hunter. For our listeners, I'm sure everyone can get the vague idea of what A/B testing is from me just having said it. But as it relates to kind of our field, or even if you want to imagine, like, Facebook or Amazon's homepage having different versions, can you describe what A/B testing is?
Hunter [00:12:41]:
A/B testing is trying out two different versions of the same thing. Whether it's on an e-commerce site and you're trying out a different title for the product or a different description for the product or a different image for the product. For this podcast, we sometimes try out different podcast art to see which one gets downloaded more. We do these. These A/B tests.
Daniel [00:13:04]:
A/B testing.
Hunter [00:13:07]:
Once you get a statistically significant sample size, enough people have seen the experiment, where, okay, we showed version A of the podcast art to 500 people and version B to 500 people, and version A got downloaded 70% of the time and version B only got downloaded 10% of the time. At this point we determine, okay, A seems to be the version that performs better, and so we're going to go with A. And from there on out, we only show version A to maximize or optimize the probability that you are watching this show right now. In fact, people probably don't realize that they've been subtly manipulated into listening to us at this exact moment. They were probably A/B tested their way in here. I don't know. Should we disclose that?
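The "statistically significant" part usually boils down to a two-proportion z-test on the conversion rates. A rough sketch using Hunter's hypothetical numbers (500 impressions per variant, 70% vs. 10% download rates); a real experimentation platform may run a different test, so treat this as illustrative only.

```python
from math import sqrt
from statistics import NormalDist

# Hunter's hypothetical experiment: 500 impressions per variant.
n_a, downloads_a = 500, 350   # version A: 70% download rate
n_b, downloads_b = 500, 50    # version B: 10% download rate

p_a, p_b = downloads_a / n_a, downloads_b / n_b
p_pool = (downloads_a + downloads_b) / (n_a + n_b)      # pooled rate under "no difference"
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))  # standard error of the difference

z = (p_a - p_b) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))            # two-sided p-value

print(f"z = {z:.1f}, p = {p_value:.3g}")
# A gap this large on 500 + 500 impressions is overwhelmingly significant,
# so you ship version A and stop showing version B.
```

With closer rates or smaller samples, that p-value is what tells you whether the difference is real or just noise.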
Daniel [00:13:52]:
Welcome. Welcome to the Either A or B.
Hunter [00:13:55]:
Does that mean we cheated?
Daniel [00:13:57]:
No, that. That means that we found through experience what people's preferences are. And so that's the point that I was getting at, which is, whenever they say, here's our secret experimental model, we won't even tell you whose it is, then people immediately start asking it, like, so what's the name of your model? Who trained you? Oh, well, I'm an experimental model from Google. Cool. Now we know what that is. But beyond that, every company, even. I don't think. Are we a company? There's no, like, there's no great.
Hunter [00:14:27]:
There's vertical into people.
Daniel [00:14:28]:
Yeah, yeah. So even us, not even a company, we're doing A/B testing. And every bigger company certainly is going to have A/B testing of, like, on the Amazon homepage. Apparently there are dozens or hundreds of little A/B tests going on literally every day. Facebook, gigantic: for their marketplace, for the newsfeed, for everything, there are tiny little tests where you could not even notice that you're part of a test that's going on. Why wouldn't you? If you've got the base model trained for Llama 5 coming up next, of course, unless they get kind of axed. If you're going to be training this new version of this model and then you've got three, you know, variants of it.
Daniel [00:15:10]:
This one's been tuned to be a little more helpful in this way. This one's been tuned to use more emojis in its responses. This one's been whatever. Why wouldn't you release all three and see what people prefer and then release that one as this is what everyone should be using.
Hunter [00:15:24]:
Is that why OpenAI has 25 models to choose from when you open ChatGPT?
Daniel [00:15:30]:
I don't know.
Hunter [00:15:30]:
What, have you been in there? I know you kind of, like, said, I'm not going there, but do you ever just, you know, peek your head in to see what ChatGPT looks like these days? Or, like. No, it's a hard.
Daniel [00:15:39]:
No, no. I use ChatGPT still sometimes. I just canceled my $20 whatever a month subscription. I wasn't using it very much. And you can still use it some for free, so I do. OpenAI though, is really doing their darndest to stay, if not necessarily at the absolute forefront of all these technologies, stay as the household name. Like I don't know if it's more momentum than anything else at this point, but they've got some new models, they've got frankly a few of them that all kind of came out at once. And I'm.
Daniel [00:16:15]:
I'm reeling a little bit from there's a mini this, and there's o3, and we were just getting used to o1, and GPT 4.1, which came out after.
Hunter [00:16:24]:
GPT 4.5, which is now being deprecated.
Daniel [00:16:27]:
Which is already going away to make room for 4.1. The naming system is in absolute shambles. They're buying up other companies that do the exact same thing that's been available elsewhere. So we've talked about, like, I always talk about Cline; there's another fork of the program called Visual Studio Code, called Cursor. That's one of the, like, big names in very integrated into your development environment.
Hunter [00:16:55]:
It's the one that I use.
Daniel [00:16:56]:
Right. And so OpenAI is going to be grabbing a company called Windsurf, allegedly reportedly going to grab a company called Windsurf for an absolute ton of money for like $3 billion.
Hunter [00:17:08]:
$3 billion? Yeah, they had them on one of.
Daniel [00:17:10]:
So this, instead of making it themselves. And then they released, like, a terminal coding tool that does what a bunch of the other things have already been doing for months and months. They're not at the forefront anymore, but they were first, and they have enough mind share that I think OpenAI, slash, GPT, whatever at this point, like, people have heard of that and will go to that first. And so they are going to keep, instead of necessarily, excuse me, innovating themselves, they will also buy up a bunch of other tools and become the Salesforce of AI. That is my prediction: three or four years from now, OpenAI is absolutely still around. They are not necessarily making the forefront of anything, but they may still be doing that; that's just not really their main thing anymore. They are the AI company at that point, and they will have bought up everything in verticals and horizontals and be the place that you go to to just have the easy-to-access suite of everything. The Microsoft Office of AI.
Daniel [00:18:13]:
Despite the fact that Microsoft Office has a bunch of AI shoved into it these days. That is my prediction. I know we're not even halfway into 2025. I'm already going straight to the 2026 predictions at this point.
Hunter [00:18:24]:
That's, that's. I'm trying to think about that one. Okay, so OpenAI will become the Salesforce of AI. I think it's a really interesting idea. I don't want to believe it's true, because I don't really love Salesforce. I mean, it's great as a company, but as a technology and the.
Daniel [00:18:48]:
Right. But like if everyone already knows about OpenAI and you're like, I'm going to do an AI, I'll go to OpenAI, dot, whatever and then, well, I can do text chat. That was the first thing everyone already knows about that. I could do voice to voice. I'm talking to something with like voice. I could send over images and get those done. I could have it generate images. I could.
Daniel [00:19:12]:
I don't think they have a video model yet. But speaking of predictions, I'm going to give it, I give it four months. No, no, no. Fine. Yes. All right. Sora, I was more thinking something more accessible. No, they made it available to everyone, right?
Hunter [00:19:28]:
Yes, they've made it available to everyone.
Daniel [00:19:31]:
Okay, fine. So they've got Sora. They do have. See, I'd literally forgotten about Sora because it was like so long ago in such a not necessarily state of the art release by the time we finally got our hands on it. But, but we've got image, we've got audio, we've got text, we've got video though. I had forgotten about it. They don't have music yet. Most folks don't have music.
Daniel [00:19:55]:
They have their own deep research mode. I know. Like, I think Anthropic just had a Google suite integration come out the other day.
Hunter [00:20:03]:
Yes.
Daniel [00:20:04]:
OpenAI's in bed with Microsoft, so we're going to see lots of integration with all of their stuff, more than we already have. Why wouldn't OpenAI be the place that you go to to do AI? They've already done a lot of it. They're about to buy for a ton of money a company that is going to help you on coding, which is one of the main things that people like to use this for. When are they going to buy Figma or whatever? Like one of the sites that helps you, you know, design out the UI for something or help design, you know, flowcharts and whatever for. Are they going to buy Jira or Monday.com or make their own version of that where it's AI centric of, here's how you plan and iterate on your projects, here's how you design UI UX components. Why wouldn't they be that six months or a year from now, or three years or five years?
Hunter [00:20:56]:
Well, I think they wanted to buy Figma but when they learned they couldn't have dev mode. Did you hear that little, little debacle?
Daniel [00:21:03]:
Oh my gosh. Yeah. There's a term that. Hey, Hunter, what's Figma? We have to.
Hunter [00:21:10]:
Figma is a design application primarily focused on user interfaces. So whether that's a website or an app and it can be used in a broader sense than that, but that is its primary use case and it's web based, which was very novel at its time when it was introduced. Yeah.
Daniel [00:21:31]:
And Figma has what they call dev mode where you can take designs in a beautiful. I honestly really like Figma's like interface. I'm not a UI UX person at all. I'm. If I made it, it shouldn't be looked at, frankly, it should be used. No one wants to look at anything that I designed, like the images or outlines for. But Figma does a very good job of like, here's this beautiful design in dev mode, turns that design into code instead of just like here's what it should look like. Hey, programmers make that.
Daniel [00:22:06]:
They have a thing that turns it directly into code, which is neat. Now why is that a problem, Hunter?
Hunter [00:22:12]:
Well, another AI company called Lovable, which is one of the kind of. I'll call it a no code platform similar to v0.dev where you just go to it and you describe the application that you want to build and it leverages the large language models to build you something between a prototype and the actual application. All of it again starts with a prompt. Just describe your ideal program or the ideal screen in your program and it will build a prototype. That's probably a fair assessment. They might say it'll build the full application and they were sued or I guess sent a cease and desist letter, more technically speaking from Figma because they use the language dev mode in their app. Oh, now you're in dev mode because. Oh well, which is when they've finished completing the app and now you can edit it, which we bring it up just because it's a bit absurd.
Hunter [00:23:03]:
Although they do own the trademark site.
Daniel [00:23:05]:
And I don't see it. I don't see it with a circle. I don't see the C with the circle. I'm sure there's a few others out there. Maybe there is a filed trademark.
Hunter [00:23:12]:
They filed the trademark.
Daniel [00:23:14]:
But like, come on. I don't know. I. Some things I don't think need to be.
Hunter [00:23:18]:
Should we file some trademarks while we're at it?
Daniel [00:23:19]:
Absolutely. They honestly. They Might Be Self-Aware. TM.
Hunter [00:23:22]:
TM.
Daniel [00:23:23]:
Yeah. DM.
Hunter [00:23:24]:
TMI.
Daniel [00:23:26]:
Now I've been dunking a little bit on OpenAI not in like their capabilities. The 4.1 model is a good flagship model. It's also expensive. They do have a new mini model which is supposed to be really good and you know, on the cheaper side. And OpenAI does a good job of running the gamut of. Here's our really expensive version that's well.
Hunter [00:23:47]:
4.1 is, depending on the exact use case, like 25 times cheaper than 4.5, though it's not actually available in the ChatGPT interface; you have to use it through the API. But if you. And also, OpenAI has a really good history of slashing their prices dramatically. I wonder how real it is, because they also lose billions and billions of dollars every year. But it's still trying to get lost sooner.
Daniel [00:24:13]:
They're. They're keeping that mind share.
Hunter [00:24:15]:
Yeah.
Daniel [00:24:16]:
But as far as I can tell, none of the big players are making money at this time.
Hunter [00:24:19]:
But I want to circle back around again to the Salesforce idea. It aligns with another quote of Sam's where he, Sam Altman, the CEO where.
Daniel [00:24:28]:
He's just Sam around here?
Hunter [00:24:30]:
Oh, Sam. How's it going? You know Sam, Sammy. So he recently suggested that his goal is to make humans 10 times more productive and not to replace them. And I think that that concept does align with your Salesforce metaphor, or dream, of OpenAI, where they're just, like, building productivity tools for the masses. And while right now, today, the majority of them, though certainly not all of them, are focused on the tech space, that just sort of continues to expand. And then they're the productivity company, the AI productivity company, that is. Speaking of.
Daniel [00:25:13]:
OpenAI not being first to any particular party, he's not the first to talk about that 10x developer. That phrase has been around for a long time.
Hunter [00:25:22]:
But really, 100x developers too. That's. That's the new one people are talking about.
Daniel [00:25:25]:
Rolling my eyes real hard at that. I do think that there is something to AI as that 10x like I can take anyone who can already do code and make them tremendously more capable, much more output coming out. You have to be careful to make sure that people don't just rely on it and don't learn and don't like gain any new capabilities themselves. Because I think that having those capabilities really helps drive the model to do the thing most appropriately. But still, yes, I believe that Sam Altman is onto something where he says, rather than replacing coders, we want to take existing people and make them ten times more productive. The question there is like, well, okay, if we can make coders 10 times more productive, do we only need one tenth as many? My answer to that's probably no. Even though we've seen lots of other cases of like certain jobs evaporating away because of AI, I do think eventually we are going to equalize across what that is for all industries, coding and otherwise. And humans are not solids.
Daniel [00:26:33]:
You look at your own hand and you touch your face, you feel, I'm solid, right? No, humans are gases. We expand to whatever space we've been given. And that absolutely counts at a meta, not the company Meta, scale: a company scale, where an organization will fill up whatever amount of time and space and money that it has. And so if you can be 10 times more productive, the company that you work for can make more things or get more new clients or do more of whatever it does. Unless it is already at, like, its top of reach in the market, for example.
Hunter [00:27:07]:
Right.
Daniel [00:27:09]:
And I think that's how the company grows. And then you end up hiring more people because there's more work and more things to do.
Hunter [00:27:15]:
I think there's yet another really interesting idea in there, which is that as our productivity is increasing and we can now do more with less, the expectations of companies and of products will most likely increase with the exact same, I don't know, degree or acceleration. So just like we go back and look at a Nintendo video game that had a whole team of people working on it, and the result was something that was a little bit lackluster by today's standards. Right. Where today, if that's like the original.
Daniel [00:27:47]:
Nintendo, like if you look at like the earliest like a baseball game or a racing game compared to like the.
Hunter [00:27:53]:
Newest grand, a team of 20 engineers working on that. And then what would the expectation be today with a team of 20 video game engineers building a game? It would be wildly more technologically advanced, right?
Daniel [00:28:06]:
Yeah.
Hunter [00:28:07]:
It would seem impossible if we went back to Nintendo, that would take a team of a thousand engineers to build today.
Daniel [00:28:13]:
It would just be literally impossible like the, the horsepower wasn't there.
Hunter [00:28:17]:
Jobs didn't go away. The expectations increased because the technology allowed us to deliver more and that's, that feels more probable of what will actually play out. The expectations are just going to increase.
Daniel [00:28:31]:
Anecdotally, very much so here. I haven't been doing much in the way of machine learning model building in my day job lately; I've been doing a bunch more, like, full-stacky infrastructure building that hooks up to AI models and so on. But, like, things that are wildly outside of what my ordinary wheelhouse is, because I'm able to do these things where I kind of know what I need, and I've got the documentation and I've got ChatGPT or Claude or a Mistral model or whoever in front of me, and I can use that to say: I need to build this kind of piece of architecture. Here's what I know, here's what I don't know. Help me fill in the blanks. And I would never have taken a job that asked me to do those things a few years ago, and I would never have wanted to do that on my own. But with the tools that have been available to me, I can do, dare I say, 10 times as many kinds of tasks as the sorts of tasks that I had been doing. And I will tell you, one of the things that I did in, I'm going to say, about four business days, if I had decided I truly was going to sit down and do this and I was already competent at that task, we're looking at at least two weeks of work, probably three-plus weeks of work. So at the very bare minimum, a 2x.
Daniel [00:29:56]:
Except I'm not that sort of developer, and so I would have had to take months to learn all of the requisite technologies and get used to doing them, to do a bad approximation of what I did in, like, 3 days, 4 days. That's the 10x promise: to take someone who already knows a bunch about something and then give them extra capabilities.
Hunter [00:30:20]:
And as you get those additional capabilities, the expectations of you as an engineer at a company are just going to increase, as the expectations of the consumers of the product that your company is producing are going to increase. And perhaps we're not replacing any jobs, we're just building better products as technology advances, which is what has happened throughout history. Now, before we wrap this, I have a bit of a conspiracy theory, and I have not seen anyone point this out yet. So we're breaking. This is straight-from-the-mind breaking news.
Daniel [00:30:55]:
I can't wait to see the TikTok short cut that'll have the. The little, like, red ticker tape at the bottom.
Hunter [00:31:00]:
Breaking: his conspiracy theory. So, as many of our listeners know, and I know that you know, I am a member of the super exclusive, secretive Pro tier of OpenAI subscribers.
Daniel [00:31:14]:
Oh, OpenAI, sorry. Right.
Hunter [00:31:15]:
Yeah, no, it's very exclusive. You have to have $200.
Daniel [00:31:22]:
But turns out most of the exclusive clubs in the world are really just money gated, actually. Think about it.
Hunter [00:31:31]:
Okay, so as an exclusive member, VIP signature club.
Daniel [00:31:36]:
Right. Do you have roads?
Hunter [00:31:38]:
I get access to certain models that the general public does not have access to, and one of those is GPT 4.5. Now, this is. This isn't a secret. It was announced. It's out there. I've been using GPT 4.5 a lot, and I use it anytime I need to generate written content to be consumed by humans. It is my Claude 3.5. Now, hidden in the news of all of these model releases, and we already mentioned it very briefly. So OpenAI is coming out with all these new models, including 4.5.
Daniel [00:32:12]:
They're getting rid of 4.5.
Hunter [00:32:13]:
Is that. Yes, yes. Hidden in the news was, oh, and by the way, no big deal, but 4.5, we're gonna have to. We're gonna have to turn that one off because we need some GPUs.
Daniel [00:32:25]:
Yeah.
Hunter [00:32:25]:
And don't worry about it. Don't think about it too much.
Daniel [00:32:28]:
Just watch one last sunset.
Hunter [00:32:30]:
We're going to have to turn that one off. Now, when 4.5 came out, Sam tweeted about it. Or X'd. He X'd, tweeted, I don't know. That this model truly is something special. And he literally said, and we will never turn it off. It's in the tweet from when he released it. Now, I want to connect this with one news story that was in the news.
Daniel [00:32:57]:
He's turned off a lot of models in the past, and he said, on the record, we'll never turn it off.
Hunter [00:33:01]:
He's never turning it off. Now, I want to connect this with a news story. GPT 4.5 passed the Turing Test. It passed the Turing Test. That was the one that passed the Turing Test. The Turing Test is a test where a human is interviewing a model and has to determine, is this another human that I'm speaking with, or is it just an AI? And it's been around for a very long time. Named after the very famous Alan Turing. And GPT 4.5 passed it.
Hunter [00:33:30]:
GPT 4.5 was indistinguishable from talking with a human in these tests. And now.
Daniel [00:33:38]:
No, that's not true. It was judged as human more often than the humans were.
Hunter [00:33:43]:
Okay, sure.
Daniel [00:33:43]:
Passed the more-human-than-human test.
Hunter [00:33:46]:
I think that was a song.
Daniel [00:33:47]:
Yeah. So right there, and then magically. Sorry, they did Llama 3.1. They did an earlier GPT-4o. They took the original chatbot ELIZA from, like, the 1960s.
Hunter [00:34:00]:
That's a great.
Daniel [00:34:01]:
Or whatever it was.
Hunter [00:34:02]:
Still go back and use that one every now and then, by the way.
Daniel [00:34:04]:
Such a hoot. So they took all of those and also GPT 4.5. And GPT 4.5 was deemed to be the human in the human versus AI like comparison 73% of the time.
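For anyone wondering how that 73% figure gets computed, the setup Daniel describes boils down to tallying, per witness type, how often interrogators judged the witness to be human. A toy Python sketch with fabricated trial records; the real study's data and pass criterion are more carefully specified than this.

```python
from collections import defaultdict

# Fabricated trial log for a Turing-test-style study: each record is
# (witness, was_the_interrogator_convinced_it_was_human).
trials = [
    ("GPT-4.5", True), ("GPT-4.5", True), ("GPT-4.5", True), ("GPT-4.5", False),
    ("human",   True), ("human",   False), ("human",  True), ("human",   False),
    ("ELIZA",   False), ("ELIZA",  False), ("ELIZA",  True), ("ELIZA",   False),
]

verdicts = defaultdict(list)
for witness, judged_human in trials:
    verdicts[witness].append(judged_human)

rates = {witness: sum(v) / len(v) for witness, v in verdicts.items()}
print(rates)  # with this toy data: {'GPT-4.5': 0.75, 'human': 0.5, 'ELIZA': 0.25}

# The "more human than human" headline: the model's judged-as-human rate
# exceeds the rate for the actual human witnesses.
print(rates["GPT-4.5"] > rates["human"])  # True
```

The reported 73% for GPT-4.5 is exactly this kind of judged-as-human rate, which is why Daniel stresses it beat the humans rather than merely fooling people half the time.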
Hunter [00:34:16]:
Well, it was. But no more, because they're turning it off. You actually still have access to it. But I think Sam said maybe, like, six months more and then it's gone, because some people have done some integrations with it and they need time to move to one of the lesser models. But how's that for conspiracy? Something's going on there.
Daniel [00:34:33]:
Well, maybe they're going to turn it off to the public because they're just going to put it in a robot and say now it's just some guy.
Hunter [00:34:39]:
I don't know, man. Things, things are happening.
Daniel [00:34:41]:
I watched her, you know.
Hunter [00:34:43]:
That was a great movie.
Daniel [00:34:44]:
Yeah.
Hunter [00:34:45]:
And if you want to see how this all plays out, there is one place where you will inevitably find out about a week after it actually happens.
Daniel [00:34:54]:
OpenAI, the place where you go for all the AI stuff.
Hunter [00:34:56]:
No, They Might Be Self-Aware.
Daniel [00:34:58]:
Right, right, right.
Hunter [00:35:00]:
The award winning AI news.
Daniel [00:35:03]:
Well, TBD. But OpenAI, generate an image of an award for us.
Hunter [00:35:08]:
Yes, that 4o model is incredible. And yeah, if you haven't already, hit the subscribe button. We love all of our new listeners. We've been getting a lot of additional listeners recently and that's great to see. Send comments. We try to respond. Maybe not all of them, but most, most of them are like pretty great. There's a few, few, few comments out.
Daniel [00:35:27]:
There that might need some self awareness.
Hunter [00:35:30]:
Zero self awareness. Yeah. You can find us on every platform. Watch us on YouTube, listen to us on Apple Podcasts, Spotify, or just wherever podcasts are available. Search for They Might Be Self-Aware and you will find Hunter and Daniel twice a week on your feed, giving you all the insights you can handle and then some.