They Might Be Self-Aware Podcast (TMBSA) - EPISODE 47 SMASH THAT SUBSCRIBE BUTTON, YA OPEN SOURCE MAVERICKS! This week, we're peeling back the layers of what "open source" really means — spoiler: it's more lock and key than open sesame these days. Are AI models truly open source, or is it just make-believe marketing magic? Should companies play benefactor and pledge cash for open-source support, or does that just miss the whole open-source ethos? And what about that trusty robots.txt file — is it your website's knight in shining armor, or just the invisible cloak everyone ignores? We also debate those "Do Not Train" lists for AI models — ultimate privacy fortress, or a mirage in the desert of data mining? Plus, what happens when you dump your art into the internet void and it ends up training Skynet? Cue the existential dread. Just another wild ride here at They Might Be Self-Aware! For more info, visit our website at https://www.tmbsa.tech/
00:00:00 - Intro
00:00:22 - The Complexity Of Open Source Software
00:02:12 - Challenges In Reproducing Open-Source AI Models
00:11:49 - Monetary Support For Open Source Projects
00:19:00 - The Role And Relevance Of Robots.txt
00:23:06 - The Ethics Of 'Do Not Train' Lists
00:25:54 - Wrap Up
Daniel [00:01:36]:
Hunter. What is open source software?
Hunter [00:01:41]:
Open source software is software where the source code is available for anyone to view. It's often also free, but not always. I guess it's free to look at, but not always free to use. Sometimes you can use it for commercial purposes, sometimes you can't, and sometimes you can but with a condition: any derivations that you create, you must also open source. There's all sorts of different licenses, but I guess the key thing is you can review all of the code.
Daniel [00:02:12]:
Okay, yeah, I think that's a definition that pretty much everyone would be able to agree on, despite all the sub-licenses. And that's it: open source means you're allowed to look at the code. I've seen a bunch of papers, especially in the NLP space, because that's kind of where I am. Natural language processing, the intersection of computers and language. That's my vocation, that's my bread and my butter. There are a lot of papers out there that are at least ostensibly supposed to be able to be followed, like you could recreate the outcomes that the researchers reported.
Hunter [00:02:49]:
They also have source code on GitHub.
Daniel [00:02:50]:
Here's some source code. So I guess that makes it open source. But then, I think fairly importantly for the research community especially, you then have: and here's the data that we trained on, or at least here's the name of the data, and that might be a paid set of data somewhere. Way back in the day, for example, there was the Penn Treebank, a whole bunch of manually constituency-parsed and part-of-speech-tagged text that you could get. It was really expensive, but you could get it; a lot of educational institutions had it. And so you could have open source software, and here's a paper, and you are able to recreate our results if you follow these steps, but you do need this data for it.
Daniel [00:03:38]:
And I do think that that still is open source. Like, you're able to take a look at that, and sometimes the weights are available. Great, I can just use the end product, and that's nice, and you can build things with that. But I feel like there needs to be a different definition of open source where it's not just that you can see the source, but that anybody, without having to pay money to get behind some paywall, can get everything necessary to recreate the results in the paper, or the claimed numbers on whatever benchmarks are on the website. Here's the code to do it, and then here's the data that we trained on, and the training scripts, and so on. Like a truly, truly, truly open source. Open source with sources. Open source and source.
Hunter [00:04:26]:
So some papers will also distribute their model weights, but not the data. And I guess that's better than no model weights, because then you can at least run the examples and play with the tech a little bit. Yeah, but you don't have true reproducibility, because you don't have the inputs that they had.
Daniel [00:04:45]:
Right. Or, and I've seen this in more recent examples: here's the code, here's the weights, and here are fine-tuning scripts. Because large language models are so large, the average person's hardware can't recreate those original results. And so they say, we packaged up that thing that needed a zillion GPUs into a thing; here it is. And then here's something that can lobotomize it and retrain that last layer, for example, for your own use cases, which is very, very useful. But it isn't truly the case that someone could, if they had the hardware, actually recreate it themselves.
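A minimal sketch of the last-layer fine-tuning Daniel is describing, in PyTorch; the tiny "backbone" model, layer sizes, and data here are all made-up stand-ins for a real pretrained model and dataset:

```python
import torch
from torch import nn

# Stand-in for a downloaded pretrained model: a frozen backbone plus a
# trainable head. In practice the backbone would be a real LLM's weights.
backbone = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 128))
head = nn.Linear(128, 4)  # the "last layer" we retrain for a new task

# Freeze the pretrained weights so only the head learns.
for param in backbone.parameters():
    param.requires_grad = False

optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

# Toy fine-tuning data: random features and labels, purely illustrative.
x, y = torch.randn(32, 64), torch.randint(0, 4, (32,))

for _ in range(10):
    logits = head(backbone(x))
    loss = nn.functional.cross_entropy(logits, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The point of the sketch: distributing weights plus a script like this lets anyone adapt the model to their own use case, but nothing in it lets you recreate the original training run.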
Hunter [00:05:24]:
I've also seen some discussion recently about companies leveraging open source in their marketing material without any real intention of being open source. Oh yes, it's open source, but so much of it is missing, or it's such a crippled version compared to the commercial version that they sell, that they're violating the idealistic notion of open source, the dream, the ideal, and bastardizing it to a certain degree. But I guess the Open Source Initiative is to the rescue, because they have a new official definition.
Daniel [00:06:02]:
Well shoot, I'm glad somebody came up with something. What is it?
Hunter [00:06:05]:
Well, I actually don't know their full definition, but the biggest change is that for a machine learning model to be labeled as open source, it requires the distribution of the training material as well. At least per them, and they're an industry leader, you can't label it as open source otherwise. We talk about open source AI a lot, and it's not open source if you can't reproduce the full thing. Most of the open source models that we talk about, when we say open source, it's really just that you can run them on your own hardware.
Daniel [00:06:36]:
Right, right, yeah, exactly. So OpenAI has GPT-4, among other models, and you're not allowed to download GPT-4 and run it yourself on your own hardware. But you're also not allowed to...
Hunter [00:06:51]:
Crazy. Like, we wouldn't say, oh, Grand Theft Auto is open source because I can run it on my own computer and technically I can mod it a little bit. That is a bastardization of the term. It is not open source. Yet we've sort of accepted open source AI as meaning, well, I can run it on my own computer, which is super cool, and I don't want to dissuade that in any way. It's just probably not the right term.
Daniel [00:07:14]:
That's a really interesting way of looking at it. I suppose there is, like, a set of the Llama code out there that you're able to look at. But you're right, most people are going to download a sort of precompiled set of weights that they're going to put into one of various front ends, which are themselves possibly open source or not, and then run a local large language model, as opposed to, I'm looking at the backpropagation algorithm for how Llama actually does it. Very few people care about that specific thing. But okay, so this Open Source Initiative is saying you've got to give access to the training data, to the full code, to the model settings, probably even to, like, the random seed that things were started off of. The idea being you should be able to hit play on this thing and actually recreate it. But especially for the really big models, it isn't necessarily just trained from hit go, wait for 300 million GPU hours, and then the model comes out. They're often stopped at individual checkpoints. Sometimes weights are changed partway through.
Daniel [00:08:21]:
And sure, you could recreate that as well. But the point is, this is not like taking a bag of popcorn, throwing it in the microwave, hitting go, and then popcorn comes out. So now we're moving farther away from the simpler "it just is open source" into, well, there are all these caveats, and so on.
Hunter [00:08:40]:
That's another good point. A lot of these models aren't as deterministic as you would like them to be. You can run it 100 times and you'll get a slightly different version 100 times. That's right. And I guess there's some potential of also including every random number that was used throughout, to somehow seed it to get the exact same output time after time. But yeah, it's not an easy problem to solve. Not to mention all of the issues with the data that we know all of these models are training on.
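A minimal sketch of the seeding Hunter is gesturing at, assuming Python with NumPy and PyTorch; pinning every random source makes a toy run repeatable, though full determinism on GPUs takes more than this:

```python
import random

import numpy as np
import torch

SEED = 42

# Pin every random source a typical training run touches.
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

# Prefer deterministic kernels where they exist; warn_only avoids
# crashing on ops that have no deterministic implementation.
torch.use_deterministic_algorithms(True, warn_only=True)

# Two runs from the same seed now produce identical tensors.
torch.manual_seed(SEED)
first = torch.randn(3)
torch.manual_seed(SEED)
second = torch.randn(3)
assert torch.equal(first, second)
```

Even with all of that pinned, reproducing a frontier model would also require the exact data order, hardware, and any mid-run weight changes, which is Daniel's checkpoint point above.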
Daniel [00:09:13]:
Yeah, not knowing where this data comes from is a feature and not a bug, I think, whenever some companies start getting into hot water of, oops, it turns out we've been training on a lot of copyrighted material.
Hunter [00:09:27]:
For example, to use Meta as an example, we know it has been decided that they can't distribute every article from the New York Times over the last 10 years. That's against the law; that would violate copyright, and there are lawsuits over this. But there is currently a belief, a disputed belief, but a belief, that they can distribute the final weights that were trained off of that copyrighted material. So getting to a point of that...
Daniel [00:09:56]:
Same set of legal problems ongoing. Yeah.
Hunter [00:09:58]:
Where models, where there is even an opportunity for this, is challenging. It probably exists more in that fine-tuning sector that you were talking about: okay, I'm starting with this foundational model like Llama, and then I've built up this training set that I'm using to fine-tune it, and the potential for that being fully open source, where I can actually review all the material, that's plausible at least. But everything? I don't know.
Daniel [00:10:27]:
Talking about what is or isn't open source does make me also think. We've talked about generative AI and the possibility of putting people, especially artists, for example, out of jobs. And that's, in my opinion, a bummer. Maybe not everyone's opinion, but it certainly is mine.
Hunter [00:10:44]:
I know. You and I, our art careers? They're both suffering.
Daniel [00:10:47]:
My art career. Yeah, absolutely. Why not apply the same sorts of discussions that we've had, which is: is it UBI? Is it that everyone who contributed to the model should be paid a little bit, based off of the money that's coming in for the services built on these models? I think a more equitable end solution should presumably support the people who made the thing that is being made available to other people. And so for open source projects, a lot of the people who make these things have, like, hey, if you want to buy me a coffee, here's a link to send me a couple of bucks on PayPal. And presumably very, very few people actually maintain those sorts of best practices, should we call it, of I will, you know, actually pay for WinRAR, for example. But in San Francisco very recently, there is the Open Source Pledge group that launched earlier this month, in October 2024, to date us here.
Daniel [00:11:49]:
Some companies have signed up for this group, and what they are saying is that if a company uses open source code, it should pledge $2,000 per developer a year to support the projects that develop that open source code. And I think, at least in a sort of lofty, idealized way, I really like this. A company that is making money based off of the efforts of people who made something freely available, it does sort of go to say, at least as far as my ethical compass is concerned, that it should give something back. Technically, those people made it available for free; I understand that. But like, if you aren't living paycheck to paycheck, you're not going to be stealing your Netflix anymore. You're going to sign up and actually pay the money for it, or Spotify, or this or that. Some people might always pay for it, some people might always steal it.
Daniel [00:12:41]:
But the idea is, as you become less financially insecure, certain things that maybe you wouldn't have paid for, you say, you know what, I will buy WinRAR after all of these years. Or whenever. I'm not buying...
Hunter [00:12:57]:
Are you trying to sell me WinRAR right now?
Daniel [00:12:59]:
As I understand it, JavaScript has a gajillion libraries that underpin a gajillion other libraries. And there's, like, one maintainer in North Dakota somewhere who, if he were to stop working on that one thing that he's been maintaining for years, all of modern infrastructure collapses. There's a famous xkcd comic about this. And something like this has happened, where somebody said, I'm no longer going to maintain this JavaScript package, and then, I think, sold the rights to a Chinese company, which immediately turned it into malware. What if you could make a living out of: I made this really important open source piece of software, and I get point-whatever percent of the money that's being given by companies to this open source initiative, which then doles it out based on some sort of percentage allocation according to what your contribution is? I feel like that's a nice...
Hunter [00:13:56]:
But then you create a company. Like, at that point you're not an open source project anymore. So first of all, I think it's a great idea if companies want to contribute back. It makes sense: they're getting a ton of value from these projects, so help these projects continue to improve. I like the angle more when the companies allow their employees to work on the open source projects and contribute back to them in that way.
Hunter [00:14:20]:
There's something purer about that type of contribution to me, rather than, okay, here's $2,000 to do what you want. Because it's not a commercial venture. And I think that there's purpose, right, some value, in keeping it separate. Commercial ventures have to answer to commercial interests; open source initiatives don't. They can try to create the perfect solution, or one person's idealized solution, outside of commercial interests. And so once we bring those commercial interests back in, it corrupts it a little bit for me. Again, I'm not saying don't do that.
Daniel [00:14:59]:
Even if that software is being used to make paid software, because again, there's a lot of different licenses. But if this is a truly, truly, truly open source license, it's just provided, no license fees, no anything, use it at your own risk. But then someone makes money off of that. I guess you're saying, like, the person made it open source, i.e., them washing their hands of any monetary...
Hunter [00:15:22]:
Yeah. If they want to make money off it, create a commercial version of it, or create a company that offers it and customizes it for money. But don't get into the open source game in order to solicit donations from everyone and think, well, that's how I'll make a living. Like, there is a way to make a living with software, and your software can be open source.
Daniel [00:15:44]:
I do it.
Hunter [00:15:45]:
Yeah, you can have your code be freely available, but I just wouldn't marry the two, as if an integral part of open source were continued economic support from all these people. I think it can exist independently of that, and no shunning of it. Great, I think it's incredible if a company is sending millions back to these projects and helping these people out. Incredible. I just don't think it should be in any way required. And I don't want to see the tip screen every time I clone a repo off of GitHub.
Hunter [00:16:18]:
How much would you like to tip this person? Yeah, and I think that by doing that it corrupts it a little bit. Yeah, people would get tips.
Daniel [00:16:26]:
So to then extend that back into the other spaces, with large language models and other generative AI type scenarios: if you have art that you've made, for example, and it's on, you know, DeviantArt or any other website, you've made something freely available for people to look at online. I guess actually where I'm really going with this is that there isn't an equivalent of open source for someone writing a comment on Reddit, and there's not an equivalent of open source for...
Hunter [00:17:06]:
There's Creative Commons licensing, I think.
Daniel [00:17:07]:
Right? Well, yes, and that's usually part of the terms of service of the thing that you signed up for. And that disconnect is, I think, why a lot of people get upset. But if you make, like, I drew this picture and I put it online, and then it gets sucked up into some, you know, AI art generator, I think you can legally get upset if there was a license attached to that, like this is attribution-required, or non-commercial use, or this or this. Of course, you then have to prove it in court, which is, I think, part of a lot of the ongoing legal cases against these companies.
Hunter [00:17:43]:
Right. And the current case law is that you are allowed to consume that image as long as you didn't have to sign into a service where you may have accepted terms of service and entered into a legal agreement. But assuming it's available on the public Internet, you are allowed to download it. You are not necessarily allowed to then freely use it, unless they give you those rights in some sort of license.
Daniel [00:18:07]:
Right.
Hunter [00:18:07]:
And then on the AI angle, at this point, we're saying, suggesting, that you are able to use it in a generative form. You're able to generate things off of information gleaned from it; you just can't recreate it.
Daniel [00:18:18]:
I don't know, maybe more restrictive or specific licenses could possibly help out with that. Not just someone saying on their website, I do not allow you to download and use these; it would be in the license attached to them whether or not you can. Maybe there has to be some sort of case law, some sort of something, where you can say the metadata on this image states you're allowed to look at it, but you're not allowed to train something off it. And I guess that is dependent on the existing set of cases that are working their way through our system right now going in favor of the artists instead of the companies that are basing things off of the stuff that they downloaded.
Hunter [00:19:00]:
Right. And we have a system for doing that today. It's kind of the honor system. There's this file called robots.txt, which plenty of websites have and which is being ignored plenty. So pick a random website that you like, go to that website, /robots.txt, and you'll probably find a mysterious file that describes how they would like primarily search engines, but increasingly large language model scrapers as well, to interact with their content. And so people have developed a language that they can put in there that the machines can follow.
Hunter [00:19:35]:
However, yeah, they aren't really following it. Perplexity perhaps being, or certainly being, one of the most noted offenders. I don't know if they're the worst offender, but one of the most noted: oh, we outsourced that part, and they must have forgotten to read the robots.txt file. But wait, don't worry, it's in our backlog somewhere.
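A minimal sketch of what that robots.txt language looks like and how a well-behaved crawler checks it, using Python's standard-library urllib.robotparser; the site URL is a placeholder, and the directives shown are a typical but hypothetical example:

```python
from urllib.robotparser import RobotFileParser

# A typical robots.txt might read (hypothetical example):
#
#   User-agent: GPTBot        # OpenAI's training crawler
#   Disallow: /               # please don't crawl anything
#
#   User-agent: *             # everyone else, e.g. search engines
#   Disallow: /private/
#
# A well-behaved crawler fetches and honors this before scraping.
parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")  # placeholder site
parser.read()

if parser.can_fetch("GPTBot", "https://example.com/some-article"):
    print("allowed to fetch")
else:
    print("robots.txt asks us not to fetch this")
```

Nothing technically enforces any of it, which is exactly the honor-system problem Hunter is describing.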
Daniel [00:19:55]:
We're going to get around to not DDoSing a bunch of sites that we're scraping stuff from, potentially illegally.
Hunter [00:20:02]:
Yeah, it's less about the DDoSing and more about just training on all of their material, or taking their material and presenting it to your users without them ever coming to the original site.
Daniel [00:20:11]:
There's a lot of asking for forgiveness well after the fact, once your product is out and you're making money off of it, instead of asking, are we allowed to actually use this data? And that kind of comes back to the open source discussion. If the law were to eventually move in this direction, we would then presumably need to be able to see, whether behind closed doors with legal briefings, someone somehow printed out everything, or maybe it's just a big thumb drive, here is everything that OpenAI's newest model was trained on, as text files, as opposed to an open source model where you're allowed to see literally everything that went into it. I think one of the reasons the large language models especially aren't being open sourced isn't just the fear of where some of that stuff might have been and how copyrighted it is. It also could be: what if they have bought and paid for a bunch of other data sources too? And that bought-and-paid-for set of data sources could have been preprocessed to remove personally identifiable information. For example, let us suppose you are a data broker and you have a zillion pages of, let's not say medical, it's like the white pages or yellow pages type of thing.
Daniel [00:21:32]:
You can go buy a copy of, like, yellow pages information, and it's just a zillion, you know, businesses and a bunch of metadata about them, addresses and phone numbers and so on. Presumably, and I'm going to make a guess here, OpenAI and Mistral and Anthropic and so on bought and used that, among a great many other data sources, but then they anonymized the information that would be legally unpalatable to surface. So like, boy, I'm going to ask ChatGPT, and I sure hope it's not going to say, and here's Daniel's phone number. That's presumably the kind of thing where, when they ingested it, they said, hey, personally identifiable information, obfuscate or remove that. But it still could be part of a paid piece of data that they had. And so to be open source, you would have to say, here is all of the stuff, and it isn't necessarily in the form that made its way into the training regimen.
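A minimal sketch of the kind of PII scrubbing Daniel is guessing at, as a regex pass over ingested text; the patterns and the sample record are illustrative only, and real pipelines use far more sophisticated detection:

```python
import re

# Illustrative patterns only: real PII detection is much more involved.
PHONE = re.compile(r"\(?\b\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}\b")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")

def scrub(text: str) -> str:
    """Replace phone numbers and emails with placeholder tokens."""
    text = PHONE.sub("[PHONE]", text)
    return EMAIL.sub("[EMAIL]", text)

record = "Call Daniel at 555-867-5309 or email daniel@example.com."
print(scrub(record))
# -> "Call Daniel at [PHONE] or email [EMAIL]."
```

A filter like this can sit on ingestion, on output, or both, which is also what Hunter's later "the regex filtered it out" quip is getting at.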
Hunter [00:22:31]:
I saw a prompt on Twitter that was basically prompting the LLM to create a CIA dossier on you based on all of the information that it knows about you. And then people were posting some of the results, which were, some of them were insightful. I tried it; I didn't get anything back. But give it time. Now, if you were able to go and elect that no AI can ever train on anything that you have generated, would you sign that, you know, do-not-call list, the do-not-train list? Would you put yourself on it?
Daniel [00:23:06]:
Yeah, I think I would. I don't see any particular reason not to. The reason why I say I would put myself on the do-not-train list is I do think that people need to have some sort of right to privacy, especially in a world where we're increasingly able to be spied upon, electronically or otherwise.
Hunter [00:23:24]:
I think, here's the...
Daniel [00:23:25]:
I want some sort of privacy at some point.
Hunter [00:23:28]:
Here's the rub, though. What if, when you put yourself on that list, you're then not able to use any models that are leveraging all of the world's, every individual's, data? So yeah: an artist, you can keep your art out. An individual, you can keep your words out. New York Times, you want none of your articles to ever be trained on? You can do that, but you can never use any models. What is true for you must therefore be true for everyone.
Daniel [00:23:54]:
Interesting. So me saying, like, I want you to train on me. Well, isn't this frankly what happened in the EU with Apple very recently, where they said, we've got AI, Apple Intelligence, now, and EU, you're not allowed to have it because you have rules about privacy? I think that's literally the thing that you just said, right? They're not allowed to use it because they aren't allowed to be trained off of...
Hunter [00:24:21]:
Yes, that's certainly part of it. But I'm also trying ChatGPT...
Daniel [00:24:25]:
By the way, it says it doesn't have access to personal details like phone numbers. Maybe I didn't ask the right way, but at least a base search of, like, here's my name and a couple of facts about my past, do you know my phone number? It says no.
Hunter [00:24:36]:
It brought it back, and then the regex filtered it out, probably. But it's similar. I think it goes back to the open source question a little bit, in that, as a consumer of open source software, do you have some sort of obligation to support it, to give back? As a consumer of these quote-unquote open source models, or freely available models, do you have an obligation, certainly not a legal one at this point in time, but a moral obligation, to submit your personal identity to them, because you are reaping the benefits of so many others' personal identities?
Daniel [00:25:14]:
No, I still don't think so.
Hunter [00:25:16]:
I'm reaching.
Daniel [00:25:17]:
Right? You're basically taking that same argument of, shouldn't I have to at least give it money or something, and saying, okay, well, what about your information? Because, I get it, data's the new oil. You know, I do suppose that you've now talked me back around to it, which is: if you do make something truly open source, with the appropriate license of, literally anyone's allowed to use this, yes for commercial purposes, or no, not for that, then just adhere to what that license is, because that's why they chose it. And if they said this is really available for anybody to use, no matter what, this is my contribution to humanity, all right, there you go.
Hunter [00:25:54]:
Now just imagine if you make something that is truly self-aware. There's no telling, but we'll still try to tell you on, perhaps, even the next episode of They Might Be Self-Aware. You also mentioned Apple Intelligence. Let's dive into that on the next episode, because it's out now, and it sucks, or I just haven't gotten into it enough. I'll give you an update, but let's talk about that next episode, okay? Speaking of the next episode, you want to know about it? Hit that like button, or subscribe, or drop a comment so we can come back and tell you what is going on. We're on all the major platforms, whether it's YouTube to watch us, Apple Podcasts to listen to us, maybe a little Spotify, maybe your favorite podcast directory and client. Just type in They Might Be Self-Aware and you will find us, because we are here for you. And you've been listening, too.