@vrighter@ylai
That is a really bad analogy. If the "compilation" takes 6 months on a farm of 1000 GPUs and the results are random, then the dataset is basically worthless compared to the model. Datasets are easily available, always were, but if someone invests the effort in the training, then they don't want to let others use the model as open-source. Which is why we want open-source models. But not "openwashed" ones, where they call it "open" while allowing only non-commercial use, no modifications, and no redistribution.
I would consider the "source code" for artwork to be the project file, with all of the layers intact and whatnot. The Photoshop PSD, the GIMP XCF or the Krita KRA. The "compiled" version would be the exported PNG/JPG.
You can license a compiled binary under CC BY if you want. That would allow users to freely decompile/disassemble it or to bundle the binary for their purposes, but it's different from releasing source code. It's closed source, but under a free license.
What counts as source, and what doesn't, would depend on the format.
You can create a picture by hand, using no input data.
I challenge you to do the same for model weights. If you truly just sit down and type away numbers in a file, then yes, the model would have no further source. But that is not something that can be done in practice.
I think technically, the source should be the native format of whatever image manipulation program you use. For vector graphics there is the SVG format, but the native editor's format is still preferable. Otherwise, whoever gets the end copy cannot easily modify or reproduce it, only copy it. But of course it depends on the definition of "easy" and a lot of other factors. Licensing is hard, at least for me, because I am not a lawyer.
The situation is somewhat different and nuanced. With weights there are tools for fine-tuning, LoRA/LoHa, PEFT, etc., which makes the situation different than with binaries for programs. You can see that despite e.g. LLaMA being “compiled”, others can build on it to make models that surpass the previous iteration (see e.g. recently WizardLM 2 in relation to LLaMA 2). Weights are also architecturally independent to a much larger degree than binaries (you can usually train or run inference on GPU, Google TPU, Cerebras WSE, etc. with the same weights).
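To make the fine-tuning point concrete, here is a minimal sketch of the low-rank update idea behind LoRA, in plain Python with toy sizes. The function names and the numbers are made up for illustration and this is not the API of PEFT or any other library. The released weight `W` stays frozen and only the two small matrices `B` and `A` would be trained.

```python
# Toy LoRA-style sketch (hypothetical names, not a real library API).
# Effective weight: W + (alpha / r) * (B @ A), with W left untouched.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def lora_effective_weight(w, b, a, alpha):
    """Combine the frozen weight W (d x k) with the adapter B (d x r), A (r x k)."""
    r = len(a)                      # rank = number of rows in A
    delta = matmul(b, a)            # d x k low-rank update
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def trainable_params(d, k, r):
    """Return (adapter parameter count, full matrix parameter count)."""
    return r * (d + k), d * k

# A frozen 2 x 3 "released" weight and a rank-1 adapter.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
B = [[1.0], [2.0]]                  # d x r
A = [[0.5, 0.0, -1.0]]              # r x k

W_adapted = lora_effective_weight(W, B, A, alpha=1.0)

# For one 4096 x 4096 matrix at rank 8 (sizes chosen only as an example).
adapter, full = trainable_params(d=4096, k=4096, r=8)
print(W_adapted, adapter, full)
```

With those example sizes the adapter has 65,536 trainable numbers against 16,777,216 in the full matrix, which is roughly why people can adapt released weights without having the original training setup.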
How is that different from e.g. patching a closed-source binary? There are plenty of community patches for old games, e.g. to make them work on newer hardware. Architectural independence seems irrelevant; it's no different from e.g. Java bytecode.
This needs to have multiple levels of "openness" to distinguish between having access to the code, the dataset, a documented training procedure, and the final weights. I wouldn't consider it fully open unless these are all available, but I still appreciate getting something over nothing, and I think that should be encouraged.
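As a rough sketch of what such levels could look like in practice, here is a small Python record of the four artifacts listed above. The class name, field names, and labels are all hypothetical and are not taken from the OSI or from any existing standard.

```python
from dataclasses import dataclass

# Hypothetical "openness" record; names and labels are illustrative only.
ARTIFACTS = ("code", "dataset", "training_procedure", "weights")

@dataclass(frozen=True)
class Release:
    code: bool
    dataset: bool
    training_procedure: bool
    weights: bool

    def available(self):
        """List the artifacts that were actually published."""
        return [name for name in ARTIFACTS if getattr(self, name)]

    def label(self):
        """Fully open only if all four artifacts are available."""
        found = self.available()
        if len(found) == len(ARTIFACTS):
            return "fully open"
        if not found:
            return "closed"
        return "partially open: " + ", ".join(found)

weights_only = Release(code=False, dataset=False,
                       training_procedure=False, weights=True)
everything = Release(code=True, dataset=True,
                     training_procedure=True, weights=True)

print(weights_only.label())
print(everything.label())
```

A label like this says nothing about the license terms attached to each artifact, which would need to be tracked separately.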
The public benefit biz is embarking on a global series of workshops to solicit input from concerned parties on its Open Source AI Definition, which has been under discussion for the past two years.
There's concern that the legal language in existing OSI-approved licenses doesn't necessarily suit the way the machine learning models and datasets are used.
Terms like "program," when applied to machine learning models, refer to more than just source code and binary files, for example.
"AI is different from regular software and forces all stakeholders to review how the Open Source principles apply to this space," Stefano Maffulli, executive director of the OSI, explained in a statement.
The workshops will take place at various upcoming conferences in the US, Europe, Africa, Asia, Pacific, and Latin America through September.
Bruce Perens, who drafted the original Open Source Definition, told The Register that he was skeptical about the need to address AI separately.
The original article contains 420 words, the summary contains 154 words. Saved 63%. I'm a bot and I'm open source!
i feel like it's okay that they do this, but i don't like the term "source available". maybe something like "Free for Non-Commercial Use" or "FOSS-NC"?
The free software banshees will call it all proprietary… It’s not that it doesn’t make sense to draw different lines, but folks treat OSI with a lot of reverence, & if OSI says something doesn’t match its definition, folks won’t want to use it or release under these titles. “Source available” also gets roped in with the we-get-a-monopoly licenses & knocked down a peg, as if “open source” were the pinnacle of freedom, despite the Commons being ransacked by corporations that give back neither monetary support nor contributions for the labor.
Years ago I found myself explaining to Chinese Room dinguses - in a neural network, the part that does stuff is not the part written by humans.
I'm not sure it's meaningful to say this sort of AI has source. You can have open data sets. (Or rather you can be open about your data sets. I don't give a shit if LLMs list a bunch of commercial book ISBNs.) But rebuilding a network isn't exactly a matter of hitting "compile" and going out for coffee. It can take months, and the power output of a small city... and it still can't be exact. There's so much randomness involved in the process that it'd be iffy whether you get the same weights twice, even if you built everything around that goal.
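One small, concrete source of that randomness (my own illustration, and only one of several, alongside random initialization and data shuffling) is that floating-point addition is not associative. Parallel hardware may add the same numbers in a different order on each run, so the low bits of the result can differ. A minimal Python sketch, with a hypothetical helper name:

```python
import random

# Floating-point addition is not associative: grouping changes the result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
order_matters = left != right

def sum_in_random_order(values, seed):
    """Add the same values in a seed-dependent order (hypothetical helper)."""
    shuffled = list(values)
    random.Random(seed).shuffle(shuffled)
    total = 0.0
    for v in shuffled:
        total += v
    return total

# Same inputs, 20 different summation orders, mixed magnitudes.
values = [10.0 ** (i % 17 - 8) for i in range(1000)]
totals = {sum_in_random_order(values, seed) for seed in range(20)}

print(order_matters)
print(len(totals), "distinct totals from identical inputs")
```

Any differences are tiny per operation, but training repeats operations like this an enormous number of times and feeds each result into the next step, so two runs can drift apart.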
Saying "here's the binary, do whatever" is honestly a lot better for neural networks than for code, because it's not like the people who made it know how it works either.