It’s the Data, Stupid

Dan Geer came across this marketing thingy and sent it over. It serves to remind us that when it comes to ML, it’s all about the data.

Take a look at this Lawfare article we wrote with Dan about data feudalism.

Welcome to the era of data feudalism. Large language model (LLM) foundation models require huge oceans of data for training—the more data trained upon, the better the result. But while the massive data collections began as a straightforward harvesting of public observables, those collections are now being sectioned off. To describe this situation, consider a land analogy: The first settlers coming into what was a common wilderness are stringing that wilderness with barbed wire. If and when entire enormous parts of the observable internet (say, Google search data, Twitter/X postings, or GitHub code piles) are cordoned off, it is not clear what hegemony will accrue to those first movers; they are little different from squatters trusting their “open and notorious occupation” will lead to adverse possession. Meanwhile, originators of large data sets (for example, the New York Times) have come to realize that their data are valuable in a new way and are demanding compensation even after those data have become part of somebody else’s LLM foundation model. Who can gain access control for the internet’s publicly reachable data pool, and why? Lock-in for early LLM foundation model movers is a very real risk.
