Thursday, June 19, 2025

Me and my Ai #6: Scraping the Bottom of the Barrel


The big corporates driving generative Ai have a problem. To feed their need for data and improve their GPT modelling, they’ve already hoovered up all the easy-to-access data on the internet. That, or they’ve taken a shortcut and simply hoovered up the modelling of other GenAi enterprises, as some US firms have alleged of Chinese firms.

It’s a thing, as Jonathan Shapiro highlights in his recent interview with internet analyst Brian Nowak. To survive, to stay competitive and keep eyeballs on sites, firms will have to invest more in gathering “first-party data.”

That data is increasingly unlikely to be found on the internet. What was there was scraped long ago, and the overwhelming majority of what’s new is the product of … generative Ais. The swathe of studies conducted this year all conclude that well over 50% of internet content is now Ai-generated. Shitting in the nest, pissing in the pool: are there any polite expressions for this sentiment? There’s not a lot of value in sampling data that was generated from the data sampled earlier.

Instead, firms will have to invest more in gathering first-party data. A “first party” is corporate-speak for a human. Before they can sell it to us, the corporates need to gather the data. From us.

Having drained the internet swamp (and then pissed in it), they’re looking elsewhere.

I don’t know if it’s happened to you recently too, but my personal machine can no longer open an MS Office document without a yellow banner splashing across it from one side to the other, exhorting me to save all files (including the one just opened) to Microsoft’s cloud product, OneDrive. There, I presume, in accordance with the tiny writing on page xx of the user agreement, that data will be scanned to improve the user experience, as legitimate “first-party data.”

(In response, I’ve started using LibreOffice. Its Writer program informs me of the format of the files I’m opening. This week, as I started editing a document authored by an analyst from a leading consultancy, I was amused to discover it had been written using Word 2007. A version perhaps so old that Microsoft has abandoned it, leaving users banner-free.)

Meanwhile, Google is scraping up first-party data in video formats. Jess Weatherbed writes that users can have Google’s Gemini Ai summarise videos for them, without having to actually watch all those pesky, boring presentations and seminars themselves. While the text feeds the GPT’s evolution, Google’s Ai-generated video avatars will, at the same time, be looking and sounding less like Scarlett Johansson.

If you’ve encountered other innovative (or corny, or even sad) attempts to harvest human (sorry, first-party) data, please tack them on in the comments.








