“the secret list of websites”

The Washington Post does research to figure out which websites were used to train Google’s AI model:

To look inside this black box, we analyzed Google’s C4 data set, a massive snapshot of the contents of 15 million websites that have been used to instruct some high-profile English-language AIs, called large language models, including Google’s T5 and Facebook’s LLaMA.

Inside the secret list of websites that make AI like ChatGPT sound smart

My largest corpus of writing to date is on the web at css-tricks.com (along with many other writers), so naturally, I’m interested in seeing if it was used. (Plus, I’ve been seeing people post their rank as a weird nerd flex, so I’m following suit.)

CSS-Tricks.com ranks 8,182 among the websites used to train Google's AI model.

Despite Google employees’ serious misgivings (just little stuff like the information presented leading to “serious injury or death”), Google has publicly launched its Bard tool and is very serious about investing in AI.

Me, I just think it’s fuckin’ rude.

Google is a portal to the web. Google is an amazing tool for finding relevant websites to go to. That was useful when it was made, and it has done nothing but grow in usefulness. Google should be encouraging and fighting for the open web. But now they’re like, actually we’re just going to suck up your website, put it in a blender with all other websites, and spit out word smoothies for people instead of sending them to your website. Instead.

And while doing that, they aren’t:

  • Telling authors their content is being used for training
  • Telling users where the output came from
  • Offering any meter of how reliable or confidently correct the output is

So, I’m critical. It’s irresponsible.

But I’m not a neo-Luddite or whatever on this. It’s all certainly interesting. I like that these tools are almost immediately useful and overflowing with use cases. Heck, I needed a quick CSS rainbow gradient the other day, and the output from Bard was quick and useful. I’m a GitHub Copilot paying customer and I’m 100% sure it makes me a faster and better coder. I’m nervous about lots of things related to (massive air quotes) “AI” but I’m hopeful it can do some good.

On being critical though, here’s Manuel Moreale:

… I do enjoy reading news and discussions when politics and technology are both involved. I especially enjoy reading people’s perspectives on these topics. One thing I’m noticing more and more though, is that most people are quick to point out what’s wrong about something, but almost never offer solutions or alternatives.

And that is because complaining or pointing fingers is the easy part. Figuring out alternatives is hard.

Criticising is the easy part

So here’s what I’d like to see done:

  • Stop firing ethics people. What is it, three times now?
  • Be very open about what content a model is trained on, and at least allow people to opt out. Better yet: opt in.
  • Credit and link to the sources directly in the output where possible.
  • Operate this part of the business as profit neutral.



3 responses to ““the secret list of websites””

  1. Matt says:

    I particularly like the idea of crediting sources where it’s feasible. You shouldn’t have to rely on digging from the Washington Post to see if your work was used for training after the fact. Bit too much of a haveibeenpwned.com vibe for my comfort.

  2. Seirdy says:

    I added an entry to my robots.txt to block ChatGPT’s crawler, but blocking crawling isn’t the same as blocking indexing; it looks like Google chose to use the Common Crawl for this and sidestep the need to do crawling of its own. That’s a strange decision; after all, Google has a much larger proprietary index at its disposal.

    A “secret list of websites” was an ironic choice of words, given that this originates from the Common Crawl. It’s sad to see Common Crawl (ab)used for this, but I suppose we should have seen it coming.

    I know Google tells authors how to qualify/disqualify from rich results, but I don’t see any docs for opting a site out of LLM/Bard training.

    (POSSE note from https://seirdy.one/notes/2023/04/21/opting-out-of-llm-indexing/)

  3. FWIW my understanding of the C4 model is that it’s not Google’s, but an independent crawler foundation’s: https://commoncrawl.org/

    (Made this mistake myself after reading the Post’s article and then had to correct)

    Or am I missing some secret affiliation?
