• https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning/

    Good news, LFGSS is still in the top 50K websites on the whole internet.

    Bad news, LFGSS is included in the training models of nearly all chatbots because we're in the top 50K websites on the whole internet.

    Do we care?

    Would you prefer your words were not used to train AI bots? Or are you OK with this?


  • How else will they learn when to use the word cunt?

    Are they easy to ban?

  • Are they easy to ban?

    Everything is easy to ban

  • The caveat to "it's easy to ban" is that there are side effects.

    For example, today someone like Google sends a web spider to crawl a page, and that single scrape is then used to:

    • Index the page
    • Provide a search index
    • Train the AI
    • Attempt to establish (with little context) an advertising-useful profile and intent
    • Examine HTML and tech of websites
    • etcetera

    Microsoft Bing has recently massively increased the amount that they crawl websites; I've observed a 10x increase on LFGSS. So I strongly suspect they're doubling down on AI and seeking to train things too.

    If we block the aggressive crawling then we gain more privacy, lower costs (or rather the heavy crawling may increase costs over time as it's doubling our daily volumes), your content remains yours and isn't used to train AI.

    The downside is that we're going to start disappearing from search indexes. But that's OK IMHO, as this site is old and established and we're not desperately trying to increase awareness, plus not being in search indexes has the advantage of not attracting spammers.

    That's the wider context: banning nearly all bots will hurt us (not being in search indexes), but perhaps that's an upside?
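A rough sketch of what a per-crawler ban could look like in practice, assuming it is done through robots.txt. The policy below is hypothetical and is not LFGSS's actual file; the user-agent tokens and the example.com URL are illustrative. It uses Python's standard `urllib.robotparser` to check which bots such a policy would turn away.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy (an assumption, not the site's real robots.txt):
# block two named crawlers site-wide and leave every other bot alone.
ROBOTS_TXT = """\
User-agent: bingbot
Disallow: /

User-agent: Yandex
Disallow: /

User-agent: *
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network request)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the policy lets this user agent fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
PAGE = "https://example.com/conversations/1/"  # illustrative URL only

for bot in ("bingbot", "Yandex", "Googlebot"):
    verdict = "allowed" if is_allowed(parser, bot, PAGE) else "blocked"
    print(f"{bot}: {verdict}")
```

Two caveats: robots.txt is only a request that well-behaved crawlers choose to honour, and because one crawl feeds both the search index and the AI training, blocking a crawler this way drops the site from that search index as well.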

  • Ban

  • Does this mean we'll drop down the rankings on the Rod Liddle is a Cunt searches?

  • Disappear from the rankings.

  • I would err towards block but I do regularly use google and/or duckduckgo to do targeted searches on lfgss.com when I can't find what I want using built-in search.

    Would definitely block Yandex regardless of AI.

    If it was my site I'd maybe leave duckduckgo alone but it's not my site.

    Actually not even sure I've ever seen a duckduckgo crawler in logs. But it's been a while since I've dealt with web server logs or SEO questions.

  • If it was my site I'd maybe leave duckduckgo alone but it's not my site.

    This is actually Microsoft Bing.

    DDG don't really have their own index.

  • Actually not even sure I've ever seen a duckduckgo crawler in logs

    That would be why then.

  • Disappear from the rankings.

    But how will he know he's a cunt?

  • Isn't it better if the AIs learn from good sites like this one as well as, e.g., Mumsnet?

  • But how will he know he's a cunt?

    If he doesn't know by now, he likely never will.

  • Isn't it better if the AIs learn from good sites like this one as well as, e.g., Mumsnet?

    Arguably yes, but there really isn't "learning" here, it doesn't "understand"... it makes a reasonable guess based on the probability of one word following another based on the statistics of what it's seen. The result is what appears to be viable and likely output confidently given, but there is no "expertise" because it hasn't actually learned, it's just mimicked.

    What it mimics is based on what it was trained on, which means that your posting on websites that form part of the training set translates your effort into free labour to train AI.

    For me it's that part that rankles. It's not so much the quality arguments, as with a big enough training set quality will be achieved anyway, but whether our interacting with other people, other humans, should form a source of free labour for machines and megacorps to profit from.

    "Thus it ever was" is one perspective on that, but we can also take a tiny win and ban bots from this site to satisfy ourselves that at least the primary purpose of this site is for each other: a more human-centric approach to the time we spend on here.
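To make the "probability of one word following another" point concrete, here is a toy sketch. It is a plain bigram counter over a made-up sentence, which is far simpler than what real chatbots run (they use large neural networks over tokens), so treat it only as an illustration of the statistical idea; the corpus and function names are invented for the example.

```python
from collections import Counter, defaultdict


def train_bigrams(text: str) -> dict:
    """Count how often each word is followed by each other word."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def next_word_probabilities(counts: dict, word: str) -> dict:
    """Turn the follower counts for one word into probabilities."""
    followers = counts.get(word.lower(), Counter())
    total = sum(followers.values())
    if total == 0:
        return {}
    return {w: n / total for w, n in followers.items()}


def most_likely_next(counts: dict, word: str):
    """Pick the most frequently seen follower, or None if the word is unseen."""
    probabilities = next_word_probabilities(counts, word)
    if not probabilities:
        return None
    return max(probabilities, key=probabilities.get)


# Made-up training text standing in for scraped forum posts.
corpus = "the chain is worn the chain is dry the tyre is flat"
model = train_bigrams(corpus)

probs = next_word_probabilities(model, "the")
guess = most_likely_next(model, "the")
print(probs, guess)
```

The model "answers" with whatever most often followed the word in its training text. It has no notion of chains or tyres, only counts, which is the mimicry being described.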

  • Would be interesting if you could only find this by word of mouth (or sticker spotting), but then being ‘findable’ brings more cycle peoples which has got to be good no?
    Escaping ChatSHIT and the like now feels like the horse that bolted…
    …but keeping the bots at bay sounds good

  • Not long ago, if you asked ChatGPT whether it should exist, it said no and quoted a single comment from Hacker News. It's going to have some really dodgy data if it's using us and other community sites as a source of truth.

  • Ban.

    Totally agree on the free labour aspect. That's long been the case with people posting on anti-social media sites, where they are effectively unpaid employees, probably because the EULAs they clicked away when they joined included their consent to data exploitation.

  • Thanks for the additional insight Velocio - it's not my area of expertise so it's interesting to read more detail and know more about the implications. Should you pin this to the front page to make it more visible?

  • The robots will find us eventually.
    I've always been a fan of AI, just for the record.

  • As a dissenting voice, I'm not too fussed about generally available forum information being crawled for AI training. The members-only section is already excluded, so there's a mechanism for people to keep things more private if they want?

  • Agree with this, unless the increased traffic makes lfgss unsustainable

  • I vote no ban, but have a forum specifically set up to poison the well.

  • Less flippantly - I'm not completely opposed to bots, however the increased scope of what they are doing is somewhat troubling.

    They are, after all, using the information that we have provided. I guess there is an argument that they provide a service in return (search rankings etc.), but there is no explicit arrangement there.

    I also think that individuals need to be a whole lot more aware of what they are writing, not just here, but all over the web. The aggregation of (almost literally) all things written on the web is heading towards the trivialisation of linking profiles / people / opinions together.

    What's the answer then? Ban for the moment, until the water is less muddied and the balance of information is less asymmetric?

  • If it's costing us (you) money and it's just likely to help MS and Google profit, then I'm even more in favour of banning the crawlers.

    As you said, we don't need them for search ranking so what are we getting out of it?

  • Although the site is established so doesn't need to be found, it's still a great resource for a lot of slightly niche bike knowledge and regularly seems to be near the top if you search about how to fix something or is x compatible with y. I'm guessing this knowledge would end up found less often too. Plus it would be nice to have a small part in the eventual destruction of humanity by robots that get stuck in a pun loop occasionally whilst crushing our skulls to dust.


This site is being used to train AI, should we ban all crawlers, i.e. GoogleBot, Bing, Yandex, etc?

Posted by Velocio (@Velocio)
