• It's possible, but not trivial in an automated way if the task of the crawlers is to crawl millions of websites as quickly as possible.
    It's not worth it unless a human wants it to specifically target certain websites (which is unlikely in this situation).

  • Fuck, that makes me sound nice. Stupid fucking computers.

  • what's stopping a bot having a user login

  • I like the idea of members-only non-bike chat, but being generous and sharing bike knowledge with the bots and non-members. Spread that bike chat around.

  • We don't actually use ReCAPTCHA.

    We rely on the fact that they need JavaScript to launch a dialog and navigate a passwordless email flow and then populate the dialog.

    It turns out it is very very effective.

    We have spammers, those ones are humans.

  • Does Hippy identify as human?

  • I've just set Misc & Meaningless to "Members only", meaning it now disappears from view for guests.

  • Broke my Weekly Photo Comp bot

    Edit: never mind - just realised the API will take my access_token

  • Would you like an access token instruction?

    Basically, find your cookie for this site, add its value as ?access_token= to the URL, and then you're good.

    You're using the API right?
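A minimal sketch of the "add it as ?access_token=" step, in case it helps anyone else with a bot. The endpoint URL and token value below are made up for illustration; only the `access_token` query parameter name comes from the post above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_access_token(url: str, token: str) -> str:
    """Return `url` with access_token=<token> added to (or replaced in) the query string."""
    parts = urlsplit(url)
    # Keep existing query parameters, dropping any stale access_token.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "access_token"
    ]
    query.append(("access_token", token))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical API endpoint and token, for illustration only.
print(with_access_token("https://api.example.com/v1/things?limit=25", "abc123"))
```

Treat the token like a password: anything that logs full URLs will also log it.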

  • Just figured it out at the same time :) All good

  • Nice, I tried to make it obvious to developers 😁

  • I am also going to block the following ASNs:

    • AS8075 - MICROSOFT-CORP-MSN-AS-BLOCK
    • AS15169 - GOOGLE
    • AS60868 - BRANDWATCH-AS
    • AS13238 - YANDEX
    • AS14618 - AMAZON-AES
    • AS714 - APPLE-ENGINEERING
    • AS136907 - HWCLOUDS-AS-AP HUAWEI CLOUDS
    • AS50300 - CUSTDC
    • AS32934 - FACEBOOK

    These represent 99% of bot traffic.

    @cyclotron3k do any of these impact you? If so, what is(/are) your public IP address(es)?
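For anyone wondering what blocking by ASN amounts to in practice: each ASN announces a set of IP prefixes, and a request is refused if its source address falls inside one of them. The sketch below is not how this site does it (that is not described in the thread), and the prefixes are placeholders from the reserved documentation ranges, not the real announcements for these networks.

```python
import ipaddress

# Placeholder prefixes (RFC 5737 documentation ranges), NOT real announcements.
# A real list would come from routing data for each ASN.
BLOCKED_PREFIXES = {
    "AS8075": ["192.0.2.0/24"],
    "AS15169": ["198.51.100.0/24"],
}

# Parse the prefixes once so each lookup is just a membership test.
NETWORKS = [
    (asn, ipaddress.ip_network(prefix))
    for asn, prefixes in BLOCKED_PREFIXES.items()
    for prefix in prefixes
]


def blocked_asn(ip: str):
    """Return the blocked ASN whose prefix contains `ip`, or None if it is not blocked."""
    address = ipaddress.ip_address(ip)
    for asn, network in NETWORKS:
        if address in network:
            return asn
    return None


print(blocked_asn("192.0.2.17"))   # inside the placeholder AS8075 prefix
print(blocked_asn("203.0.113.5"))  # not in any listed prefix
```

A bot on a home connection sits outside all of these prefixes, which is why a residential IP is unaffected.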

  • My bot runs on a raspberry pi sat on my desk, so I'll be ok.

  • Also... since blocking those AS numbers barely an hour ago... all systems are reporting better performance.

    Bing alone was sending 47K requests per day(!) as they've massively increased their crawl recently. My speculation is that they've realised AI/ML provides a second source of value on top of the search index, and that if they bet big enough, Microsoft can challenge Google on search and advertising. This Bing crawl rate was double the Google one.

  • Another vote to ban. For everyone’s sake, even the AIs.

  • I asked Bard:

    [Attachment: Screenshot 2023-04-24 at 10.09.58.png]

  • 47K requests per day(!)

    47,000 of anything is a lot, but, for a scale, how many requests in total was the site receiving?
    Also, thanks for being on top of this and allowing those with better knowledge of the subject to opine.

  • Bing was about 5% of the total requests for pages yesterday, there were 970K page requests.

    Though with a long-tail of bots now being squashed, and Misc & Meaningless now going private, I expect the page view count to go down significantly and to better represent page views by members of the community rather than just a big total including bots and guests.

  • It’s as if a million robotic voices cried out in terror and were suddenly silenced.

  • We've so far shed about 400K requests per day if the current trend holds... over 40% of total volume was bots or guests.
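As a rough check on the percentages quoted in the thread, using only the figures already posted (47K Bing requests per day, 970K page requests yesterday, about 400K shed) and assuming they are all measured against the same daily total:

```python
def share(part: int, total: int) -> float:
    """Return `part` as a percentage of `total`."""
    return 100 * part / total


total_page_requests = 970_000  # page requests yesterday
bing_requests = 47_000         # Bing requests per day
shed_requests = 400_000        # requests per day shed so far

print(f"Bing share: {share(bing_requests, total_page_requests):.1f}%")
print(f"Shed share: {share(shed_requests, total_page_requests):.1f}%")
```

That works out to roughly 4.8% for Bing and 41.2% shed, in line with "about 5%" and "over 40%".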

  • How much are you going to charge for api access to the vital lfgss corpus of data?

  • Hah, API is wide open. But that would mean someone read the docs, which I don't think they will

  • I joke, but look at what Reddit might be charging: whilst lufguss clearly isn’t in that league, the corpus of data here would be quite->really valuable. Could you agree exclusivity with one of those companies for $x per year?

  • At this point in time it's a data land grab, volume is more valuable than quality — so we would be very low value.

    Additionally companies like Google, Microsoft, OpenAI — none of them want tens of thousands of contracts with content suppliers. They can barely cope with a few tens of contracts with entities like Reddit, Conde Nast, etc. We are too small to be a consideration.

    If they pivot to quality.
    If they want some niche bicycle conversations.
    If they want colloquial UK English.
    If they cannot find those things through other sites that remain open and care less about the trade-offs.

    Then perhaps we'd be in a good position. But that's a lot of unlikely ifs.

  • Just figured it out at the same time :) All good

    I may have actually killed your bot now, if so, please have it use a bespoke user-agent, and then DM me what it is and I will add the user agent to an allow list. A random string of characters is a good user-agent :D
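A sketch of both halves of that suggestion: the bot generates a random user-agent once, and the server side checks incoming requests against an allow list. The prefix `photo-comp-bot` and the function names are made up for illustration; how the real allow list is stored and checked is not described here.

```python
import secrets


def make_user_agent(prefix: str = "photo-comp-bot") -> str:
    """Build a bespoke user-agent with a random suffix (hypothetical naming scheme)."""
    return f"{prefix}/{secrets.token_hex(8)}"


def is_allowed(user_agent: str, allow_list: set) -> bool:
    """Exact-match check of a request's user-agent against the allow list."""
    return user_agent in allow_list


# Bot owner: generate once, store it, and send it with every request.
bot_user_agent = make_user_agent()

# Site admin: add the DM'd string to the allow list.
allow_list = {bot_user_agent}

print(is_allowed(bot_user_agent, allow_list))
print(is_allowed("SomeOtherCrawler/1.0", allow_list))
```

The random suffix matters because a user-agent is trivially spoofable; an unguessable string only works as long as it stays between the bot owner and the admin.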

About

This site is being used to train AI, should we ban all crawlers, i.e. GoogleBot, Bing, Yandex, etc?

Posted by Velocio (@Velocio)
