A major bug in AuraSearch's crawler was just fixed, in the robots.txt parser.
When the crawler couldn't find a "User-agent" line in robots.txt, it used to crash, so a long time ago I added code to prepend "User-agent: *" and assume that whatever follows applies to all user agents. And since most robots.txt files exist mainly to define Disallows, those paths end up disallowed for the crawler.
Hence, the crawler takes a conservative approach when a robots.txt is improperly written. The result is that some paths that might have been allowed get blocked, either because of improperly written robots.txt files or because of bugs in the robots.txt parser.
However, my check for "User-agent" was case-sensitive, so I was only matching one particular casing. That caused certain pages that were disallowed only for some user agents to also be disallowed for AuraSearch's crawler.
This has been fixed. This means the crawler will be able to crawl more pages than before, lol.
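For anyone curious, here's a minimal sketch of the idea in Go. This isn't AuraSearch's actual parser code; the function and names below are made up for illustration. The point is to look for a User-agent line case-insensitively and only prepend "User-agent: *" when none is found.

```go
package main

import (
	"fmt"
	"strings"
)

// ensureUserAgent is a hypothetical helper showing the approach described
// above: if a robots.txt body contains no "User-agent" line at all, prepend
// "User-agent: *" so the Disallow rules that follow are treated as applying
// to every crawler. The field name is compared case-insensitively, since
// robots.txt files in the wild use "User-agent", "User-Agent", "user-agent",
// and so on.
func ensureUserAgent(robots string) string {
	for _, line := range strings.Split(robots, "\n") {
		// Take everything before the ":" and compare it case-insensitively.
		field, _, _ := strings.Cut(strings.TrimSpace(line), ":")
		if strings.EqualFold(strings.TrimSpace(field), "user-agent") {
			return robots // a User-agent line exists; leave the file alone
		}
	}
	// No User-agent line found: assume the rules apply to all user agents.
	return "User-agent: *\n" + robots
}

func main() {
	// An improperly written robots.txt with Disallows but no User-agent line.
	fmt.Println(ensureUserAgent("Disallow: /private/\n"))
}
```

Whether the real fix normalizes the casing or just matches the spec's "User-agent" exactly, the effect is the same; a case-insensitive comparison is simply the more forgiving option.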
Jul 09 · 2 days ago · 👍 sbr
2 Comments ↓
The spec says “User-agent”, not with a capital “A”, so technically using that casing _is_ the correct approach.
Anyway, nice that you caught the bug.
🚀 clseibold [OP] · Jul 09 at 12:09:
@sbr Yep, I know what the spec says. I was just using the wrong casing in my code by accident. The code to check for a User-agent was originally added a few years ago when the crawler came upon robots.txt files that had no user-agent, but defined Disallows. That's the "improperly written robots.txt" files I'm talking about.