Comment by ☀️ sbr

Re: "I've just added a *very* experimental AI detector to my..."

In: u/clseibold

Semi-related, but I find HTTP-proxied content in Gemini search engines rather annoying. It tends to be a lot of low-quality content. If there were a way to filter for authentic small-web content, I would use it every time. A basic heuristic could “down-rank” a site's relevance when it has too many unique results per domain or too many external HTTP refs per page. Pages containing native => links would be preferable, especially those linking out to other gemini:// pages. It would be a bit like the reverse of ranking by incoming links.
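A rough sketch of the "too many external HTTP refs per page" idea, in Python. This is only an illustration of the heuristic described above, not how any existing Gemini search engine works. The function names, the 0.5 ratio threshold, and the 0.5 penalty are all made-up assumptions.

```python
from urllib.parse import urlparse


def link_schemes(gemtext: str) -> list[str]:
    """Return the URL scheme of every gemtext link line ('=> URL [label]')."""
    schemes = []
    preformatted = False
    for line in gemtext.splitlines():
        # Lines starting with ``` toggle preformatted mode; links inside are ignored.
        if line.startswith("```"):
            preformatted = not preformatted
            continue
        if preformatted or not line.startswith("=>"):
            continue
        parts = line[2:].split(maxsplit=1)
        if not parts:
            continue
        # Relative links have no scheme; count them as native gemini links.
        schemes.append(urlparse(parts[0]).scheme.lower() or "gemini")
    return schemes


def relevance_multiplier(gemtext: str,
                         max_http_ratio: float = 0.5,
                         penalty: float = 0.5) -> float:
    """Hypothetical down-rank factor: penalize pages whose links are mostly HTTP(S)."""
    schemes = link_schemes(gemtext)
    if not schemes:
        return 1.0
    http_links = sum(s in ("http", "https") for s in schemes)
    if http_links / len(schemes) > max_http_ratio:
        return penalty
    return 1.0
```

A crawler could multiply a page's existing relevance score by this factor. A per-domain result cap would be a separate check on top of it.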

☀️ sbr

Jun 27 · 10 days ago

5 Later Comments ↓

☯️ dragfyre · Jun 27 at 13:17:

isn't it true that AI-generated content detectors are notoriously unreliable? last i checked (admittedly over a year ago), there were big issues with false positives showing up for real content written by English language learners, for instance.

🚀 clseibold [OP] · Jun 27 at 23:03:

@dragfyre Yeah, they can be somewhat unreliable. That's why this was an experiment to see how many false positives I'm getting within geminispace. I'm considering removing the feature, because the AI detection is way too slow to begin with.

Btw, for anyone wondering, I'm using desklib/ai-text-detector-v1.01 (https://huggingface.co/desklib/ai-text-detector-v1.01).

🚀 clseibold [OP] · Jun 27 at 23:11:

@sbr I understand the low-quality part for some sites, but I wouldn't consider gemipedia, YT, or Twitch necessarily low quality. Also, those are not part of the results and are limited to 1-2 results on the first page. I use this feature quite a lot, personally, so I don't have to touch the web browser as often for entertainment. I could add an option to disable them, I suppose.

The stackoverflow mirror on noulin is already blocked from being crawled, mainly because I didn't want the crawler spending so much time on that site before having crawled the rest of geminispace. Although, I do think having a lot of the decent knowledge of stackoverflow is useful.

The only other thing I can think of that you might be seeing in terms of proxies are the news sites? Do you have any other sites you would find annoying to see? A couple of these news proxy sites have already been blocked from the aggregator but not the search results.

One option I could experiment with is something a couple of web search engines are doing: personalizing the search engine. You would be able to create a client cert to log in and downrank specific domains for your own personalized search.
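A minimal sketch of what per-user downranking could look like, in Python. It assumes preferences are stored per client-cert fingerprint as a domain-to-multiplier map; the `PREFERENCES` store, the `personalize` function, and the example domains are hypothetical and not part of AuraSearch.

```python
from urllib.parse import urlparse

# Hypothetical store: client-cert fingerprint -> {domain: score multiplier}.
PREFERENCES = {
    "sha256:example-fingerprint": {"news-proxy.example": 0.1},
}


def personalize(results, fingerprint):
    """Re-sort (url, score) pairs using the user's per-domain multipliers."""
    weights = PREFERENCES.get(fingerprint, {})

    def adjusted(item):
        url, score = item
        host = urlparse(url).hostname or ""
        # Domains the user has not configured keep their original score.
        return score * weights.get(host, 1.0)

    return sorted(results, key=adjusted, reverse=True)
```

Users without stored preferences get the results in their original score order, so the default ranking is unaffected.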

🚀 clseibold [OP] · Jun 28 at 00:02:

@sbr Btw, I wanted to mention that AuraSearch is different from Kennedy and TLGS in that AuraSearch is the only search engine that *doesn't* use a link-based ranking algorithm (neither HITS/SALSA nor PageRank). I made this choice specifically because I personally don't think it makes sense to rank based on links (although downranking based on links is interesting... haven't come across that idea before).

If you want to, you can make some feature requests on the AuraSearch subspace here on BBS: gemini://bbs.geminispace.org/s/AuraSearch

I'm trying to take into account all the different use-cases and features that people would want, and if that means I'll have to create a personalization-type system, then I'll do that :D

🚀 clseibold [OP] · Jun 28 at 00:46:

Update: I have removed the AI detector from the crawler because it's way too slow. Instead, people can manually submit a URL to the new AI Detector Tool, and it will detect AI on-demand.

— AI Detector Tool

Original Post

🚀 clseibold

I've just added a *very* experimental AI detector to my search engine crawler. Unfortunately, it's *extremely* slow, so I will have to find a different solution for ai detection. I will probably have a separate process that just goes through the entire db and runs the ai detector on everything, at least to get every row filled. After that's done for everything in the db, I will just run the ai detector only on pages that have been detected as having changed. Idk... we'll see. I also need to...

💬 8 comments · 1 like · Jun 27 · 10 days ago

