I was just thinking: since Gemini is line-based, I'd imagine you could parse every line of a gemtext document concurrently. Has anybody ever tried this? I do wonder if it would actually be faster, and whether the gain is significant enough to justify the effort.

I'm currently writing a test parser that parses the lines of a gemtext document concurrently, using a thread pool in Zig, just to see what happens. I'd imagine the speed increase for small documents is almost nonexistent, but maybe this parser will work well for larger documents or more complex document formats.

Posted in: s/Gemini

🚀 clseibold

Mar 25 · 3 months ago · 👍 fab, yaky

17 Comments ↓

🚀 mono · Mar 25 at 10:40:

without actually measuring it, I guess parsing gemtext is so trivial that a thread adds more overhead than it would take to just parse the line already. you could bucket them to use the threads more effectively, but spreading and merging the lines adds more overhead again. I don't think it's worth it, even for big documents, as long as they fit into memory.
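To make the bucketing idea concrete, here is a minimal Java sketch (Java only because it is the language of the snippets later in this thread, not the OP's Zig). It splits the lines into contiguous buckets, parses each bucket as one pool task, and merges the results in bucket order. `BucketedParse`, `parseLine`, and the bucket size are hypothetical placeholders, and nothing here has been measured.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class BucketedParse {
    // Placeholder for real per-line work: tags the line with its length.
    static String parseLine(String line) {
        return line.length() + ":" + line;
    }

    // Parse contiguous buckets of lines on a pool, then merge them in order.
    // bucketSize and threads must both be at least 1.
    static List<String> parseInBuckets(List<String> lines, int bucketSize, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> pending = new ArrayList<>();
            for (int start = 0; start < lines.size(); start += bucketSize) {
                int end = Math.min(start + bucketSize, lines.size());
                List<String> bucket = lines.subList(start, end);
                pending.add(pool.submit(() -> {
                    List<String> out = new ArrayList<>(bucket.size());
                    for (String line : bucket) {
                        out.add(parseLine(line));
                    }
                    return out;
                }));
            }
            // Futures are consumed in submission order, so document order is kept
            // no matter which bucket finishes first.
            List<String> merged = new ArrayList<>(lines.size());
            for (Future<List<String>> f : pending) {
                merged.addAll(f.get());
            }
            return merged;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

The splitting and the final `addAll` loop are the "spreading and merging" overhead mentioned above; whether they outweigh the parallel work is something only a measurement can answer.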

🚀 clseibold [OP] · Mar 25 at 10:40:

@HanzBrix I feel like since most gemtext lines just have a 3-char prefix, it wouldn't actually matter, *unless* you intend to add in additional parsing for inline markup, then it might actually be interesting. I like supporting inline markup for my gemtext clients, so I think since I'm going to be doing that, I might as well give concurrent parsing a try.

I've got a concurrent parser done and working, although I still need to add the inline markup parsing stuff next. After that I'll see if I can post some basic benchmarks.

🚀 clseibold [OP] · Mar 25 at 10:42:

@mono You're right that since gemtext is so simple and only requires looking at the first 3 chars of a line, it probably won't matter. But I still wonder if it'll help with the inline markup parsing that I want to add.
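For illustration, a minimal Java sketch of how little work the line-type decision involves: a handful of prefix checks on at most three characters. The class and enum names are hypothetical, and the sketch deliberately ignores whether the line sits inside a preformatted block, which depends on earlier lines.

```java
final class GemtextLineType {
    enum Kind { TEXT, LINK, HEADING_1, HEADING_2, HEADING_3, LIST_ITEM, QUOTE, PREFORMAT_TOGGLE }

    // Three backticks, written as escapes so this example does not contain a fence itself.
    static final String FENCE = "\u0060\u0060\u0060";

    // Decide the line type from its prefix alone.
    // Longer prefixes are checked first so "###" is not mistaken for "#".
    static Kind classify(String line) {
        if (line.startsWith(FENCE)) return Kind.PREFORMAT_TOGGLE;
        if (line.startsWith("=>")) return Kind.LINK;
        if (line.startsWith("###")) return Kind.HEADING_3;
        if (line.startsWith("##")) return Kind.HEADING_2;
        if (line.startsWith("#")) return Kind.HEADING_1;
        if (line.startsWith("* ")) return Kind.LIST_ITEM;
        if (line.startsWith(">")) return Kind.QUOTE;
        return Kind.TEXT;
    }
}
```

For example, `GemtextLineType.classify("=> gemini://example.org Home")` yields `LINK`. The link target and label would still need to be split out afterwards, which this sketch does not attempt.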

I'm mostly considering this though for parsing a more complicated format for my Scroll protocol.

🚀 mono · Mar 25 at 10:43:

@clseibold looking forward to your benchmark numbers, hope you'll publish them here

🚀 clseibold [OP] · Mar 25 at 10:55:

@HanzBrix Nah, the spec says nothing about inline markup last time I checked. I don't see how inline markup reduces readability for the visually impaired. I believe some text-to-speech readers can understand *emphasis* and **strong**, if that's what you're referring to.

Visual impairment is not even the reason inline markup was not included in the spec. It wasn't included because the gemtext spec wanted every line to be parseable by the first 3 bytes so that the protocol is ultra easy to parse. That and a lot of people at the time mistakenly believed that parsing inline markup was "too hard", which I still think is a bit ridiculous.

Regardless, supporting inline markup is not the same as forcing people to use it. I'm actually writing a scrolltext parser, but the scrolltext is close enough to gemtext that I've just been calling it a gemtext parser here because people might not know what scrolltext is. I use mostly the same parser for both gemtext and scrolltext, so implementing it for scrolltext means it works for gemtext too.

🚂 MrSVCD · Mar 25 at 11:31:

Doesn't Unicode include control characters for bold, italics, etc.?

🚀 clseibold [OP] · Mar 25 at 11:35:

@MrSVCD No, it contains bold and italics characters for *mathematics* only. They are not control characters and they should *never* be used for strong and emphasis, for a lot of different reasons (they mess up string equality, searching, and the list goes on and on). They are also actually horrible for screen readers, and some programs will detect them as spam because spam emails (and homoglyph attacks) use them. Lastly, they are limited to specific characters, and so don't work for all languages.
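The string-equality and searching problems can be shown in a few lines. This is a minimal Java sketch with hypothetical names: it spells "bold" with the Mathematical Bold Small letters and compares it with the plain ASCII word.

```java
import java.text.Normalizer;

final class MathBoldDemo {
    // "bold" spelled with MATHEMATICAL BOLD SMALL B, O, L, D
    // (U+1D41B, U+1D428, U+1D425, U+1D41D), written as surrogate pairs
    // so the source file stays plain ASCII.
    static final String MATH_BOLD = "\uD835\uDC1B\uD835\uDC28\uD835\uDC25\uD835\uDC1D";

    // Compatibility normalization (NFKC) folds the math letters back to ASCII.
    static String foldCompat(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    // A plain substring search, as a simple text search would do it.
    static boolean found(String haystack, String needle) {
        return haystack.contains(needle);
    }
}
```

Here `MathBoldDemo.MATH_BOLD.equals("bold")` is false and `found("some " + MATH_BOLD + " text", "bold")` is false, while `foldCompat(MATH_BOLD)` gives back plain `bold`. Every program that compares or searches such text would have to normalize it first.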

๐Ÿ›ฐ๏ธ lufte ยท Mar 25 at 17:40:

What about preformatted lines? You would need a sort of coordinator that tells the other threads whether the line they are about to parse is preformatted or not, and for that the coordinator needs to parse the line on its own.

💀 requiem · Mar 25 at 18:16:

Preformatted necessitates parsing line-by-line.

๐Ÿ›ฐ๏ธ lufte ยท Mar 25 at 18:35:

What I mean is that the coordinator is already parsing the line to do that, as it will need to check the first three characters and by then the line is mostly parsed, except maybe for links which are more complex. Anyway, nothing better than data to beat pure speculation.

🚀 clseibold [OP] · Mar 26 at 05:03:

@lufte @HanzBrix Yeah, so one thing I'm trying is to just parse the toggles and then just send everything off to the renderer and the renderer handles whether and where preformat should be toggled, rather than marking which lines are in preformats and which aren't.

Every parsed line contains a slice of the original text, so while preformatted lines will get parsed, I can also look at the original text when the renderer determines that we're in a preformatted toggle. I might just scrap the idea though, idk.

But yeah, this is a good point that if you were to parse preformats correctly, then you would need to do it more linearly.

It might be better to parse the *line type* linearly, and then send the italics and bold parsing off to a thread after you've done the line type parsing.
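A minimal Java sketch of that two-phase split, under stated assumptions: the preformat state is tracked in one sequential pass, and only lines outside preformatted blocks are handed to the pool. `TwoPhaseParse` and `parseInline` are hypothetical; `parseInline` handles single-asterisk emphasis only and stands in for a real inline-markup parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class TwoPhaseParse {
    // Three backticks, written as escapes so this example does not contain a fence itself.
    static final String FENCE = "\u0060\u0060\u0060";

    // Hypothetical stand-in for inline markup parsing: *emphasis* only.
    static String parseInline(String text) {
        return text.replaceAll("\\*([^*]+)\\*", "<em>$1</em>");
    }

    static List<String> parse(List<String> lines, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> slots = new ArrayList<>();
            boolean inPre = false;
            // Phase 1, sequential: the preformat state depends on every earlier toggle.
            for (String line : lines) {
                if (line.startsWith(FENCE)) {
                    inPre = !inPre;
                    continue; // the toggle line itself produces no output here
                }
                if (inPre) {
                    // Preformatted text is kept verbatim, no task needed.
                    slots.add(CompletableFuture.completedFuture("PRE:" + line));
                } else {
                    // Phase 2, parallel: inline markup is independent per line.
                    slots.add(pool.submit(() -> "TXT:" + parseInline(line)));
                }
            }
            List<String> out = new ArrayList<>(slots.size());
            for (Future<String> slot : slots) {
                out.add(slot.get()); // read back in document order
            }
            return out;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

The sequential pass only looks at a prefix per line, so it stays cheap; the pool sees nothing but the part that allocates substrings.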

🚀 clseibold [OP] · Mar 26 at 05:10:

@HanzBrix Since I'm using a thread pool, the overhead would just be the context switch of the thread? I don't have much experience in multithreading, so I don't know how expensive that context switch is.

Italics and bold parsing does require allocation of multiple substrings, which is why I wanted to experiment with this idea in the first place.

🦎 bluesman · Mar 26 at 05:23:

Apropos of (almost) nothing. This thread - no pun intended - got me thinking. I changed the following code in my renderer:

doc.lines().forEach(line->process(line));

to the following:

doc.lines().parallel().forEach(line->process(line));

I can't say it made anything faster but it did scramble the pages in an amusing fashion.

🚀 clseibold [OP] · Mar 26 at 05:27:

@bluesman Oh, yeah. I fix this by just adding a new element to an "ArrayList" (a dynamic array) and *then* spawning the thread to fill in that specific slot in the array. This ensures that things are still ordered correctly.
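The same slot idea, sketched in Java rather than Zig and with hypothetical names (`SlotFill`, `parseLine`). One difference is an assumption of this sketch, not something stated above: the array is sized up front instead of grown one element at a time, so the slots never move while tasks are still writing to them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class SlotFill {
    // Placeholder for real per-line parsing.
    static String parseLine(String line) {
        return "[" + line + "]";
    }

    // Reserve one slot per line, then let each task fill only its own slot.
    static String[] parseIntoSlots(List<String> lines, int threads) {
        String[] slots = new String[lines.size()];
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (int i = 0; i < lines.size(); i++) {
                final int slot = i; // each task captures its own index
                pending.add(pool.submit(() -> {
                    slots[slot] = parseLine(lines.get(slot));
                }));
            }
            // Waiting on every future also makes the tasks' writes visible here.
            for (Future<?> f : pending) {
                f.get();
            }
            return slots;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Because the position is decided before the task starts, the completion order of the tasks no longer affects the order of the output.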

🚀 clseibold [OP] · Mar 26 at 05:31:

I think this parsing lines in parallel thing is probably best suited for a text format with more complex lines than gemini. For example, CSV/TSV might work well with it, perhaps?

Anyways, once I figure out how to time things, I'll post benchmarks.

🦎 bluesman · Mar 26 at 14:35:

@clseibold Yeah, there are ways to maintain order in my example (forEachOrdered, etc.), but I'd have to change how everything works. My "scrambled" example breaks an important rule: only update Swing UI components on the Event Dispatch Thread.

To truly embrace the idea, I'd just have to populate a list of lines with their attributes in parallel, collect them in order and then update the UI separately on the EDT. I'm guessing any performance gains would be minimal - it could even be slower.
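A minimal sketch of that approach, assuming the lines are available as a `List<String>` and using a hypothetical `process` that returns a value instead of touching the UI. Collecting a parallel stream from an ordered source into a list keeps the encounter order, so nothing gets scrambled.

```java
import java.util.List;
import java.util.stream.Collectors;

final class OrderedParallel {
    // Hypothetical per-line work; it must not touch any UI component.
    static String process(String line) {
        return "<" + line + ">";
    }

    // Lines may be processed on several threads, but the collected list
    // has the same order as the input list.
    static List<String> processAll(List<String> lines) {
        return lines.parallelStream()
                .map(OrderedParallel::process)
                .collect(Collectors.toList());
    }
}
```

The finished list can then be handed to the UI in a single `SwingUtilities.invokeLater(...)` call, which keeps every component update on the EDT.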

💀 requiem · Mar 27 at 09:00:

Hahaha, amazing! Post output! 😀

