2022-12-09 - from quarter-baked to half-baked

TL;DR: I wrote a Python script to do the heavy lifting of document generation: it preprocesses all of my markdown files and gets them ready for conversion to Gemini.

Backstory

My original build chain fed a set of input markdown files through some very coarse sed expressions to extract tags, which were dumped into a file and run through a unique sort. Each line of the deduplicated file was then tallied with grep -c, and that produced my tags page with counts. Then each input file was grepped for those tags, even though the tags live in a YAML header, where a tag list can span multiple lines, so tags were sometimes silently missed.
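
For a concrete picture of the failure mode: with front matter like the one below, the tag list wraps onto separate lines, and a line-oriented grep for the tags never sees them. Parsing the header as actual YAML sidesteps the whole thing. A minimal sketch with PyYAML (the note content here is invented, not one of my files):

```python
import yaml  # PyYAML

# A line-oriented grep for "tags:" sees the key but not the
# values, because each tag sits on its own line.
note = """\
---
title: some note
tags:
  - index
  - meta
---
body text here
"""

# Split out the YAML header and parse it for real.
_, header, _body = note.split("---", 2)
meta = yaml.safe_load(header)
print(meta["tags"])  # ['index', 'meta']
```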

This generated a tag index and the individual tag pages that linked to deeper notes.

Git was invoked to determine the last modification date of each note, and that went into the page along with the current date to mark when the page was generated.
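
The git lookup itself just asks for the date of the last commit that touched the file. Something like this sketch (the path and date format are placeholders, not necessarily what my scripts use):

```python
import subprocess

def last_modified(path: str) -> str:
    """ISO 8601 date of the last commit touching `path`."""
    return subprocess.run(
        ["git", "log", "-1", "--format=%cI", "--", path],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

print(last_modified("notes/2022-12-09.md"))  # example path
```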

Each page was then pre-processed to add the navigation links and tags, and to inject backlinks.

I’d like to say this was all done quickly, and that’s why it was so rough and leaned so heavily on bash scripting. Except I probably spent an hour or two each day adding each piece.

The Problem

Two main problems:

The files had to be passed through Pandoc and md2gemini, with content injected at different stages, so the result was a really slow process that only produced Gemini pages. If I wanted to also produce HTML to mirror my Gemini capsule, it was going to require tacking on logic to undo some of the preprocessing, and then just throwing my hands up at the pieces that can’t be massaged back into something that would generate comparable HTML.

Also, each time I bolted on another piece, the overall build slowed down – by an order of magnitude.

Yeah, it takes several minutes to build a handful of text files. That’s pretty terrible.

But don’t you already have something to produce HTML?

Yes, I do, but it’s… not the same. The content that goes into the Gemini pages now is quite different, and I was only getting the Gemini link style to work by processing things in a weird order. For example, I’d like the HTML output to look more like the Gemini output with respect to where links go.

I have a version that produces HTML and it looks cool – to me. The rest of the world would probably throw up a little bit if they saw it.

More Problems

I was originally thinking about the speed issue, not so much the HTML output issue, and about how much quicker things would be if, instead of reprocessing a file at each step to get a different piece of information, I put the tags, the titles, and the links in a database. I was also thinking I could just use SQLite from bash. That would work for the most part, and I did experiment a little with invoking SQLite and inserting tags and whatnot, but I spent about half an hour trying to get GNU parallel to help with the db construction and it wasn’t going well.
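
The shape of the database I was after is simple enough. A minimal sketch with Python’s sqlite3, using illustrative table and column names (not an actual schema from my notes):

```python
import sqlite3

con = sqlite3.connect("notes.db")
con.executescript("""
CREATE TABLE IF NOT EXISTS notes (
    path    TEXT PRIMARY KEY,
    title   TEXT,
    url     TEXT,
    updated TEXT
);
CREATE TABLE IF NOT EXISTS tags (
    path TEXT REFERENCES notes(path),
    tag  TEXT
);
""")

# One pass over each note's parsed front matter fills both tables.
con.execute("INSERT OR REPLACE INTO notes VALUES (?, ?, ?, ?)",
            ("notes/example.md", "some note",           # invented example
             "gemini://example.org/example.gmi", "2022-12-09"))
con.executemany("INSERT INTO tags VALUES (?, ?)",
                [("notes/example.md", t) for t in ("index", "meta")])
con.commit()

# The old sed | sort -u | grep -c tally becomes a single query.
for tag, n in con.execute(
        "SELECT tag, COUNT(*) FROM tags GROUP BY tag ORDER BY tag"):
    print(f"#{tag}: {n}")
```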

And then I was thinking about how I would need to fix the Makefile content to use better functions so I could call the different pieces of new logic from a few different rules… And so I had started creating a functions.sh file to move the more complicated scripting out of the Makefile. That’s when I ran into a new problem: the pieces that should have been easily reusable were slightly different based on where they sat in the pipeline, and it began to snowball.

I think when I wrote large pieces of it I was telling myself not to let the perfect be the enemy of the good, and that it was all just a quick, temporary thing so I could start getting content out.

I realized that swapping out any piece of the pipeline was going to require a rewrite, and that adding new output formats (and I swear I’ll be adding Anki output real soon now, for specially tagged notes) wouldn’t even fit in the current architecture. So I figured I could write a Python script that takes the input files and produces markdown output files, and then my pipeline can take those markdown files and convert them to whatever. (Did I mention I do PDF and ePub for some stuff?)

The Solution

Well, in less time than it took to get the old system to where it was before today, I managed to produce a script that generates all of the intermediate documents. And it takes less than a minute to complete. And adding new input documents should only add fractions of a second to the total time.
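
At a high level it’s just a markdown-to-markdown pass, so everything downstream stays format-agnostic. A rough sketch of the shape (directory names and the helper are placeholders, not the real script):

```python
from pathlib import Path

SRC, OUT = Path("notes"), Path("build/md")  # assumed layout

def preprocess(text: str) -> str:
    """Inject nav links, tag links, and backlinks; output is still markdown."""
    return text  # placeholder: the real work happens here

OUT.mkdir(parents=True, exist_ok=True)
for src in SRC.glob("*.md"):
    (OUT / src.name).write_text(preprocess(src.read_text()))

# Downstream, make can hand build/md/*.md to md2gemini,
# Pandoc (HTML, PDF, ePub), or anything else.
```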

And – well, I think that’s it, actually.

Oh! Atom feeds. I now have really convenient tables for generating the content of the Atom feed. I’ll add those once I finish making new targets in the Makefile that run the Python script as a preprocessor.
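
Generating the feed entries becomes a single query. Roughly like this, reusing the illustrative schema from the sketch above (and only producing entries, not a fully valid feed with ids and authors):

```python
import sqlite3
from xml.sax.saxutils import escape, quoteattr

con = sqlite3.connect("notes.db")  # illustrative db from earlier
entries = []
for title, url, updated in con.execute(
        "SELECT title, url, updated FROM notes "
        "ORDER BY updated DESC LIMIT 20"):
    entries.append(
        f"  <entry><title>{escape(title)}</title>"
        f"<link href={quoteattr(url)}/>"
        f"<updated>{escape(updated)}</updated></entry>")

print('<?xml version="1.0" encoding="utf-8"?>\n'
      '<feed xmlns="http://www.w3.org/2005/Atom">\n'
      + "\n".join(entries) + "\n</feed>")
```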

Other Uses

The sad sed & grep situation was also how I’d been finding notes by tag, and it suffered from the same slowness or accuracy problems, depending on which tool I reached for at any given time. Now I may run the tags-to-SQLite step on saving a note and then create a few aliases that invoke SQLite for local tag searches.
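
The lookup is one query against that same illustrative schema; an alias could wrap a script as small as this (tagsearch.py is a hypothetical name):

```python
# tagsearch.py -- list note paths carrying a given tag
# usage: python tagsearch.py meta
import sqlite3
import sys

con = sqlite3.connect("notes.db")  # illustrative db from earlier
for (path,) in con.execute(
        "SELECT path FROM tags WHERE tag = ? ORDER BY path",
        (sys.argv[1],)):
    print(path)
```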

TODO

Tags

#index

#meta

updated: 2022-12-10 00:20:39

generated: 2024-10-15

