My Experience with Flat-File Databases

2024-10-02

---

Pollux.casa resident HanzBrix recently pondered about databases using flat files. His most recent reply to the discussion^ laid out a hypothetical data storage system for Station or BBS in which user information is kept in separate gemtext files. This effectively splits out data which would be single rows in a database table to separate files, enabling fine control over process locking, with the added feature that the data files can be served directly without further parsing.

In truth, I already use a scheme like this for one of my capsule's services. Almost all of the CGI scripts on Rob's Gemini Capsule use Python with a SQLite backend, but one does not: my OEIS mirror.^^

OEIS was my first experience working with a relatively large dataset. At time of writing, the encyclopedia has about 375000 sequences, the shortest having 1 term and the longest having just over 500. The challenge I set for myself was to create a mirror of the encyclopedia in gemtext that would run as efficiently as possible on my modest VPS.

When I built out the mirror, I'd worked with SQLite databases a little bit, but I wasn't familiar with advanced concepts like indexes. My initial approach was to build a single-table database that had all the sequence information in it; when a user searched for a sequence or inputted a sequence name, a script would pull the necessary information from the table and build a gemtext page on the fly. Upon testing, I found that querying the database, especially searching for substrings in sequences, was unacceptably slow. I wasn't sure how to speed it up, and I didn't want to switch to an always-running database like MariaDB, so I turned to flat files instead. The mirror uses still the flat file paradigm today.

I started by cloning the OEIS's official "oeisdata" GitHub repository. A script crawled the data and created two CSV files: one containing the full sequence for each sequence name, and one containing a list of cross-references for each sequence. Another script created a gemtext file for each sequence, using information from the cross-reference CSV to fill in links at the bottom of the page. Searching involves simply grepping the sequence CSV for a matching substring; if not too many results are returned, the CGI returns a list of all links containing the sequence string. Updating the mirror is a simple process of pulling updates to the repository once a week, checking which sequences have changes, updating the CSV files, and regenerating the affected gemtext pages.

The search function is still not very fast. It can take a few seconds to get results from the CSV file and process the results for the CGI. I'm sure using a proper database with indexes and robust data structures would be much faster. I suspect my service would not scale if the OEIS was even twice as large as it is now, never mind ten times or a hundred times.

My experience with using flat files to power the OEIS mirror has been positive. However, an encyclopedia is a special case because the data in it doesn't change very rapidly and it requires little processing once gathered. I can't imagine it would work very well for data that is constantly updated, like a forum or a social media platform.

^ Re: Is a Flatfile database a terrible idea? and Episode 2

^^ OEIS Mirror in Gemini

---

Up One Level

Home

[Last updated: 2024-10-06]


Source