Anthony S. Gray

It had been years since I’d used a compiled, strongly typed language, but we needed a newer, faster subsystem. It was a perfect excuse to expand my toolbox.

This is not the Rust fungus. I'm not sure what it is, but it's very striking.

The Dfam database stores all kinds of information about transposable element (TE) families. Each family's consensus sequence, the data used to generate it, its curation status, and many other features are included, but we also store genomes annotated with TEs using RepeatMasker, NHMMER, and other tools. While Dfam was relatively young and small, storing the annotation data in its own related database was a reasonable solution, but that approach stopped scaling well around Dfam release 3.8 or 3.9.

The old system had an assembly table in Dfam proper, which contained information about the genome assembly that was annotated, as well as references to an external database sharing a name with the assembly. That external database contained information about the different sequences within the assembly, as well as all of the different types of annotations and links to the related families held in Dfam. There were two issues here. The first is that each assembly required its own database for each run and for each version of Dfam used to annotate it. The MySQL system can handle a few of those, but after hundreds of runs, imports and updates were starting to slow down, and even retrieving the annotations was starting to get sluggish. The second issue is that if one TE family was updated between Dfam versions, the entire set of annotations needed to be rerun. Nothing about the system was necessarily broken, but it was certainly due for an update.

The first language I learned to use was Java, and I had used C++ briefly for a class years ago, but for the most part I had just been using Python and JavaScript since. I’d only heard good things about Rust, and given the scale of the project and the need for the new subsystem to be as fast as possible, it was an obvious choice. The plan was to move all of the data out of the satellite assembly databases onto the file system and index it for fast retrieval, in effect making our own very minimal database. My supervisor put together a quick indexer, and I started building the rest of the new system around it.

It turned out that there were five data types for each assembly. The actual annotations were the largest and most complex dataset, followed by the benchmark annotations and the annotations of simple repeats. The source files for each of these were generated by a Python script and needed to be split out by sequence, compressed, indexed, and stored in their own directory by type. Two other data types were needed to allow other systems like the API to query the data effectively: the length of each TE family annotation’s model, and the names and lengths of the chromosomes or sequences in each assembly. Model length and sequence length were only used because they were included in one of the download formats, but the sequence names were also used to test whether the data existed at all. These last two data types were simply stored in JSON files.
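To make the split-and-index step concrete, here is a minimal sketch of the idea, not the actual Dfam code. It assumes tab-separated annotation lines whose first column is the sequence name, and it leaves out compression and file I/O, which in practice would need an external crate and a real directory layout. The names `Span`, `build_index`, and `fetch` are made up for this example.

```rust
use std::collections::BTreeMap;

/// Byte range of one sequence's annotations inside the packed data.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Span {
    pub offset: usize,
    pub len: usize,
}

/// Groups annotation lines by sequence name (assumed to be the first
/// tab-separated column), packs each group contiguously into one buffer,
/// and records where each group starts and how long it is.
pub fn build_index(lines: &[&str]) -> (Vec<u8>, BTreeMap<String, Span>) {
    let mut groups: BTreeMap<String, Vec<&str>> = BTreeMap::new();
    for &line in lines {
        // Lines with an empty first column are skipped.
        if let Some(name) = line.split('\t').next().filter(|n| !n.is_empty()) {
            groups.entry(name.to_string()).or_default().push(line);
        }
    }

    let mut data = Vec::new();
    let mut index = BTreeMap::new();
    for (name, group) in groups {
        let offset = data.len();
        for line in group {
            data.extend_from_slice(line.as_bytes());
            data.push(b'\n');
        }
        index.insert(name, Span { offset, len: data.len() - offset });
    }
    (data, index)
}

/// Returns the annotations for one sequence, or `None` if the sequence
/// is not in the index (the "does this data exist at all" check).
pub fn fetch<'a>(
    data: &'a [u8],
    index: &BTreeMap<String, Span>,
    name: &str,
) -> Option<&'a str> {
    let span = index.get(name)?;
    let bytes = data.get(span.offset..span.offset + span.len)?;
    std::str::from_utf8(bytes).ok()
}
```

Calling `build_index` on a handful of lines and then `fetch(&data, &index, "chr1")` returns only the `chr1` lines without scanning the rest, which is the whole point of keeping an index next to the data.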

I found the actual process of writing the indexer to be both frustrating and informative. In the end, I started and restarted the entire project three times, because each time I learned more about how Rust expected to operate, I would realize that the issue I’d been working on could have been solved with a change in project structure. It took a bit, but eventually I stopped fighting the tools and put together something that worked beautifully. The speedup, even as a subsystem to our API, was so obvious that we did some basic load testing to make sure it would scale, then rolled it out.

The least comfortable part of writing in Rust was just how verbose everything needed to be. I’d never used a language that attaches functions to structs and enums (C++ probably has them, but I was still learning the basics when I used it), and a large proportion of the final line count is spent defining the different annotation download formats and how to convert between them. Also, because Rust insists that every operation be as safe as possible, just retrieving data from a JSON file was surprisingly complicated.
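For readers who haven't seen functions attached to an enum, here is a small sketch of what that looks like. It is not the real Dfam code: the two formats (`Tsv` and `Bed`), the four-column layout, and every name in it are assumptions picked for illustration. The one real-world detail it leans on is that BED coordinates are 0-based and half-open, while the TSV layout here is assumed to be 1-based and inclusive.

```rust
use std::str::FromStr;

/// Hypothetical download formats; the real set is not shown in this post.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DownloadFormat {
    Tsv,
    Bed,
}

/// One annotation, stored internally with 1-based inclusive coordinates.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Annotation {
    pub seq: String,
    pub start: u64,
    pub end: u64,
    pub family: String,
}

impl FromStr for DownloadFormat {
    type Err = String;

    /// Parses a format name such as "tsv" or "BED".
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_ascii_lowercase().as_str() {
            "tsv" => Ok(Self::Tsv),
            "bed" => Ok(Self::Bed),
            other => Err(format!("unknown format: {other}")),
        }
    }
}

impl DownloadFormat {
    /// Parses one line written in this format.
    pub fn parse_line(self, line: &str) -> Result<Annotation, String> {
        let fields: Vec<&str> = line.trim_end().split('\t').collect();
        if fields.len() != 4 {
            return Err(format!("expected 4 fields, found {}", fields.len()));
        }
        let start: u64 = fields[1].parse().map_err(|e| format!("bad start: {e}"))?;
        let end: u64 = fields[2].parse().map_err(|e| format!("bad end: {e}"))?;
        // BED starts are 0-based, so shift them to the 1-based internal form.
        let start = match self {
            Self::Tsv => start,
            Self::Bed => start + 1,
        };
        Ok(Annotation {
            seq: fields[0].to_string(),
            start,
            end,
            family: fields[3].to_string(),
        })
    }

    /// Writes one annotation in this format.
    pub fn render(self, a: &Annotation) -> String {
        let start = match self {
            Self::Tsv => a.start,
            Self::Bed => a.start.saturating_sub(1),
        };
        format!("{}\t{}\t{}\t{}", a.seq, start, a.end, a.family)
    }

    /// Converts a line from this format into another one.
    pub fn convert(self, to: DownloadFormat, line: &str) -> Result<String, String> {
        Ok(to.render(&self.parse_line(line)?))
    }
}
```

Every step that can fail returns a `Result`, so the caller has to decide what to do with a malformed line or an unknown format name. That is a fair taste of the verbosity, and also of why the compiler catches so much.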

My complaints were immediately obvious, since they stem from the mindset shift necessary to go from writing in Python or JavaScript to writing in Rust. However, the parts of Rust I like didn’t become apparent until I switched back to Python for my next project. I felt the absence of the guardrails provided by Rust right away, as Python happily let me write terrible code without complaint. I realized that I had come to depend on the compiler to tell me when I was doing something wrong, and while Python’s looser rules make it more convenient for hacking out simple scripts, it requires much more planning and focus when doing anything complex.

In the end, I came to really like Rust. The learning curve is more of a vertical line than a curve, but now that I’m used to it, I will certainly be using it in the future. But still only when warranted.
