Michael Gaare on Datomic Onprem storage implementation
Datomic Onprem's code/data locality model assumes perfect caching, which mostly works out because of immutability (major handwaving). Datomic Onprem makes some very specific tradeoffs with respect to indexes and storage implementation, and at scale this abstraction does leak, which is why Datomic makes us give hints in the form of storage partitions.
What about Datomic Cloud? Check out my Datomic Cloud workshop notes from Conj 2017, which cover the differences.
The following is written by Michael Gaare following a real-life conversation and is re-printed here with permission.
Why not dynamically add levels to Datomic's 3-level tree?
MICHAEL GAARE: Everything will get slower, in both the peers and the transactor. Currently they can keep the root node and big segments of the directory (the second level) in cache. That means getting a segment (the third level) generally requires just one fetch from storage. The worst-case scenario is two fetches, if the relevant part of the directory needs to be fetched first. Note that both query and transaction operations can require fetching multiple segments. Each of these operations would get significantly slower with another directory level: it's quite unlikely that much of that lower-level directory could be cached at any one time, so your average-case number of storage requests to fetch a segment goes up by 50-100%. This is particularly problematic for the transactor, since writes are already the constrained resource in Datomic's design.
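A back-of-the-envelope sketch of the fetch-count argument above. The cache model and numbers here are my own simplification, not Datomic's actual behavior:

```python
# Hypothetical model: with the root and essentially all of the directory
# cached, reaching a leaf segment costs one storage round trip; an extra
# directory level that rarely fits in cache adds another round trip.

def fetches_per_segment(levels: int, cached_levels: int) -> int:
    """Storage fetches needed to reach a leaf segment when the top
    `cached_levels` levels of the tree are already in cache."""
    return max(levels - cached_levels, 0)

# 3-level tree, root + directory cached -> 1 fetch per segment.
three_level = fetches_per_segment(3, 2)
# 4-level tree, same cache budget -> 2 fetches per segment.
four_level = fetches_per_segment(4, 2)
print(three_level, four_level)  # 1 2
```

Going from one fetch to two per uncached segment is where the "50-100% more requests" estimate comes from, depending on how often the extra level happens to be cached.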
MICHAEL GAARE: The moment the tree resizes from 3 levels to 4, a significant portion of the index will have to be re-written. It's possible that this could be optimized so that just the directory layer gets restructured at first; I'm not sure. The worst case is that the entire index would have to be re-written. It's possible that the transactor would have to stop accepting writes for a significant period of time when this happens.
MICHAEL GAARE: Larger database size puts particular pressure on the transactor with the current design of Datomic, and the problems would only get worse if the tree structure got deeper. Even with the current design, with larger databases it's possible to get yourself into a situation (e.g., by doing a large number of excisions at once) where your transactor is crashing and you cannot recover except by restoring from a working backup.
Segment storage (fressian + gzip)
MICHAEL GAARE: A segment is a set of Datoms representing a slice of a particular index. Depending on the size of the Datoms, there could be (according to what I've read, anyway) 1,000-5,000 Datoms in a segment. To store a segment, Datomic encodes it as binary with Fressian and then gzips the result. To read, it does the opposite. This allows for efficient storage and retrieval, but it has some implications. It means there's no "give me this specific Datom" access pattern from storage. To get an individual Datom, the peer or the transactor needs to read the segment it's in, then decompress and deserialize it, at which point all the Datoms from the segment are in the cache.
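A rough Python sketch of that write/read round trip. Datomic's actual encoding is Fressian on the JVM; json stands in for it here, and the `[e a v tx added?]` datom shape and field values are illustrative:

```python
import gzip
import json

def write_segment(datoms):
    """Encode a whole segment (a sorted run of datoms) as one blob:
    serialize, then gzip. Storage only ever sees opaque blobs."""
    return gzip.compress(json.dumps(datoms).encode("utf-8"))

def read_segment(blob):
    """The reverse: gunzip, then deserialize. Note there is no way to
    pull out a single datom -- the entire segment materializes at once."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))

segment = [[17, "user/name", "alice", 1001, True],
           [17, "user/email", "a@example.com", 1001, True]]
blob = write_segment(segment)
assert read_segment(blob) == segment  # round trip recovers every datom
```

The segment-at-a-time granularity is the key point: one storage fetch yields thousands of datoms, which is great when you wanted their neighbors too and wasteful when you wanted just one.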
MICHAEL GAARE: This also means that there are some pathological read and write patterns in Datomic. Given a large enough database, if you run a query that ends up doing a large amount of effectively random access of Datoms in a given index, it's possible that you'll just blow up your peer. If your write patterns involve adding Datoms in many disparate segments, transactor performance will seriously suffer as it will have to re-write each of those segments completely at each re-index.
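The random-access pathology can be made concrete with a toy model. The segment size and access patterns below are hypothetical; the point is that cost scales with distinct segments touched, not datoms read:

```python
# Each distinct segment hit costs a full fetch + gunzip + deserialize,
# so a query's storage cost is driven by how many segments it touches.

DATOMS_PER_SEGMENT = 3000  # within the 1,000-5,000 range quoted above

def segments_touched(datom_positions, per_segment=DATOMS_PER_SEGMENT):
    """Distinct segments needed to read datoms at the given index positions."""
    return len({pos // per_segment for pos in datom_positions})

# 1,000 adjacent datoms: they share a segment -> 1 fetch.
clustered = segments_touched(range(1000))
# 1,000 datoms scattered across a billion-datom index -> 1,000 fetches,
# a thousand-fold difference for the same number of datoms read.
scattered = segments_touched(range(0, 1_000_000_000, 1_000_000))
print(clustered, scattered)  # 1 1000
```

The write-side version of the same problem is worse: touching datoms in many disparate segments means each of those segments gets completely re-written at the next re-index.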
MICHAEL GAARE: That's one reason why you need to think carefully about using Datomic's partitioning features if you're going to be running large databases. If you design your partitioning scheme to match your read and write patterns, that will help Datomic by grouping related data together physically in the indexes, reducing the number of segments that need to be touched.
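Partitions achieve that physical grouping because a Datomic entity id embeds its partition in its high bits, so entities in the same partition sort adjacently in the E-leading indexes. A sketch of the idea, with a made-up bit layout rather than Datomic's actual encoding:

```python
# Illustrative only: the real entity-id bit layout is Datomic's internal
# detail; what matters is that partition occupies the high bits.

PARTITION_BITS = 20

def entity_id(partition: int, seq: int) -> int:
    """Compose an entity id whose high bits are the partition number."""
    return (partition << (64 - PARTITION_BITS)) | seq

# Two entities in the same partition get adjacent ids, so their datoms
# sort next to each other in EAVT and tend to share segments...
same = [entity_id(1, 1), entity_id(1, 2)]
# ...while an entity in another partition sorts far away.
other = entity_id(2, 1)
assert same[0] < same[1] < other
assert same[1] - same[0] < other - same[1]
```

If your partition scheme mirrors how data is read and written together, the datoms a query or transaction needs land in fewer segments, which is exactly the hint the opening paragraph of these notes was referring to.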