Is Datomic strictly better than Facebook's graph datastore?

My notes and highlights from TAO: The power of the graph (2013) by Mark Marchukov. All emphasis added, including headings.
This post compares Datomic (today, 2017, post-Datomic Cloud announcement) to Facebook's graph datastore as described in 2013. They are almost the same, except TAO is a triple store & eventually consistent; Datomic is a 5-store and strongly consistent. Contrary to popular belief, Datomic's single-writer-per-database ACID is exactly the same as Facebook's single-writer-per-shard, so I think Datomic Cloud can absolutely scale up to to Facebook-sized write loads, and then beyond due to immutability.
  • This post about TAO is from 2013, Datomic Cloud was just announced at Conj '17.
  • is Datomic deployed at Facebook scale anywhere? Seems unlikely
I would like to discuss this with you, reach me on reddit, email or Clojurians slack @dustingetz.

Facebook load is dashboard shaped, read dominated

Every time any one of over a billion active users visits Facebook through a desktop browser or on a mobile device, they are presented with hundreds of pieces of information from the social graph. Users see News Feed stories; comments, likes, and shares for those stories; photos and check-ins from their friends -- the list goes on. The high degree of output customization, combined with a high update rate of a typical user’s News Feed, makes it impossible to generate the views presented to users ahead of time. Thus, the data set must be retrieved and rendered on the fly in a few hundred milliseconds.
the data set is not easily partitionable, and by the tendency of some items, such as photos of celebrities, to have request rates that can spike significantly
Multiply this by the millions of times per second this kind of highly customized data set must be delivered to users, and you have a constantly changing, read-dominated workload

RDBMS does this poorly

even the best relational database technology available is a poor match for this challenge unless it is supplemented by a large distributed cache that offloads the persistent store.
The spatial locality on ordered data sets that a block cache attempts to exploit is not common in Facebook workloads. Instead, what we call creation time locality dominates the workload -- a data item is likely to be accessed if it has been recently created. Another source of mismatch between our workload and the design assumptions of a block cache is the fact that a relatively large percentage of requests are for relations that do not exist -- e.g., “Does this user like that story?” is false for most of the stories in a user’s News Feed. Given the overall lack of spatial locality, pulling several kilobytes of data into a block cache to answer such queries just pollutes the cache and contributes to the lower overall hit rate in the block cache of a persistent store.

2005: Zuck adds memcache and it helps

Memcache has played that role since Mark Zuckerberg installed it on Facebook’s Apache web servers back in 2005
The use of memcache vastly improved the memory efficiency of caching the social graph

Rich Hickey on Caching :)

RICH HICKEY (2012): What goes in the cache? What form does it take? When does it get invalidated? Whose problem are all these questions? Yours! Your problem!
Or maybe you're buying some fancy ORM that makes it your problem with another a layer on top of your problem. Now you have two problems.
Not that there were better options back in 2005:
  • 2004 Feb - Facebook Harvard launch
  • 2005 Oct - Expands to 21 universities
  • 2005 Dec - 6 million users (2000 colleges)
  • 2006 Sep - open to public
So mysql, memcache and sharding got facebook really far, it was just a pain to code:

Product engineers now must care about backend complexity

the code that product engineers had to write for storing and retrieving their data became quite complex. Even though memcache has “cache” in its name, it’s really a general-purpose networked in-memory data store with a key-value data model. [Memcache] will not automatically fill itself on a cache miss or maintain cache consistency. Product engineers had to work with two data stores and very different data models: a large cluster of MySQL servers for storing data persistently in relational tables, and an equally large collection of memcache servers for storing and serving flat key-value pairs derived (some indirectly) from the results of SQL queries. Even with most of the common chores encapsulated in a data access library, using the memcache-MySQL combination efficiently as a data store required quite a bit of knowledge of system internals on the part of product engineers. Inevitably, some made mistakes that led to bugs, user-visible inconsistencies, and site performance issues. In addition, changing table schemas as products evolved required coordination between engineers and MySQL cluster operators. This slowed down the change-debug-release cycle and didn’t fit well with Facebook's “move fast” development philosophy.
AKA Backend for Frontend anti-pattern, and Object/Relational Impedance Mismatch.
I've spoken about the backend-for-frontend/ORM anti pattern, that essay is very much a WIP, here are some slides about BFF and ORM for now until that essay is finished.

2007 - "Objects and Associations" datastore

The API was an immediate success, with several high-profile features, such as likes, pages, and events implemented entirely on objects and associations, with no direct memcache or MySQL calls.
As adoption of the new API grew, several limitations of the client-side implementation became apparent. First, small incremental updates to a list of edges required invalidation of the entire item that stored the list in cache, reducing hit rate. Second, requests operating on a list of edges had to always transfer the entire list from memcache servers over to the web servers, even if the final result contained only a few edges or was empty. This wasted network bandwidth and CPU cycles. Third, cache consistency was difficult to maintain. Finally, avoiding thundering herds in a purely client-side implementation required a form of distributed coordination

2009: TAO "The Associations and Objects"

In early 2009, a team of Facebook infrastructure engineers started to work on TAO (“The Associations and Objects”). TAO has now been in production for several years. It runs on a large collection of geographically distributed server clusters. TAO serves thousands of data types and handles over a billion read requests and millions of write requests every second.
The following headers are eerily similar to Datomic

Inverse associations are automatic

For every association type a so-called inverse type can be specified ... The intent is to help the application programmer maintain referential integrity for relationships that are naturally mutual, like friendship, or where support for graph traversal in both directions is performance critical, as for example in “likes” and “liked by."
Datomic does this too

CRUD operators

The set of operations on objects is of the fairly common create / set-fields / get / delete variety.

Additive schema

All objects of a given type have the same set of fields. New fields can be registered for an object type at any time and existing fields can be marked deprecated by editing that type’s schema. In most cases product engineers can change the schemas of their types without any operational work.
Same as Neo4J and Datomic, all graph dbs have additive schema, which means schema changes NEVER break existing code, it will always continue to work, you can essentially do it live.

Read classes: Point, Range, Count

Point queries look up specific associations identified by their (id1, type, id2) triplets
Range queries find outgoing associations given an (id1, type) pair. Associations are ordered by time
Count queries give the total number of outgoing associations for an (id1, type) pair. TAO optionally keeps track of counts as association lists grow and shrink, and can report them in constant time
Datomic query (datalog with rules and entity navigation) is quite a bit more expressive than this, but it's all implemented in the peer. Datomic Peer's interface to Storage is much simpler.

Code/data locality

We have kept the TAO API simple on purpose. For instance, it does not offer any operations for complex traversals or pattern matching on the graph. Executing such queries while responding to a user request is almost always a suboptimal design decision. TAO does not offer a server-side set intersection primitive. Instead we provide a client library function. The lack of clustering in the data set virtually guarantees that having the client orchestrate the intersection through a sequence of simple point and range queries on associations will require about the same amount of network bandwidth and processing power as doing such intersections entirely on the server side. The simplicity of TAO API helps product engineers find an optimal division of labor between application servers, data store servers, and the network connecting them.
Datomic does exactly this but with strong consistency

Query/Storage/Writer separation

The TAO service runs across a collection of server clusters geographically distributed and organized logically as a tree. Separate clusters are used for storing objects and associations persistently, and for caching them in RAM and Flash memory. This separation allows us to scale different types of clusters independently and to make efficient use of the server hardware.

Storage (MySQL based triple store?)

We continue to use MySQL to manage persistent storage for TAO objects and associations.
Point queries look up specific associations identified by their (id1, type, id2) triplets.
Unclear if TAO actually stores triples natively in MySQL or something isomorphic with triples?
Michael Gaare describes problems with triple stores. We immediately know that triple stores are eventually consistent. More on this later.

Eventual consistent cache coordination between clusters

Client requests are always sent to caching clusters running TAO servers. In addition to satisfying most read requests from a write-through cache, TAO servers orchestrate the execution of writes and maintain cache consistency among all TAO clusters.
This implies TAO is eventually consistent by default, with an API to become strongly consistent.

Sharding (Partitioning)

The data set managed by TAO is partitioned into hundreds of thousands of shards. All objects and associations in the same shard are stored persistently in the same MySQL database, and are cached on the same set of servers in each caching cluster. Individual objects and associations can optionally be assigned to specific shards at creation time. Controlling the degree of data collocation proved to be an important optimization technique for reducing communication overhead and avoiding hot spots.

Tuning query performance

Shards can be migrated or cloned among servers in the same cluster to equalize the load and to smooth out load spikes. Load spikes are common and happen when a handful of objects or associations become extremely popular as they appear in the News Feeds of tens of millions of users at the same time.
Just like Datomic Peer, and Datomic Cloud task-specific query groups

Queries are served from two layers of peers

There are two tiers of caching clusters in each geographical region. Clients talk to the first tier, called followers. If a cache miss occurs on the follower, the follower attempts to fill its cache from a second tier, called a leader. Leaders talk directly to a MySQL cluster in that region.
Looks like Facebook's Mysql stores triples and is thus eventually consistent.
Datomic Pro is similar - Peer layer -> storage cache -> storage. Datomic storage is 5-tuples with a time dimension, so storage segments are immutable and strongly consistent

Eventually consistent

We chose eventual consistency as the default consistency model for TAO. Our choice was driven by both performance considerations and the inescapable consequences of CAP theorem for practical distributed systems, where machine failures and network partitioning (even within the data center) are a virtual certainty. For many of our products, TAO losing consistency is a lesser evil than losing availability. TAO tries hard to guarantee with high probability that users always see their own updates. For the few use cases requiring strong consistency, TAO clients may override the default policy at the expense of higher processing cost and potential loss of availability.
Facebook got this wrong. Datomic solves this the right way by adding a time dimension to the data.
Datomic first ideas put to paper in 2007, spoke Rich Hickey at Conj '17. So Datomic and TAO started at exactly the same time.
To understand how Datomic actually achieves the distributed caches being strongly consistent and also fast, we need to dive into Datomic's storage implementation details. It's a 3 level tree and storage is partitioned. In Datomic Pro the partitions are user defined. In Datomic Cloud, partitions are determined automatically via :identity :unique (lookup refs), I think. Here is a conversation about Datomic storage tradeoffs.

Single-master per shard (per-shard ACID?)

We run TAO as single-master per shard and rely on MySQL replication to propagate updates from the region where the shard is mastered to all other regions (slave regions). A slave cannot update the shard in its regional persistent store. It forwards all writes to the shard’s master region.
I think this implies TAO has per-shard ACID.
Datomic also has per-database ACID. Databases are limited to roughly 10 billion datoms, consequences of this limit matters at scale. Datomic does multiple-database joins first-class. I bet Facebook can too as it's a triple store.
Does shard basically equals Datomic Database? Can you have hundreds of thousands of Datomic Database? I say yes! It's the ACID boundaries that matter. Can your write domains be isolated?
Most transactions don't need ACID across the entire world, most transactions are only ACID in certain domains and unrelated domains never need to be considered. For example a write to Photos and a write to Events do not need to be in the same ACID boundary.
So I think Datomic single-transactor-per-database equals TAO single-master-per-shard.
If necessary, the mastership can be switched to another region at any time. This is an automated procedure that is commonly used for restoring availability when a hardware failure brings down a MySQL instance.
Datomic Cloud does hot transactor failover per database (= shard), but not across regions per my understanding. Is there a path forward to cross region replication in Datomic Cloud?
Datomic Pro I think cannot transparently fail across regions. For that to be possible, the replicated storages would need to act as a single ACID system, so the notion of "region" wrt storage gets fuzzy. I dont think that's how AWS regions are meant to work, and critically DynamoDB.
ANATOLY POLINSKY: ... requirement is to write to both data centers (based on geography), but read from any data center with the expectation of not doing two reads (if it is not on this data center, read from that one). + having data replicated. What would be a way to achieve that? (Since Datomic can only write into a single place)
STUART HALLOWAY: This usage is not supported, and cannot be made to work.
Datomic Pro's hot transactor failover is also not transparent - there's a heartbeat interval and heartbeats need to be missed a couple ticks in a row, so there is a pause.
In Datomic Cloud, can parallel Datomic Transactor service different partitions at the same time?
Wrt reads, I think Datomic Pro can do cross-region reads: Datomic | Cross region transactors and DynamoDB integration

read-after-write consistency

(Users want to see that their recent writes worked)
All TAO writes go through followers to leaders. Caches are updated as the reply to a successful write propagates back down the chain of clusters.
The write-through design of cache simplifies maintaining read-after-write consistency for writes that are made in a slave region for the affected shard.
Same as Datomic Transactor which streams back writes to the peers, which if same datacenter is on order of 1-10ms. So you have to get your write sharding correct to make this fast, obviously.

More Reading

Here is the USENIX ATC '13 paper linked from the TAO blog post, I haven't read it yet.