Code/Data Locality, and the Databases of the 2020s

Interesting bits of a conversation on HN: https://news.ycombinator.com/item?id=15672027
BBOREHAM: I wonder if the 2020s column store would outperform kdb, which was written in the 1990s with a UI from the 1950s.
GEOCAR: Unlikely. Two reasons are at the top of my mind:
1. The current best efforts in benchmarking focus on queries that "look" similar, and yet kdb is still 400x faster than Hadoop for those queries. For example:
select avg size by sym,time.hh from trade where date=d,sym in S
SELECT sym,HOUR(time),AVG(size) FROM trade NATURAL JOIN s WHERE date=d GROUP BY sym,HOUR(time);
To answer this question, the database has to read two or three columns across ten billion rows -- it's hard to be much faster than kdb: the 10-billion-row query completes in 70 msec on kdb, while Hadoop takes something like 30 seconds.
The 2020s column store has to do a lot of work to even match kdb, but assuming it does that, and even ekes out a few extra percent of performance on these queries, there's another issue:
2. Most kdb programmers don't write this way.
Sure, some write their application in Java, send these queries over to the kdb "server", get the results, and do stuff with the results, etc., just like the application programmers that use Hadoop, but most kdb programmers don't. They just write their application in kdb.
That means that there isn't an extra second or two delay while this chunky result set is sent over to another process. — geocar @ HN
UDFs/stored procedures/foreign procedures are the rest of the world's solution to this problem, and they are massively under-utilised: tooling like version control and testing of stored procedures just doesn't work as well, and I don't see any sign that that's going to change in the next decade or so.
DUSTIN GETZ: Yes! "Code/data locality" is key for real apps. Everyone has seen "numbers every programmer should know": https://gist.github.com/jboner/2841832. If you're going to do complex data analysis, e.g. machine learning, you want your data-access latency to be on the short end of this chart :) When you end up on the long end (like in an RDBMS), this is known as the N+1 problem.
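A minimal sketch of the N+1 shape in Clojure; the orders map and the fetch-* functions are made-up stand-ins for calls that would each be a network round trip in a real system:

    ;; Made-up in-memory stand-ins: in a real app each fetch-* call below
    ;; would be a network round trip to the database.
    (def orders {1 {:total 10} 2 {:total 25} 3 {:total 40}})
    (defn fetch-order-ids [customer-id] [1 2 3]) ; pretend: 1 round trip
    (defn fetch-order [id] (get orders id))      ; pretend: 1 round trip each

    ;; The N+1 shape: one query for the ids, then one more query per id,
    ;; i.e. (1 + N) round trips of latency.
    (defn order-totals [customer-id]
      (mapv #(:total (fetch-order %))
            (fetch-order-ids customer-id)))

    (order-totals 42) ;; => [10 25 40], paid for with 1 + 3 round trips

With code/data locality the same logic runs next to the data, so those round trips collapse into local lookups.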
But modern data sizes don't fit into memory. Distributed systems necessarily add latency, and to fix that we add caching, which hurts consistency. I blew up this thread further down about how Datomic's core idea is to provide consistent data to your code, which is the opposite of how most DBMSs (including kdb) make you bring your code into the database.
DUSTIN GETZ: The datastore of the 2020s will be designed around an immutable log, because an immutable log permits both strong consistency and horizontal scaling (like git).
Once you're both distributed and consistent, the problems today's stores are architected around go away. Your distributed queries can index the immutable log however they like: column-oriented, row-oriented, documents, time-oriented, graphs. Immutability means you can do all of it, as a library in your application process.
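A toy sketch of that claim in plain Clojure (no database involved): one append-only log of [entity attribute value t] facts, from which an application process derives whatever orientations it needs.

    ;; An immutable log of facts: [entity attribute value t].
    (def log
      [[:trade-1 :sym  "AAPL" 1]
       [:trade-1 :size 100    1]
       [:trade-2 :sym  "MSFT" 2]
       [:trade-2 :size 250    2]])

    ;; Row-oriented view: everything about an entity.
    (def by-entity (group-by first log))

    ;; Column-oriented view: every value of an attribute.
    (def by-attribute (group-by second log))

    ;; Time-oriented view: what was asserted at each t.
    (def by-t (group-by #(nth % 3) log))

    (by-attribute :size)
    ;; => [[:trade-1 :size 100 1] [:trade-2 :size 250 2]]

Because the log never changes, every derived view is just a pure function of it and can be rebuilt, cached, or replaced without coordination.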
http://www.datomic.com/ - it's what you get when Facebook's graph datastore has a baby with immutability.
SAMREIDHUGHES: You can't just say "immutable log" and then be done. You certainly don't want to have just one immutable log, because then unrelated operations, for example to different parts of a key space, have to "see" each other. If you go the route of Datomic, your writes can't outpace the one CPU that processes them. (Correct me if I'm wrong, I'm just reading its documentation.)
DUSTIN GETZ: In an RDBMS, when you shard, read shards and write shards are in lock-step, which is the whole problem with sharding. In Datomic (and in git), sharding writes doesn't really impact reads.
This is interesting; consider a large system like Facebook. Transactions naturally fall within certain boundaries. You never transact to Events, Photos, and Instagram all at once - from the write side, they don't have to share the same single-writer process delivering ACID.
You do, however, on the read side, need fluid queries across them all, as if they were one database. An RDBMS can't do that, but Datomic can, and Git can too - consider submodules. Immutability is what makes it possible to shard like this without sacrificing query expressiveness, strong consistency, or ACID (unlike every other distributed system that isn't accumulate-only).
MEJ10: I was under the impression, based on its docs, that Datomic only supports processing transactions serially through a single transactor.
DUSTIN GETZ: Both what you write and what I wrote are true.
To scale writes you shard writes. This generally means running multiple databases (in any DBMS).
The key insight is that Datomic can cross-query N databases as a first-class concept, like a triple store. You can write a sophisticated relational query against Events and Photos and Instagram, as if they are one database (when in fact they are not).
This works because Datomic reads are distributed. You can load some triples from Events, and some triples from Photos, and then once you've got all the triples together in the same place you can do your queries as if they are all the same database. (Except Datomic is a 5-store with a time dimension, not a triple store.)
In this way a Datomic system-of-systems can scale writes: you get regional ACID instead of global ACID, with little impact on the programming model of the read side, because reads were already distributed in the first place.
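A hedged sketch of what that cross-database read looks like with the Datomic peer API; the connection URIs, attribute names, and the email value are invented for illustration.

    (require '[datomic.api :as d])

    ;; Two independently written databases, each with its own transactor.
    (def events-db (d/db (d/connect "datomic:dev://localhost:4334/events")))
    (def photos-db (d/db (d/connect "datomic:dev://localhost:4334/photos")))

    ;; One Datalog query reads from both, joining on a shared value
    ;; (?user-email), as if they were a single database.
    (d/q '[:find ?event-name ?photo-url
           :in $events $photos ?user-email
           :where
           [$events ?u :user/email ?user-email]
           [$events ?e :event/attendee ?u]
           [$events ?e :event/name ?event-name]
           [$photos ?p :photo/owner-email ?user-email]
           [$photos ?p :photo/url ?photo-url]]
         events-db photos-db "alice@example.com")

The join runs in the peer (the application process), so neither write side has to know the other exists.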
For an example of the type of problems Datomic doesn't have, see the OP paragraph "To sort, or not to sort?" - Datomic maintains multiple indexes sorted in multiple ways (e.g. a column index, a row index, a value index). You don't have to choose. Without immutability, you have to choose.
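Concretely, the Datomic peer API exposes those sort orders directly; a sketch, assuming a database value db obtained elsewhere and a made-up attribute :trade/sym:

    (require '[datomic.api :as d])

    ;; db is a database value, e.g. (d/db conn).
    ;; :avet covers indexed attributes; :vaet covers reference attributes only.
    (defn index-views [db]
      {:row-like    (d/datoms db :eavt)             ; sorted by entity: "rows"
       :column-like (d/datoms db :aevt :trade/sym)  ; sorted by attribute: "columns"
       :value-like  (d/datoms db :avet :trade/sym)  ; sorted by value: lookups/ranges
       :reverse-ref (d/datoms db :vaet)})           ; reverse reference index

All four are maintained by the same transaction pipeline, so they always agree.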
MANIGANDHAM: All datastores already have WAL logging, which is effectively the same thing, and it is commonly used for replication, changefeeds, and other downstream consumers. Saving the entire history (with compaction) and some CQRS patterns is nothing new.
At any decent scale, most companies now just use a dedicated log like Kafka or Pulsar as the main backbone to support more flexibility in producers and consumers. Either way, none of this has to do with column-stores as the actual representation of data.
DUSTIN GETZ: It's definitely not new, but it is innovative. Kafka can totally be an implementation detail of a system like what we are discussing. Once you're immutable, you're no longer constrained to a single "actual representation of data"; you can maintain many in parallel, so long as there is a way to keep the representations consistent (that time dimension is really important!).
CQRS has the right fundamental constituents but puts them together in the wrong way, I think. The command abstraction is in the application layer (we're talking about stores, not apps) and the properties of the read side are fixed (e.g. decisions about document-, column-, or row-orientation are coded in advance). But those same parts can be used to make something more flexible, something that lets the properties of the read side be less fixed.
Datomic maintains multiple built-in indexes to support several query styles (so multiple parallel "representations of data"; see http://docs.datomic.com/indexes.html): it has native support for querying in the shape of documents, rows, columns, time, and values. The storage is actually represented in all of those ways, in parallel copies (and yet the time dimension keeps everything linearized and strongly consistent, like git!).
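The time dimension is directly accessible in the same API; a sketch, assuming a live connection conn and some past basis point t:

    (require '[datomic.api :as d])

    (defn views-of-time [conn t]
      (let [db (d/db conn)]
        {:now     db                ; the latest consistent basis
         :as-of   (d/as-of db t)    ; the whole database as it was at t
         :history (d/history db)})) ; every assertion and retraction ever made

Every index above can be consulted against any of these views, which is what keeps the parallel representations pinned to the same basis t.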
More interesting than the built-in indexes, though, is that you can conceptually implement your own index: immutability lets you distribute/cache the data in your application processes, and the query engine is actually a library running in your query process. (Datomic Datalog is literally a jar file running in your elastic app processes, and it works on vanilla JVM data structures.)
This is called "code/data locality" and it's extremely powerful. You don't need to fork the database codebase to add a new type of index, like people had to do to add a geospatial index to a fork of Postgres. You can incrementally maintain your own indexes. You can ignore Datalog and implement your own query functions against your special index. Or you can seamlessly compose your own query functions inside Datalog queries, and you can pass your index as an input to a Datalog query. Here's a snippet of what that looks like: https://i.imgur.com/GJuTkJR.png
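In that spirit, a hedged sketch of the idea: a hand-maintained "index" is just a plain Clojure collection, passed to Datomic's Datalog alongside an ordinary predicate function. The geo-index data and near-nyc? are invented for illustration, and near-nyc? is assumed to be defined in the user namespace.

    (require '[datomic.api :as d])

    ;; A hand-rolled "index": an ordinary vector of [place lat lon] tuples,
    ;; maintained incrementally by the application however it likes.
    (def geo-index
      [["cafe-a" 40.74 -73.99]
       ["cafe-b" 51.51  -0.12]])

    ;; An ordinary Clojure predicate, composed inside the query below.
    (defn near-nyc? [lat lon]
      (and (< 40.0 lat 41.0) (< -75.0 lon -73.0)))

    ;; The query engine runs in-process over vanilla data structures,
    ;; so the custom index is just another :in source.
    (d/q '[:find ?place
           :in $geo
           :where
           [$geo ?place ?lat ?lon]
           [(user/near-nyc? ?lat ?lon)]]
         geo-index)
    ;; => #{["cafe-a"]}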
NOAH DESU: What is the primary reason people choose Datomic? From reading the website, I get the impression that the append-only nature and time-travel features are a major selling point, but in other places it's just the Datalog and Clojure interfaces. I'm sure it's a mix, but what brings people to the system to begin with?
DUSTIN GETZ: Datomic is my default choice for any data processing software (which is almost everything).
Immutability in and of itself is not the selling point. Consider why everyone moved from CVS/SVN to Git/dvcs. The number of people who moved to git and then said "man, I really wish I could go back to SVN" is approximately zero. Immutability isn't why. Git is just better at everything; that's why. Immutability is the "how".
I don't see why I would use an RDBMS to store data ever again. It's not like I woke up one day and started architecting all my applications around time travel†. It's that there is a lot of accidental complexity inherent to RDBMS - ORMs, N+1 problems (batching vs caching), poorly scaling joins, pressure to denormalize to stay fast, eventual consistency at scale - and Datomic's pitch is that it makes all of these problems go away. Immutability is simply the how. Welcome to the 2020s.
† Actually I did; my startup is http://hyperfiddle.net/
ELVINYUNG: I think one interesting project in the near future could be to try and build a column-oriented storage engine that's "good enough" for both OLAP and OLTP workloads.
The main precedent here is Spanner's Ressi storage engine, which, according to the most recent paper [1], uses a PAX-like format (with blocks arranged row-oriented, but values within a block are column-oriented, so kind of like Parquet) for on-disk data, but combines it with a traditional log-structured merge tree for writes and point queries.
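Not Ressi itself, but a toy illustration of the PAX idea in Clojure: rows are grouped into blocks (row-oriented at the block level), while values inside each block are laid out per column.

    ;; Group rows into fixed-size blocks; within each block, store values
    ;; column-wise (one vector per column), roughly the PAX/Parquet layout.
    (defn rows->pax-blocks [rows block-size]
      (for [block (partition-all block-size rows)]
        (reduce (fn [cols row]
                  (reduce-kv (fn [cols k v] (update cols k (fnil conj []) v))
                             cols row))
                {} block)))

    (rows->pax-blocks
      [{:sym "A" :size 100} {:sym "B" :size 250} {:sym "A" :size 75}]
      2)
    ;; => ({:sym ["A" "B"], :size [100 250]} {:sym ["A"], :size [75]})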
REDDITADDICT: SAP HANA is an example of a system that fits into this category. This isn't new either; it has existed since the 90s. Sybase IQ (which SAP acquired) was the first commercially successful columnar database. They have an in-memory row engine to handle OLTP, while OLAP queries perform exceptionally well due to the column-oriented nature of the storage. Customer deployments are in the 100s of TBs and low PBs these days. It blows away most open-source software in terms of performance if you are willing to shell out the $. Source: I work at SAP.
ELVINYUNG: I don't mean a database frontend that can handle both OLTP and OLAP workloads, usually by having some kind of OLAP column-store and some kind of OLTP main-memory row-store. I know there are a lot of those (not only HANA, but also MemSQL, SQL Server, etc.).
The fun thing to try and imagine here is having literally the same physical data format that works for both kinds of workloads.
DUSTIN GETZ: You actually don't need the same storage layout if you use a time series as the starting point: you can maintain different data layouts in parallel, and the time dimension permits strong consistency across them all (a toy sketch follows below).
If this is what you mean by a "database frontend", I am confused as to why you object to it.
I think this property of time series is going to prove very important in the 2020s.
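A toy sketch of the claim above in plain Clojure (the shapes are invented): one time series with a monotonically increasing t, two derived layouts, and consistency defined as both layouts having applied the series up to the same basis t.

    ;; One append-only time series of facts.
    (def series
      [{:e :trade-1 :a :sym  :v "AAPL" :t 1}
       {:e :trade-1 :a :size :v 100    :t 1}
       {:e :trade-2 :a :sym  :v "MSFT" :t 2}])

    ;; Two parallel layouts, each a pure function of the series up to t,
    ;; stamped with the basis t it reflects.
    (defn row-layout [series t]
      {:basis-t t
       :rows    (group-by :e (filter #(<= (:t %) t) series))})

    (defn column-layout [series t]
      {:basis-t t
       :columns (group-by :a (filter #(<= (:t %) t) series))})

    ;; A read that touches both layouts picks one basis t,
    ;; so the two representations are guaranteed to agree.
    (let [t 2]
      [(row-layout series t) (column-layout series t)])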