Christopher Small's Datsys (2017)

Update from July 2018, and action items:

On Tue, Jul 3, 2018 at 2:16 AM Christopher Small wrote:

Hi Dustin, I'm fine with you sharing this, though it doesn't read as clearly as I might like.
For what it's worth, some steps have been taken in the direction of our own Datalog implementation. You may have seen the https://github.com/replikativ/datahike project, a fork of DataScript that features durable hitchhiker-tree storage. Unfortunately, it is JVM-only for now, because it assumes blocking reads, so in order to port it to cljs, much of the code will have to be rewritten with go-blocks. In theory this may not be too difficult (just tedious and time-consuming), but it's hard to muster the energy when the effort could instead go towards a fully incremental/differential Datalog, which would solve the performance issues properly without resorting to hacks like posh.
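To illustrate the kind of rewrite involved, here is a minimal sketch (the function names are invented): a blocking read works on the JVM, while a cljs-compatible version must park inside a go-block, which forces every caller to become async as well.

```clojure
(require '[clojure.core.async :as async :refer [go <!]])

;; Hypothetical storage read: read-fn takes an address and returns a
;; channel that will eventually deliver the stored node.

;; JVM-only: <!! blocks the calling thread until the read completes.
(defn fetch-node-blocking [read-fn address]
  (async/<!! (read-fn address)))

;; Portable to cljs: parks inside a go-block instead of blocking, but
;; now returns a channel, so every caller must be rewritten to be
;; async too -- hence "much of the code" being affected.
(defn fetch-node-async [read-fn address]
  (go (<! (read-fn address))))
```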
I should let you know that I'm taking a step back from this project, as other endeavors are demanding more time than I can spare. But Brian Marco (ccd) is going to be taking the reins and leading the datsys project forward. He's already made some great progress implementing his onyx-sim project, which is serving as the basis of our work couching datsys within the onyx paradigm. The goal is for both datsync (at least implementationally) and datview to take advantage of this work soon. I'm going to continue to advise, but Brian has had his head in this work far more than I have for a little while now, so if you're looking to chat more, I'd suggest you ping him :-)

On Tue, Jul 3, 2018 at 11:23 AM, Dustin Getz wrote:

Nice to meet you, Brian. Chris: okay, thanks. I may take a whack at cleaning it up (and removing the hyperfiddle bits) if you are okay with me refactoring your words. This might be worth it if I put it on reddit. Is media attention something you and Brian would want, or would you prefer to stay stealth for now?

On Tue, Jul 3, 2018 at 3:17 PM Christopher Small wrote:

Ya know... skimming through this again, I realize that in rereading the exchange I allowed the second half to overly color the initial bent of the conversation. There are details and framings that have evolved regarding the datsys stuff, but the first part of the discussion, around RDF, is actually pretty solid (I think; I could use some feedback on this, Brian).
(BTW, if you look at the docs and rationale for clojure.spec, you'll see RH directly cites RDF as prior art, so you can add that piece of evidence to my arguments.)
Now is not the best time for media or other attention on Datsys, as Brian is working on some things that are very close to done, and ideally all of that would be in place by the time interested minds and fingers come knocking. If you have the energy for some refactoring now, though, that could be super useful as a way of kickstarting better documentation around the project, and I'd be fine with you cross-publishing this as a Q&A.
I can even see this being cut apart into two conversations. The first, about RDF, could be published without much amendment. The second, about datsys specifically, could be edited into something that does a good job of reflecting the current status and future direction. Maybe we could collaboratively edit that in a Google doc or something? Or does hyperfiddle have similar collaborative features?
This is all my two cents; ultimately, I leave it up to Brian what we publish and when.
Brian - What do you think?
Dustin - Thanks so much for checking through with all this and offering your time.

Original Conversation from July 2017

---------- Forwarded message ---------
From: Dustin Getz
To: Christopher Small
Subject: off list Re: [ClojureScript] Re: How do you use GraphQL with Clojurescript?

On Sun, Jul 2, 2017 at 7:10 PM Dustin Getz wrote:

Hi Chris
The key is that as long as your attributes are globally meaningful, you can merge information about an entity without ever worrying about overwriting other information about that entity. This gives us extensibility. In Clojure and Datomic, these globally meaningful attributes are achieved via namespaced keywords, which is both brilliant and simple.
I've never heard anyone say this about namespaces before. Does this mean that when you use Datomic you essentially treat the namespace as a domain, e.g. com.dustingetz? Have you been thinking about the global semantic nature of Datomic, and how to merge a bunch of databases together?
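To make the quoted claim concrete, here is a minimal sketch (the attribute names are invented): facts about the same entity from two different sources merge without key collisions, because each attribute carries its namespace.

```clojure
;; Two sources each contribute facts about entity 42 under their own
;; namespaced attributes, so a plain merge can never clobber anything.
(def from-crm     {:com.dustingetz.crm/email "alice@example.com"})
(def from-billing {:com.dustingetz.billing/plan :pro})

(merge {:db/id 42} from-crm from-billing)
;; => {:db/id 42
;;     :com.dustingetz.crm/email "alice@example.com"
;;     :com.dustingetz.billing/plan :pro}
```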
I am very familiar with your research in datsys as described below, but did not realize you were drawing comparisons to the semantic web. This is something I have been thinking about a lot in the last year in my own research.

On Mon, Jul 3, 2017 at 11:38 AM Christopher Small wrote:

Yes! I am drawing comparisons! And I have been thinking about merging databases! I'm familiar with your work on hypercrud as well. Cool to see that you've also been thinking about the semantic web in relation to all this :-) Great minds think alike? :-P
I don't typically literally use full domain names for my keyword namespaces (in Datomic or in Clojure generally). But we generally namespace our keywords according to Clojure namespaces in our Clojure programs, so at least within the scope of Clojure programs we get more or less the same effect. Of course, sometimes folks use imprecise namespaces, like :person/name, but this is generally discouraged in favor of something like :myappns.person/name. And while I don't, I've worked on projects where everything is nested under a top-level org or com namespace, giving :org.myappns.person/name and leading to a much more direct comparison with RDF namespacing. It's nice, though, that Datomic/Clojure give us this opt-in flexibility. Feeling like you have to register a domain name just to use RDF is part of why RDF hasn't "caught on" more generally, IMHO.
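For reference, a quick sketch of how those keyword forms decompose, with a rough (invented) RDF analogue:

```clojure
;; The namespace is part of the keyword itself, so the "global
;; attribute" travels with every fact.
(namespace :org.myappns.person/name) ;=> "org.myappns.person"
(name      :org.myappns.person/name) ;=> "name"

;; Rough RDF analogue (not a real vocabulary, just for comparison):
;;   <http://myappns.org/person#name>
```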
I seem to recall hearing RH talk about this connection, but I could be imagining it. It's certainly not widely advertised. Which perhaps is the magic sauce: synthesize and simplify a bunch of great ideas in need of polish. I'll let you know if I'm able to come up with a reference though.
How is your project going?
Side thought: Have you gotten into ontologies at all? There seems to be a lot of overlap with the ideas I've had in Datview, and I've been wanting to dig deeper there.

On Wed, Jul 5, 2017 at 3:33 PM Dustin Getz wrote:

In this screenshot, Blue and Green are separate databases merged in one query. Presently there is a central attribute registry which ensures all databases get the same attributes. I envision, though, that a database would link to any attribute sets it depends on, and then the constraint is that any attributes used in the query are equal in both databases. There could also perhaps be an ontology mapping function supplied. You may have thought more about this than I have?
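A cross-database query along those lines might look like this sketch (entities and attributes invented), assuming both databases agree on the attributes the query uses:

```clojure
(require '[datascript.core :as d])

(def blue  (d/db-with (d/empty-db)
                      [{:db/id -1
                        :person/email "alice@example.com"
                        :person/name  "Alice"}]))
(def green (d/db-with (d/empty-db)
                      [{:db/id -1
                        :person/email "alice@example.com"
                        :person/age   30}]))

;; Join across the two databases on the shared attribute.
(d/q '[:find ?name ?age
       :in $blue $green
       :where [$blue  ?b :person/email ?e]
              [$blue  ?b :person/name  ?name]
              [$green ?g :person/email ?e]
              [$green ?g :person/age   ?age]]
     blue green)
;; => #{["Alice" 30]}
```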
I remember an old post where Rich mentioned the connection to the semantic web, and how they made a deliberate decision about being distributed or not. Is that the one you're referring to? I don't think the way they did it is a barrier. I know a lot of the distributed web/blockchain tech focuses on having "one transactor", but I don't think that's essential. I wonder if the distributed web needs to mirror the way humans want to organize. Companies want to control the transactor because they are responsible for keeping the database in sync with their code, evolving the code and database in lockstep over time. So Rich's decision makes sense: it makes a lot of sense for each organization to want to control writes by owning the transactor, and it is a huge uphill battle to convince them otherwise. The read side can still be federated. Have you thought about this? Bitcoin has some inertia with all the currency speculation, but how many actual companies build actual apps backed by blockchain? Approximately zero, right?
The system in the screenshot just uses straightforward Datomic, but it is really easy to imagine it being better when backed by datsys. I started out thinking about data sync similarly to you, but it was easier to make progress on the composable UI problem by just ignoring the data sync problem, and I figured all along that if you guys solved some of those problems, we'd figure out a way to collaborate?

On Mon, Jul 17, 2017 at 10:54 AM Christopher Small wrote:

I have thought a bit about merging databases. In fact, we've decided Datsys is going to go the direction of CRDTs for distribution, which will let us do both P2P and more centralized distribution. And really, the only real key to building a CRDT is defining a commutative, associative, and idempotent merge function. The sticky bits then are figuring out what to do, deterministically, with cardinality-one, uniqueness, and identity conflicts when merging.
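A toy sketch of that merge-function idea, with a fact representation invented purely for illustration: set union gives the commutative, associative, and idempotent merge, and cardinality-one conflicts are then resolved deterministically by last-writer-wins on a [timestamp actor] pair.

```clojure
(require '[clojure.set :as set])

;; The merge itself: set union over fact maps.
(defn merge-facts [a b]
  (set/union a b))

(defn resolve-card-one
  "For each [e a] pair, deterministically keep only the fact with the
   greatest [t actor] pair (last-writer-wins, actor id breaks ties)."
  [facts]
  (->> facts
       (group-by (juxt :e :a))
       vals
       (map #(last (sort-by (juxt :t :actor) %)))
       set))

(def node-1 #{{:e 1 :a :user/name :v "Ann"  :t 5 :actor "n1"}})
(def node-2 #{{:e 1 :a :user/name :v "Anna" :t 7 :actor "n2"}})

(resolve-card-one (merge-facts node-1 node-2))
;; => #{{:e 1 :a :user/name :v "Anna" :t 7 :actor "n2"}}
```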
Blockchain is a little overhyped, in my opinion. Blockchains seem to have their place, but for a lot of apps CRDTs, perhaps with end-to-end encryption, are quite a bit simpler and more flexible. The only thing blockchains really seem to add is distributed trust, and there's a cost to that.

On Wed, Jul 19, 2017 at 6:20 PM Dustin Getz wrote:

I think my understanding of what you are building has fallen out of date.
I understand your 2016 stuff: Datsync tries to sync Datomic into DataScript. Datsync is a difficult problem without reversible queries, as outlined in The Web After Tomorrow. I believe you were exploring heuristics. Did you come to a conclusion about whether this was a tractable approach? On top of Datsync was Datview, which explores the ultimate composable UI, made possible because all data is local.
Since 2017 you are exploring CRDTs, so not using Datomic at all? DataScript's low-level state is in a CRDT, so you can sync many DataScript browser peers and bypass Datomic entirely? And then posh subscribes to DataScript, and the composable UI follows from there. So the work this summer is figuring out exactly how to map DataScript to a CRDT, figuring out the merge semantics. Is this an open question, or do you know the answer and are building it?
Is there a good place, a recent blog post or readme that I've missed, to help me catch up? Is the best place the datsys gitter archive link? And is there a better place than email for high-level questions, or should I just ask in the chat?

On Fri, Jul 28, 2017 at 2:50 PM Christopher Small wrote:

Hi Dustin
We're still on mission to sync Datomic and DataScript. However, we're looking at CRDTs for addressing a consistent shared world view between such nodes. It is rather interesting that you could end up bypassing Datomic and/or DataScript entirely, and maybe folks will want to do that. But I foresee that a lot will want to continue using a pattern along the lines of a central Datomic database kept in sync with some number of DataScript (or DataScript-esque Datsync) databases. In this case, the relationship between Datomic and the CRDT would be a bit special, in that if there were data in the CRDT inconsistent with the Datomic schema, Datomic would have the authority to resolve such conflicts.
We did look at heuristics with Posh, but unfortunately Posh is limited in the kinds of Datalog queries it can pattern match against to determine which queries need to be updated. For example, it does not cover Datalog rules, and it may also not handle relational cycles appropriately. To get the rest of the way there, you'd more or less end up reimplementing a lot of the logic of a Datalog engine. But if you only go halfway, you still end up having to do a full recompute at the end of the day. So our current thought is that instead of going halfway, we should just build an incrementally maintained Datalog from the ground up. There has been some really interesting research on this of late, including some really promising work out of Microsoft Research in a paper about Differential Dataflow. There are still a lot of details to work out, but this direction looks promising.
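For a sense of what "incrementally maintained" means in the simplest possible case, here is a toy sketch (not Datsys code): a cached result set for a single [?e attr ?v] pattern, updated from transaction deltas rather than recomputed from scratch. Joins and recursive rules are where this gets genuinely hard, which is what the differential dataflow work addresses.

```clojure
(defn apply-delta
  "results: set of [e v] pairs currently matching attribute attr.
   delta:   seq of [e a v added?] datom-like tuples from a transaction.
   Returns the updated result set without re-running the query."
  [attr results delta]
  (reduce (fn [acc [e a v added?]]
            (if (= a attr)
              (if added? (conj acc [e v]) (disj acc [e v]))
              acc))
          results
          delta))

(apply-delta :person/name
             #{[1 "Alice"]}
             [[2 :person/name "Bob"   true]
              [1 :person/name "Alice" false]])
;; => #{[2 "Bob"]}
```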
The datsys and datsync gitter chat rooms are probably the best place to stay up to date for now. We'll probably be putting up a blog and more documentation when things stabilize a bit.

On Fri, Jul 28, 2017 at 3:30 PM Dustin Getz wrote:

"However, we're looking at CRDTs for addressing consistent shared world view between such nodes ... the relationship between the Datomic and the CRDT would be a bit special in that if there was data in the CRDT that was inconsistent with the Datomic schema, Datomic would have the authority to resolve such conflicts"
I don't understand this bit. Are you saying DataScript is backed by a CRDT, and a sync service near the transactor is watching the CRDT and transacting updates into Datomic?

On Fri, Jul 28, 2017 at 4:03 PM Christopher Small wrote:

More or less. Except that DataScript isn't presently well suited to having a CRDT as backing storage, particularly durable storage, since on the client this would force the entire API to be async, a change Tonsky seems to have rejected in conversations. This is the other part of why we've looked at building our own incrementally maintained, eventually consistent Datalog and pull evaluation engine. It's really the only way to get client-side storage where you might not be able to fit all of the data in memory.

On Sat, Jul 29, 2017 at 4:53 PM Dustin Getz wrote:

Okay, you originally wrote "we're also considering implementing a reactive, incrementally maintained, eventually consistent Datalog engine"
So you are thinking about rebuilding DataScript in a way that is compatible with CRDT backing storage? Are there parts of DataScript that can be reused? I would think not?
"This is the other part of why we've looked at building our own ... eventually consistent Datalog"
What is the first part?
You should totally copy paste this conversation into a blog post!

On Sun, Jul 30, 2017 at 4:22 AM Christopher Small wrote:

I know... I've been meaning to write a blog post. It's actually been really helpful chatting with you to run through the ideas. If I can convince you I'm not crazy, maybe that will push me over the hump to put things in writing and start getting broader feedback. So please tear me apart in private, lest I make a fool of myself in public!
So... the other crazy thing we're thinking, and actually have fairly under way, is using onyx as a dataflow language. This is going to be core to the entire datsys design. Think re-frame/reagent, but instead of being restricted to the semantics of reactions, you have the entire expressive power of onyx for describing stream processing flows. The nice thing is that onyx completely separates the runtime from the description of the topology/data-flow, so we can run in the browser or on a cluster. We'll probably have a reagent/re-frame-esque DSL that makes people feel at home, but you'll be able to peek under the hood or extend in the full language of onyx.

There are some potentially cool benefits to this. For one, CRDTs have a garbage collection problem: you have to accumulate all this metadata about the state of the world in order to have enough information to resolve conflicts. Onyx has a notion of windows that you could use to enforce a logical constraint that metadata only be kept up until some point in time. The consequence would be that every client has to check in every so often to ensure they can sync without having to resolve conflicts, which seems like a decent price to pay, and onyx windows give us a nice way of treating this. Just an example, but hopefully it paints the picture of how the well-thought-out flexibility and functionality of onyx could come in handy.
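For flavor, this is roughly what an onyx window declaration looks like (the values here are invented; see the onyx windowing documentation for the real options). The idea would be to bound how long CRDT conflict-resolution metadata is retained:

```clojure
;; Hypothetical window over a hypothetical :merge-crdt-state task:
;; metadata older than the window range can be discarded, so clients
;; must check in within that horizon to merge conflict-free.
(def gc-window
  {:window/id          :crdt-metadata-gc
   :window/task        :merge-crdt-state
   :window/type        :fixed
   :window/window-key  :event-time
   :window/aggregation :onyx.windowing.aggregation/conj
   :window/range       [7 :days]})
```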
To your question "What is the first part?": my point was that needing asynchrony through and through, in order to have durable data that doesn't fit in memory, is one concern. The other is being able to efficiently update queries or the "state of the world" (especially with respect to "what data is currently checked out"). And that's where this incremental maintenance (and differential dataflow) stuff really shines.
To a certain extent, I think the view that "we should figure out, for any query q, what query x would give us all the datoms we'd need to execute q" is a bit of a red herring. In my mind, a scope description is a Datalog query for an entity ?e, plus some pull expression of datoms and nested graph relationships for the entities ?e satisfying the query. Well... and a union of some number of such sets of datoms. If you can do that as a CRDT, where changes propagate both ways, you're more or less set.
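Concretely, a scope description along those lines might look like this sketch (schema and attributes invented): a Datalog query selects the entities, and a pull expression describes the graph of datoms to sync for each.

```clojure
(require '[datascript.core :as d])

(defn sync-scope
  "Query picks the entities ?e; the pull pattern describes which datoms
   and nested relationships to sync for each of them."
  [db]
  (let [eids (d/q '[:find [?e ...]
                    :where [?e :todo/list :work]]
                  db)]
    (d/pull-many db
                 '[:todo/title {:todo/owner [:person/name]}]
                 eids)))

(def schema {:todo/owner {:db/valueType :db.type/ref}})
(def db (d/db-with (d/empty-db schema)
                   [{:todo/title "Ship it"
                     :todo/list  :work
                     :todo/owner {:person/name "Brian"}}]))

(sync-scope db)
;; => [{:todo/title "Ship it", :todo/owner {:person/name "Brian"}}]
```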
And yes, the idea is that the CRDT itself is the backing storage for the results of this query, but it can also be queried itself. And it is also reactive, can live on disk, etc. This opens up a can of worms to be sorted through, to be sure. But we've been working with Christian Weilbach, who has been doing a ton of work with CRDTs and related ideas, and it feels like we have a path forward.
I think you're right that most of the DataScript code would probably not be reusable. The differential dataflow stuff, and some related stuff we're looking at (still seeing where the cards fall here), end up looking very different computationally.
PS It's late so please tell me if anything needs clarification :-P

On Mon, Jul 31, 2017 at 9:35 AM Dustin Getz wrote:

This is great, and it is clear enough. A lot of this content is new to me; it will take me some time to process it properly. I agree about writing things out, though: talking to a person is 10x easier than staring at a blank piece of paper.
Can you state again, at a high level, what your overarching goal is? Is it still composable UI, or is it something else now? It is still not clear to me what the following statement means exactly.
"Datsync, a system for syncing Datomic/DataScript data between nodes in a network. Here, our goal is to extend the scope of what om-next describes by automating the synchronization of EAV triples given that this is what your data looks like ... the audacious goal of Datsync is to more or less one-up GraphQL by having the full expressiveness of Datalog+pull for the description of synchronization scope, together with the flexibility of RDF data for data modelling, and almost 0 hook-up"

On Tue, Aug 8, 2017 at 7:54 PM Christopher Small wrote:

The ultimate goal of datsys is to simplify and empower web programming around RDF/Datomic-style data.
To this end, datsys aims to be the data communication layer. In its current incarnation, this is between Datomic and DataScript. En route, though, with CRDTs and differential evaluation for efficient sync-scope description, we'll end up with something that can function as a standalone system (that is, without either Datomic or DataScript), and this will confer benefits for reactive flows.
GraphQL is more or less isomorphic to the pull expression, and if we can express both it and Datalog in a differential/incremental fashion, then we'll offer a lot more expressiveness.
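To illustrate the isomorphism (field and attribute names invented):

```clojure
;; GraphQL selection:
;;   { person { name friends { name } } }
;;
;; The pull-expression analogue:
(def person-pattern
  [:person/name {:person/friends [:person/name]}])

;; Both describe a recursive selection over a graph of entities;
;; Datalog then adds arbitrary relational queries on top, which
;; GraphQL has no equivalent for.
```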