Datomic solves the O/R impedance mismatch

I will first demonstrate the O/R impedance mismatch by implementing an object relational mapping for a simple object model, and how minor tweaks to the object model cause superlinear increase in code complexity. Then I will demonstrate how Datomic can implement the same object model without a superlinear growth in complexity.

demo of the O/R impedance mismatch caused by SQL

So we want to work with some domain objects when we write our code

case class ProjectInfo(name: String,
                       owner: String,
                       created: java.util.Date)

we can make lists of them, and modify them, and store references to them, and work with them in memory however we like:

val a = ProjectInfo("Project A", "Dustin", DateTime.now)
val b = ProjectInfo("Project B", "Jason",  DateTime.now)
val c = ProjectInfo("Project C", "Toni",   DateTime.now)
val projects = List(a, b, c)

“Object relational mapping (ORM) specifies how sets of java objects, including references between them, are mapped to rows and columns in database tables” (EJB3 251). This must take into account inheritance, polymorphism, database schema normalization, references between objects, one-to-many (uni- and bi-directional) relations, many-to-many relations….

The ProjectInfo O/R mapping is simple:

CREATE TABLE projects (
  owner VARCHAR NOT NULL,
  name  VARCHAR NOT NULL,
  created TIMESTAMP NOT NULL
);

def readAllProjects(dbconn: java.sql.Connection): Try[List[ProjectInfo]] = Try {

  val query = "SELECT name, owner, created FROM projects"

  val mapping =
    get[String]("projects.owner") ~
    get[String]("projects.name") ~
    get[java.util.Date]("projects.created") map {
      case owner~name~created =>
        ProjectInfo(name, owner, created)
    }

  SQL(query).as(mapping *)(dbconn)
}

lets add another simple model:

case class Skill(name: String, description: String)

CREATE TABLE skills (
  name VARCHAR NOT NULL,
  description VARCHAR NOT NULL
);

def readAllSkills(dbconn: java.sql.Connection): Try[List[Skill]] = Try {

  val query = "SELECT name, description FROM skills"

  val mapping =
    get[String]("skills.name") ~
    get[String]("skills.description") map {
      case owner~name =>
        Skill(name, owner)
      }

  SQL(query).as(mapping *)(dbconn)
}

what if a Skill is nested in a Project, and a Skill may be shared across multple Projects? This doesn’t have as straightforward a mapping to SQL, so we need to add foreign key IDs to our database schema, and join across the two tables:

case class Skill(name: String, description: String)

case class ProjectInfo(name: String,
                       owner: String,
                       created: java.util.Date,
                       skill: Skill)


CREATE TABLE skills (
  id            VARCHAR(36) NOT NULL,
  name          VARCHAR NOT NULL,
  description   VARCHAR NOT NULL
);

CREATE TABLE projects (
  owner         VARCHAR NOT NULL,
  name          VARCHAR NOT NULL,
  created       TIMESTAMP NOT NULL,
  skillId       VARCHAR(36) NOT NULL
);


def readAllProjects(dbconn: java.sql.Connection): Try[List[ProjectInfo]] = Try {

  val query =
    """
      SELECT projects.name, projects.owner, projects.created,
             skills.name, skills.description
      FROM projects
      INNER JOIN skills ON projects.skillId = skills.id
    """

  val mapping =
    get[String]("projects.name") ~
    get[String]("projects.owner") ~
    get[java.util.Date]("projects.created") ~
    get[String]("skills.name") ~
    get[String]("skills.description") map {
      case owner~name~created~skillName~skillDesc =>
        val skill = Skill(skillName, skillDesc)
        ProjectInfo(name, owner, created, skill)
    }

  SQL(query).as(mapping *)(dbconn)
}

what if a Project has a collection of Skills? to do this you need IDs in your Project table, and a third “association table” (aka “intersection table”) to track a list of skills for each project – projects no longer manages his own pointers to skills.

case class Skill(name: String, description: String)

case class ProjectInfo(name: String,
                       owner: String,
                       created: java.util.Date,
                       skills: Set[Skill])

CREATE TABLE skills (
  id            VARCHAR(36) NOT NULL,
  name          VARCHAR NOT NULL,
  description   VARCHAR NOT NULL
);

CREATE TABLE projects (
  id            VARCHAR(36) NOT NULL,
  owner         VARCHAR NOT NULL,
  name          VARCHAR NOT NULL,
  created       TIMESTAMP NOT NULL
);

CREATE TABLE project_skills (
  project_id    VARCHAR(36) NOT NULL,
  skill_id      VARCHAR(36) NOT NULL
);


def readSkillsForProject(dbconn: java.sql.Connection, projectId: String): Try[Set[Skill]] = Try {

  val query =
    """
      SELECT skills.name, skills.description FROM skills
      INNER JOIN project_skills ON skills.id = project_skills.skill_id
      INNER JOIN projects ON projects.id = project_skills.project_id
      WHERE projects.id = {projectId}
    """

  val mapping =
    get[String]("skills.name") ~
    get[String]("skills.description") map {
      case name~desc =>
        Skill(name, desc)
    }

  SQL(query).on('projectId -> projectId).as(mapping *)(dbconn).toSet
}

def readAllProjects(dbconn: java.sql.Connection): Try[Set[ProjectInfo]] = Try {

  val query = "SELECT projcets.id, projects.name, projects.owner, projects.created FROM projects"

  val mapping =
    get[String]("projects.id") ~
    get[String]("projects.name") ~
    get[String]("projects.owner") ~
    get[java.util.Date]("projects.created") map {
      case id~name~owner~created =>
        id -> (name, owner, created)
    }

  val projects: Map[String, (String, String, java.util.Date)] = SQL(query).as(mapping *)(dbconn).toMap

  projects.map { case (projectId, project) =>
    lazy val skills: Set[Skill] = readSkillsForProject(dbconn, projectId).get
    ProjectInfo(project._1, project._2, project._3, skills)
  }.toSet
}

Note that this generates N+1 queries - one for the project, and N skill queries. This could of course be turned into one even more complicated join, but this combined query can become very large and costly to run as we add more attributes to our model:

case class ProjectInfo(name: String,
                       owner: String,
                       created: java.util.Date,
                       skills: Set[Skill],
                       as: Set[AModel],
                       bs: Set[BModel],
                       ...
                       ns: Set[NModel])

So we have to choose a balance between querying everything up front, and lazily running more queries to load attributes on demand. Even in the presence of database indexes, each query has a network cost.

All of this just so we can have a nice, object-oriented programming model for our application logic.

how to get rid of this spiraling complexity

All this complexity is introduced because SQL cannot nest tables inside of tables. An SQL table is limited to a set of primitive types; if you want to store a reference to an object in another table, we need to implement “reference” types ourselves, by introducing IDs to our models to act as pointers, then we need to write application code to dereference the pointers, all of this causing network traffic.

Document-oriented databases (like MongoDB) do support nesting objects, but MongoDB does not maintain ACID properties. For example, we can’t have an atomic transaction when we write to multiple objects.

Introducing Datomic

(TODO - i’ve blogged it a little bit in the past, or watch one of Rich Hickey’s keynotes)

Datomic does not introduce this complexity, while maintaining ACID properties

Datomic solves this by embracing immutability. Datomic is an immutable database. To understand this, lets compare to git, which is an immutable datastore. I’m going to go into much more detail (loading data, instead of assuming it’s there) because it helps explain what is going on.

here is the schema for the exact same object model as the last SQL example – where a ProjectInfo has a Set of Skills. build-schema comes from a datomic utility library Datomico, which keeps things nice and readable for the demo.

Note how the schema exactly reflects the object model. No explicit IDs, no third table to store collection relations.

(def Skill (build-schema :Skill [[:name         :string :unique]          ; name is a string with unique constraint
                                 [:description  :string]]))

(def ProjectInfo (build-schema :ProjectInfo [[:name     :string :unique]
                                             [:owner    :string]
                                             [:created  :instant]
                                             [:skills   :ref :many]]))    ; skills is a Set of References

lets add some Skill ‘entities’ (Datomic parlance for a record).

  (def skill-list [{:Skill/name        "Clojure"
                    :Skill/description "omg parens"}

                   {:Skill/name        "Datomic"
                    :Skill/description "i got 99 proiblems but state ain't one"}

                   {:Skill/name        "Scala"
                    :Skill/description "what is this OO malarky (just kidding)"}

                   {:Skill/name        "Java"
                    :Skill/description "every empire eventually falls"}

                   {:Skill/name        "Postgresql"
                    :Skill/description "sometimes it's cool to be old"}])

  (def txresult @(d/transact dbconn (data-with-dbid skill-list)))

the txresult contains references to the “before” and “after” value of the database (in git terms, that’s the commit we just made, and the parent commit). We can use this new database value to query the entity we just added

;; (def dbval (:db-before txresult)) ;if we did this, following queries return no results
(def dbval (:db-after  txresult))

(defn qe
  "helper function for queries returning exactly one entity"
  [query dbval & args]
  (let [res (apply d/q query dbval args)]
    (d/entity dbval (ffirst res))))

(defn read-skill-by-name [dbval skill-name]
  (qe '[:find ?e
         :in $ ?name
         :where [?e :Skill/name ?name]]
       dbval skill-name))

(->> (read-skill-by-name dbval "Clojure")   ;=> {:db/id 17592186045418}
     (d/touch))                             ;=> {:Skill/name "Clojure",
                                            ;    :Skill/description "omg parens",
                                            ;    :db/id 17592186045418}

We can use the queried entity references in new ProjectInfo entities. This is a departure from postgresql - the ProjectInfo entities can direcly contain references to other entities. In SQL, we need to store pointers to other records; this is first class in Datomic.

(def dbval (:db-after  txresult))

(let [clojure     (:db/id (read-skill-by-name dbval "Clojure"))
      scala       (:db/id (read-skill-by-name dbval "Scala"))
      datomic     (:db/id (read-skill-by-name dbval "Datomic"))
      java        (:db/id (read-skill-by-name dbval "Java"))
      postgresql  (:db/id (read-skill-by-name dbval "Postgresql"))

      artscentre  {:ProjectInfo/name "Artscentre"
                   :ProjectInfo/owner "Dustin"
                   :ProjectInfo/created (java.util.Date.)
                   :ProjectInfo/skills [clojure scala datomic]}

      tinyurl     {:ProjectInfo/name "Tinyurl"
                   :ProjectInfo/owner "Kevin"
                   :ProjectInfo/created (java.util.Date.)
                   :ProjectInfo/skills [java postgresql]}]

  (def txresult @(d/transact dbconn (data-with-dbid [artscentre tinyurl]))))

and then we can inspect the values

  (def dbval (:db-after  txresult))

  (->> (read-all-projects dbval)    ; set of all projects
       (first)                      ; get the first ProjectInfo (lazy attrs)
       (d/touch)                    ; resolve all the projectinfo's attrs
       (:ProjectInfo/skills)        ; get this ProjectInfo's skills (lazy)
       (mapv d/touch))              ; resolve all the skills

;=> [{:Skill/name "Scala", :Skill/description "what is this OO malarky (just kidding)", :db/id 17592186045420}
     {:Skill/name "Datomic", :Skill/description "i got 99 proiblems but state ain't one", :db/id 17592186045419}
     {:Skill/name "Clojure", :Skill/description "omg parens", :db/id 17592186045418}]

TLDR: Datomic lets you keep the schema as intuitive as your object model, while preserving all the usual acid properties.

Here is the complete runnable code for use in a repl.

how can Datomic preserve ACID but also be performant?

(Which is the whole point - Datomic is the first database that lets you store nested objects, with ACID, with high performance)

It’s because of immutability. All prior databases are based on mutable data structures. From the Datomic docs:

Peers read facts from the Storage Service. The facts the Storage Service returns never change, so Peers do extensive caching. Each Peer’s cache represents a partial copy of all the facts in the database. The Peer cache implements a least-recently used policy for discarding data, making it possible to work with databases that won’t fit entirely in memory. Once a Peer’s working set is cached, there is little or no network traffic for reads.

Peers write new facts by asking the Transactor to add them to the Storage Service. The Transactor processes these requests using ACID transactions, ensuring they succeed or fail atomically and do not interfere with one another. The Transactor notifies all Peers about new facts so that they can add them to their caches.

The git analogy holds well – imagine you are running git fetch in a tight loop in the background, so you always know the about the latest commits. I asked Stuart Sierra in person how long it takes for a Peer to become aware of another Peer’s writes, and he says if your Peers are in the same datacenter as the Transactor it happens in single digit milleseconds. So your reads are always consistent, they may be a few milleseconds stale.

I may blog about this in the future if enough people tweet at me.

References