Atomic Commits

Disclaimer: Work in progress, prone to change.

Atomic Commits is a proposed standard for communicating state changes (events / transactions / patches / deltas / mutations) of Atomic Data. It is the part of Atomic Data that is concerned with writing, editing, removing and updating information.

Design goals

  • Event sourced: Store and standardize changes, as well as the current state. This enables versioning, history playback, undo, audit logs, and more.
  • Traceable origin: Every change should be traceable to an actor and a point in time.
  • Verifiable: Have cryptographic proof for every change. Know when, and what was changed by whom.
  • Identifiable: A single commit has an identifier - it is a resource.
  • Decentralized: Commits can be shared in P2P networks from device to device, whilst maintaining verifiability.
  • Extensible: The methods inside a commit are not fixed. Use-case specific methods can be added by anyone.
  • Streamable: Commits can be used in a streaming context.
  • Familiar: Introduces as little new stuff as possible (no new formats or languages to learn).
  • Pub/Sub: Subscribe to a resource and get notified when it changes.
  • ACID-compliant: An Atomic commit will only occur if it results in a valid state.
  • Atomic: All the Atomic Data design goals also apply here.
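
To make the event-sourced and ACID-compliant goals more concrete, here is a minimal sketch in Python. It is not the Atomic Commits specification: the `Commit` fields (`subject`, `signer`, `created_at`, `set_props`, `remove_props`), the `is_valid` callback and the example URLs are all assumptions made for illustration. The idea is that state is derived by replaying commits in order, and a commit is rejected if applying it would leave a resource in an invalid state.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

# subject URL -> {property: value}
State = Dict[str, Dict[str, Any]]


@dataclass(frozen=True)
class Commit:
    # Field names are illustrative assumptions, not taken from the spec.
    subject: str                 # the resource being changed
    signer: str                  # the actor responsible for the change
    created_at: int              # point in time, e.g. ms since epoch
    set_props: Dict[str, Any] = field(default_factory=dict)  # values to write
    remove_props: Tuple[str, ...] = ()                        # properties to delete


def apply_commit(state: State, commit: Commit,
                 is_valid: Callable[[Dict[str, Any]], bool]) -> State:
    """Return a new state with the commit applied; raise if the result is invalid."""
    resource = dict(state.get(commit.subject, {}))
    resource.update(commit.set_props)
    for prop in commit.remove_props:
        resource.pop(prop, None)
    if not is_valid(resource):
        # All-or-nothing: the original state is left untouched.
        raise ValueError(f"commit on {commit.subject} rejected: invalid state")
    return {**state, commit.subject: resource}


def replay(commits: List[Commit],
           is_valid: Callable[[Dict[str, Any]], bool]) -> State:
    """Rebuild state from scratch by applying commits in chronological order."""
    state: State = {}
    for commit in sorted(commits, key=lambda c: c.created_at):
        state = apply_commit(state, commit, is_valid)
    return state


def has_name(resource: Dict[str, Any]) -> bool:
    # Hypothetical validation rule: every resource needs a "name".
    return "name" in resource


alice = "https://example.com/alice"
agent = "https://example.com/agents/alice"
history = [
    Commit(alice, agent, 1, set_props={"name": "Alice", "draft": True}),
    Commit(alice, agent, 2, remove_props=("draft",)),
]

current = replay(history, has_name)        # state after all commits
earlier = replay(history[:1], has_name)    # history playback: state after the first commit
```

Because the commits themselves are stored, versioning, undo and audit logs all reduce to replaying a prefix or a filtered subset of `history`.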

Motivation

Although it's a good idea to keep data at the source as much as possible, we'll often need to synchronize two systems, for example when data has to be queried or indexed differently than its source can support. Doing this synchronization can be very difficult, since most of our software is designed to only maintain and share the current state of a system.

I noticed this mainly when working on OpenBesluitvorming.nl - an open data project where we aimed to fetch and standardize meeting data (votes, meeting minutes, documents) from 150+ local governments in the Netherlands. We wrote software that fetched data from various systems (which all had different models, serialization formats and APIs), transformed this data to a single standard and shared it through an API and a full-text search endpoint. One of the hard parts was keeping our data in sync with the sources. How could we know if something was changed upstream? We queried all these systems every night for all meetings from the next and previous month, and made deep comparisons to our own data.

This approach has a couple of issues:

  • It costs a lot of resources, both for us and for the data suppliers.
  • It's not real-time - we can only run this once every 24 hours (because of how costly it is).
  • It's very prone to errors. We've had issues during all phases of Extraction, Transformation and Loading (ETL) processing.
  • It causes privacy issues. When some data at the source is removed (because it contained faulty or privacy sensitive data), how do we learn about that?

Persisting and sharing state changes could solve these issues. In order for this to work, we need to standardize this for all data suppliers. We need a specification that is easy to understand for most developers.
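
As a rough illustration of that idea, the sketch below replaces the nightly deep comparison with a source that persists each change and pushes it to subscribers. The `Source` and `Mirror` classes, the `(subject, resource)` change shape and the example URL are hypothetical and only show the flow; they are not part of any specification.

```python
from typing import Callable, Dict, List, Optional, Tuple

# (subject, new resource); a resource of None means "removed at the source".
Change = Tuple[str, Optional[dict]]


class Source:
    """Hypothetical data supplier that records and publishes every change."""

    def __init__(self) -> None:
        self.log: List[Change] = []
        self._subscribers: List[Callable[[Change], None]] = []

    def subscribe(self, callback: Callable[[Change], None]) -> None:
        self._subscribers.append(callback)

    def publish(self, subject: str, resource: Optional[dict]) -> None:
        change = (subject, resource)
        self.log.append(change)       # persist the change itself, not just the result
        for notify in self._subscribers:
            notify(change)            # push it to subscribers right away


class Mirror:
    """Hypothetical downstream copy, such as a search index."""

    def __init__(self) -> None:
        self.data: Dict[str, dict] = {}

    def on_change(self, change: Change) -> None:
        subject, resource = change
        if resource is None:
            self.data.pop(subject, None)   # a removal upstream is applied here too
        else:
            self.data[subject] = resource


source = Source()
mirror = Mirror()
source.subscribe(mirror.on_change)

meeting = "https://example.com/meetings/1"
source.publish(meeting, {"title": "Council meeting"})
after_create = dict(mirror.data)

source.publish(meeting, None)   # e.g. removed because it held privacy-sensitive data
after_remove = dict(mirror.data)
```

In this sketch the mirror only does work when something actually changes, and a removal at the source reaches the mirror as an explicit change instead of having to be inferred from a missing record.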

Keeping track of where data comes from is essential to knowing whether you can trust it - whether you consider it to be true. When you want to persist data, that quickly becomes bothersome. Atomic Data and Atomic Commits aim to make this easier by using cryptography to ensure that data really comes from a particular source, so you can judge whether to trust it.
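
The sketch below shows the general shape of checking that a change was not altered after its source signed it. It is a stand-in, not the scheme used by Atomic Commits: it uses HMAC-SHA256 from the Python standard library, which relies on a shared secret, whereas verifying changes from arbitrary sources would call for public-key signatures (these are not available in the standard library). The `canonical`, `sign` and `verify` helpers, the field names and the key are assumptions for illustration.

```python
import hashlib
import hmac
import json


def canonical(change: dict) -> bytes:
    """Serialize deterministically so signer and verifier hash the same bytes."""
    return json.dumps(change, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign(change: dict, key: bytes) -> str:
    # HMAC is a shared-secret stand-in for a real public-key signature.
    return hmac.new(key, canonical(change), hashlib.sha256).hexdigest()


def verify(change: dict, signature: str, key: bytes) -> bool:
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(sign(change, key), signature)


key = b"demo-key-not-for-production"
change = {
    "subject": "https://example.com/meetings/1",   # hypothetical field names
    "signer": "https://example.com/agents/clerk",
    "created_at": 1,
    "set": {"title": "Council meeting"},
}
signature = sign(change, key)

untouched_ok = verify(change, signature, key)

tampered = {**change, "set": {"title": "Something else"}}
tampered_ok = verify(tampered, signature, key)
```

Here `untouched_ok` is true and `tampered_ok` is false: any edit to the signed fields invalidates the signature, which is what lets a receiver trace a change back to its signer.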

FAQ

Is Atomic Commits a Conflict-free Replicated Data Type (CRDT)?

Since Atomic Data always has a clear owner, all changes come from a single source of truth. This prevents a lot of the issues that CRDTs aim to solve, such as two people working on the same word at the same time in some text editor.

How does it compare to other delta formats?

See the compare section.