Atomic Data Overview

Status: early draft, far from usable. Feedback welcome.

Atomic Data is a proposed standard for modeling and exchanging data. It uses links to connect pieces of data, and therefore makes it easier to connect datasets to each other - even when these datasets exist on separate machines.

Atomic Data is especially suitable for knowledge graphs, distributed datasets, semantic data, p2p applications, decentralized apps, and data that is meant to be highly reusable. It is designed to be highly extensible, easy to use, and to make the process of domain specific standardization as simple as possible.

Atomic Data is Linked Data, as it is a more strict subset of RDF. It is typed (you know if something is a string, number, URL, etc.) and extensible through Atomic Schema, which means that you can define your own Classes, Properties and Datatypes. Atomic Data has a standard for synchronizing data by communicating state changes, called Atomic Mutations. You can use parts of Atomic Data separately, but the standard is designed as a full, integrated data management package that makes it easier to create, share and use structured data on the web.

Motivation

Linked data (RDF / the semantic web) enables us to use the web as a large, decentralized graph database. Using links everywhere in data has amazing merits: links remove ambiguity, they enable exploration, they enable connected datasets. Linked Data could help to democratize the web by decentralizing information storage, and giving people more control. The Solid Project by Tim Berners-Lee is a great example of why linked data can help to create a more decentralized web.

At Ontola, we've been working with linked data quite intensively for the last couple of years. We went all-in on RDF, and challenged ourselves to create software that communicates exclusively using it. That has been an inspiring, but sometimes also frustrating journey. While building various production-grade apps (e.g. our e-democracy platform Argu.co, which is used by various governments), we had to solve many problems. How to properly model data in RDF? How to deal with sequences? How do you deal with mutations? Converting to HTML? Typing? CORS? We tackled some of these problems by keeping a tight grip on the data that we create (e.g. we know the type of data, because we control the resources), and others by creating new protocols, formats, tools, and libraries. But it took a long time, and it was hard. It's been almost 15 years since the introduction of linked data, and its adoption has been slow. We know that some of its merits are undeniable, and we truly want the semantic web to succeed. We believe the lack of growth partially has to do with a lack of tooling, but also with some problems that lie in the RDF data model.

Atomic Data aims to take the best parts from RDF, and learn from the past to make a more developer-friendly, performant and reliable data model to achieve a truly linked web.

Atomic Data Core

The Atomic Data Core describes the fundamental data model of Atomic Data. Before we dive into its concepts, we'll talk about why this standard is made in the first place.

Design goals

  • Browsable: Data should explicitly link to other pieces of data, and these links should be followable.
  • Semantic: Every data Atom and relation has a clear semantic meaning.
  • Open: Free to use, open source, no strings attached.
  • Clear Ownership: The data shows who is in control of the data, so new versions of the data can easily be retrieved.
  • Mergeable: Any two sets of Atoms can be merged into a single graph without any merge conflicts / name collisions.
  • Interoperable: Can easily and consistently be converted to other data formats (e.g. JSON, XML, and all RDF formats).
  • Extensible: Anyone can define their own data types and create Atoms with it.
  • ORM-friendly: Navigate a decentralized graph by using dot.syntax, similar to how you navigate a JSON object in javascript.
  • Typed: All valid Atomic data has an unambiguous, static datatype. Models expressed in Atomic Data can be mapped to programming language models, such as structs or interfaces in Typescript / Rust / Go.

Note that for these last four goals, Atomic Schema is required.

When should you use Atomic Data

  • Flexible schemas. When dealing with structured wikis or semantic data, various instances of things will have different attributes. Atomic Data allows any kind of property on any resource.
  • High-value open data. Atomic Data is a bit harder to create, but it is easier to re-use and understand. Its use of URLs for properties makes data self-documenting.
  • Standardization is important. When multiple groups of people have to use the same schema, Atomic Data provides easy ways to constrain and validate the data.
  • Multi-class / multi-model. Contrary to (SQL) tables, Atomic Data allows a single thing to have multiple classes, each with their own properties.
  • Connected / decentralized data. With Atomic Data, you use URLs to point to things on other computers. This makes it possible to connect datasets very explicitly, without creating copies.
  • Interactive data. When users need to make changes to the data, Atomic Mutations provide a standardized way to communicate and synchronize those changes.
  • RDF as Output. Atomic Data serializes to idiomatic, clean RDF (Turtle / JSON-LD / n-triples / RDF/XML).

When not to use Atomic Data

  • Internal use only. If you're not sharing structured data, Atomic Data will probably only make things harder for you.
  • Big Data. If you're dealing with terabytes of data, you probably don't want to use Atomic Data.

Atomic Data Core: Concepts

Understanding the Core concepts of Atomic Data is fundamental for reading the rest of the documentation.

Atomic Data

Atomic Data is a data model for sharing information on the web. It is a directed, labeled graph, similar to RDF. It can be used to express any type of information, including personal data, vocabularies, metadata, documents, files and more. Contrary to some other (labeled) graph data models, a relationship between two items (Resources) does not have attributes. It's designed to be easily serializable to both JSON and linked data formats.

Atom (or Atomic Triple)

The smallest possible piece of meaningful data / information. You can think of an Atom as a single cell in a spreadsheet or database. An Atom consists of three fields:

  • Subject: the Thing that the atom is providing information about.
  • Property: the property of the Thing that the atom is about (will always be a URL to a Property).
  • Value: the piece of information about the Subject.

If you're familiar with RDF, you'll notice similarities. An Atom is comparable with an RDF Triple / Statement (although there are important differences).

Let's turn this sentence into Atoms:

Arnold Peters, who's born on the 20th of January 1991, has a best friend named Britta Smalls.

Subject | Property    | Value
--------|-------------|-----------
Arnold  | last name   | Peters
Arnold  | birthdate   | 1991-01-20
Arnold  | best friend | Britta
Britta  | last name   | Smalls

The table above shows easily readable strings, but in reality, Atomic Data will almost exclusively consist of links (URLs). The standard serialization format for Atomic Data is AD3, which looks like this:

["https://example.com/arnold","https://example.com/properties/lastname","Peters"]
["https://example.com/arnold","https://example.com/properties/birthDate","1991-01-20"]
["https://example.com/arnold","https://example.com/properties/bestFriend","https://example.com/britta"]
["https://example.com/britta","https://example.com/properties/lastname","Smalls"]

In the Atomic Data above, we have:

  • four different Atoms (every line is an Atom)
  • two different Subjects: https://example.com/arnold and https://example.com/britta.
  • three different Properties (https://example.com/properties/lastname, https://example.com/properties/birthDate, and https://example.com/properties/bestFriend)
  • four different Values (Peters, 1991-01-20, https://example.com/britta and Smalls)

All Subjects and Properties are Atomic URLs: they are links that point to more Atomic Data. One of the Values is a URL, too, but we also have Values like Peters and 1991-01-20. These Values have different Datatypes. In most other data formats, the datatypes are limited and visually distinct. JSON, for example, has array, object, string, number and boolean. In Atomic Data, however, datatypes are defined somewhere else, and are extensible. To find the Datatype of an Atom, you fetch the Property, which in turn links to a Datatype. For example, the https://example.com/properties/birthDate Property requires an ISO Date string, and the https://example.com/properties/lastname Property requires a regular string. This might seem a little tedious and weird at first, but it has some nice advantages!
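
For example, a client might look up an Atom's Datatype roughly like this. This is a minimal sketch: fetch_resource is a hypothetical helper, and the datatype Property URL is taken from the Property example later in this document.

use std::collections::HashMap;

// Hypothetical helper: fetches all Atoms of a Subject and returns them
// as a Property -> Value map (see Subject Fetching later in this document).
fn fetch_resource(subject: &str) -> Option<HashMap<String, String>> {
    unimplemented!("network fetching omitted in this sketch: {}", subject)
}

// To find the Datatype of an Atom, fetch its Property resource and
// read that resource's datatype field, which links to a Datatype.
fn datatype_of(property_url: &str) -> Option<String> {
    let property = fetch_resource(property_url)?;
    property
        .get("https://atomicdata.dev/property/datatype")
        .cloned()
}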

Subject field

The Subject field is the first part of an Atom. It is the identifier that the rest of the Atom is providing information about. The Subject field is a URL that points to the Resource. The creator of the Subject MUST make sure that it resolves. In other words: following / downloading the Subject link will provide you with all the Atoms about the Subject (see Atomic Querying). This also means that the creator of a Resource must make sure that it is available at its URL - probably by hosting the data, or by using some service that hosts it.

Property field

The Property field is the second part of an Atom. It is a URL that points to an Atomic Property. For example https://example.com/createdAt or https://example.com/firstName.

The Property field MUST be a URL, and that URL MUST resolve to an Atomic Property, which contains information about the Datatype.

Value field

The Value field is the third part of an Atom. In RDF, this is called an object. Contrary to the Subject and Property values, the Value can be of any datatype. This includes URLs, strings, integers, dates and more.

Graph

A Graph is a set of Atoms. A Graph can describe various subjects, which may or may not be related. Graphs can have several characteristics (Schema Complete, Valid, Closed).

Resource

A Resource is a set of Atoms (a Graph) that share the same Subject URL. You can think of a Resource as a single row in a spreadsheet or database. In practice, Resources can be anything - a Person, a Blogpost, a Todo item. A Resource consists of at least one Atom, so it always has some Property and some Value. The most important Property of a Resource is the isA Property, which refers to the Class(es) it belongs to (e.g. Person or Blogpost). A Class can specify required and recommended properties. More on that in the Atomic Schema chapter!
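
For example, this Atom uses the isA Property (from Atomic Schema) to state that Arnold is an instance of a hypothetical Person class:

["https://example.com/arnold","https://atomicdata.dev/properties/isA","https://example.com/classes/Person"]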

Atomic Web

The Atomic Web refers to all Atomic Graphs on the web.

Serialization of Atomic Data

Atomic Data is not necessarily bound to a single serialization format. It's fundamentally a data model, and that's an important distinction to make. However, it's recommended to use AD3, which is specifically designed to be a simple and performant format for Atomic Data.

Atomic Data is designed to be serializable to idiomatic (clean, nice) JSON. It's also serializable to RDF, which includes Turtle, N-triples, RDF/XML and other serialization formats.

AD3

AD3 stands for Atomic Data Triples, and it's the simplest and fastest way to serialize / parse Atomic Data.

AD3 represents a single Atom as a single line, containing a JSON array of three strings, respectively representing the Subject, Property and Value.

It looks like this:

["https://example.com/subject","https://example.com/property","some object"]
["https://example.com/subject","https://example.com/otherProperty","https://example.com/somethingelse"]

It uses Newline Delimited JSON (NDJSON) for serialization, which is just a large string with newlines between each JSON object.

NDJSON has some important benefits:

  • Streaming parsing: An NDJSON document can be parsed before it's fully loaded / transmitted. That is not possible with regular JSON.
  • High compatibility: NDJSON parsers can use JSON parsers, and are therefore everywhere.
  • Performance: Modern browsers have highly performant JSON parsing, which means that it's fast in one of the most important contexts: the browser.
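
For example, a streaming parser can handle each Atom as soon as its line arrives, without waiting for the rest of the response. A minimal sketch, assuming the serde_json crate:

use std::io::{BufRead, BufReader, Read};

// Parse Atoms from any streaming source (e.g. an HTTP response body),
// handling each completed line as soon as it arrives.
fn parse_ad3_stream<R: Read>(source: R) -> std::io::Result<Vec<Vec<String>>> {
    let mut atoms = Vec::new();
    for line in BufReader::new(source).lines() {
        let line = line?;
        // Every line is a self-contained JSON array of three strings.
        if let Ok(atom) = serde_json::from_str::<Vec<String>>(&line) {
            if atom.len() == 3 {
                atoms.push(atom);
            }
        }
    }
    Ok(atoms)
}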

Mime type (not registered yet!): application/ad3-ndjson

File name extension: .ad3

Disclaimer: note that AD3 is useful for communicating current state, but not for state changes.

You can validate AD3 at atomicdata.dev/validate.

Atomic Triples is heavily inspired by HexTuples-NDJSON.

Here's an example serializer implementation written in Rust, to show how easy it is to serialize AD3:


pub fn serialize_atoms_to_ad3(atoms: Vec<Atom>) -> AtomicResult<String> {
    let mut string = String::new();
    for atom in atoms {
        // Use an existing JSON serialization library to take care of the hard work (escaping quotes, etc.)
        let mut ad3_atom = serde_json::to_string(&[&atom.subject, &atom.property, &atom.value])?;
        ad3_atom.push('\n');
        string.push_str(&ad3_atom);
    }
    Ok(string)
}

And an example parser:


pub fn parse_ad3(string: &str) -> AtomicResult<Vec<Atom>> {
    let mut atoms: Vec<Atom> = Vec::new();
    for line in string.lines() {
        match line.chars().next() {
            // Lines starting with a `#` or a space are ignored as comments
            Some('#') => {}
            Some(' ') => {}
            // That's an array, let's parse it!
            Some('[') => {
                let string_vec: Vec<String> =
                    parse_json_array(line).expect(&format!("Parsing error in {:?}", line));
                if string_vec.len() != 3 {
                    return Err(format!(
                        "Wrong length of array at line {:?}: should be 3",
                        line
                    )
                    .into());
                }
                let subject = &string_vec[0];
                let property = &string_vec[1];
                let value = &string_vec[2];
                atoms.push(Atom::new(subject, property, value));
            }
            Some(char) => {
                return Err(format!(
                    "AD3 parsing error at {:?}: line cannot start with {}",
                    line, char
                )
                .into())
            }
            None => {}
        };
    }
    Ok(atoms)
}

AD2

AD2 (Atomic Data Doubles) is similar to AD3 (Atomic Data Triples), with one exception: the Subject is left out. For many use-cases, omitting the Subject is a bad idea - you'll most often need AD3! Having no subject means that you can't describe multiple resources in a single document, and that is useful in many contexts.

However, omitting the subject can be useful in (at least) three scenarios:

  • The Subject is not yet known when creating the data (for example, because it still has to be determined by some server or hash function).
  • The Subject is already known by the client, and leaving it out saves bandwidth. This happens for example during Subject Fetching, where the request itself contains the Subject, because the fetched URL itself is the Subject of all returned triples. Note that in this scenario, the server is unable to include Atoms about other Subjects.
  • The Atoms are only valid coming from a specific source.

AD2 looks like this:

["https://example.com/property","some object"]
["https://example.com/otherProperty","https://example.com/somethingelse"]

Keep in mind that this approach also has some downsides:

  • It becomes impossible to include other resources in a single serialized document / response.

Mime type (not registered yet!): application/ad2-ndjson

File name extension: .ad2

RDF serialization formats

Because of the similarities with RDF, RDF serialization formats can be used to communicate and store Atomic Data, such as N-Triples, Turtle, HexTuples or JSON-LD. However, keep in mind that RDF users will expect other things from their data. Read more about the various existing formats and their respective merits here. Read more about serializing Atomic Data to RDF in the RDF interoperability section.

Future serialization formats

In the future, new serialization formats will be introduced. Here are some (vague) ideas that might inspire you to design one:

AtomicData-FS

Possible extension: .adf

FS stands for FileSystem. It should be designed as a format that makes it easy to manipulate Atomic Data by hand, using plaintext editors and IDE software. It fits nicely in our line-based paradigm, where we use IDEs and Github to manage our information. It should use Shortnames wherever possible to make life easier for those who modify instances. It might use hierarchical path structures to shape URLs and data, and to set constraints (e.g. all items directly in the ./person directory should be Person instances). The folder structure should reflect the structure inside URLs.

Note that this format is not useful for sending arbitrary Atomic Data to some client. It is useful for managing Atomic Data from a filesystem.

An example AtomicData-FS dir can be found in the repo.

# in ./projectDir/people/john.adf
# serialization uses YAML syntax
firstName: John
lastName: McLovin
# If a Property is not available in the Class, you can use the URL of the Property
https://schema.org/birthDate: 1991-01-20
# Perhaps support relative paths to other local resources
bestFriend: ./mary

Perhaps YAML isn't the right pick for this, because it's kind of hard to parse.

AtomicData-Binary

Possible extension: .adb

A binary serialization format, designed to be performant and highly compressed. Perhaps it works like this:

  • An adb file consists of a large sequence of Maps and Statements
  • A Map is a combination of an internal identifier (the ID, some short binary object) and a URL string. These make sure that URLs can be re-used cheaply if they appear multiple times.
  • A Statement is a set of two IDs and a value, which can be a String, a URL or some binary format.
  • Perhaps some extra compression is possible, because many URLs will have a common domain.

Querying Atomic Data

There are multiple ways of getting Atomic Data into some system:

  • Atomic Paths is a simple way to traverse Atomic Graphs and target specific values
  • Subject Fetching requests a single subject right from its source
  • Triple Pattern Fragments allows querying for specific (combinations of) Subject, Property and Value.
  • SPARQL is a powerful query language for traversing graphs

Atomic Paths

An Atomic Path is a string that consists of one or more URLs, which when traversed point to an item. For more information, see Atomic Paths.

Subject fetching (HTTP)

The simplest way of getting Atomic Data when the Subject is an HTTP URL, is by sending a GET request to the subject URL. Set the Content-Type header to an Atomic Data compatible mime type, such as application/ad3-ndjson.

GET https://example.com/myResource HTTP/1.1
Content-Type: application/ad3-ndjson

The server SHOULD respond with all the Atoms of which the requested URL is the subject:

HTTP/1.1 200 OK
Content-Type: application/ad3-ndjson
Connection: Closed

["https://example.com/myResource","https://example.com/properties/name","My awesome resource!"]

The server MAY also include other resources, if they are deemed relevant.
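
A client might perform Subject Fetching like this. This is a minimal sketch using the reqwest crate's blocking API; error handling and parsing are kept deliberately simple:

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::blocking::Client::new();
    // Request the Subject URL with an Atomic Data compatible mime type.
    let body = client
        .get("https://example.com/myResource")
        .header("Content-Type", "application/ad3-ndjson")
        .send()?
        .text()?;
    // Every line in the response is an Atom about the requested Subject.
    for line in body.lines() {
        println!("{}", line);
    }
    Ok(())
}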

Subject Fetching (IPFS)

IPFS is a new protocol for sharing data using content-addressing.

Triple Pattern Fragments

Triple Pattern Fragments (TPF) is an interface for querying RDF. It works great for Atomic Data as well.

An HTTP implementation of a TPF endpoint might accept a GET request to a URL such as this:

http://example.org/tpf?subject={subject}&property={property}&value={value}

Make sure to URL encode the subject, property, value strings.

For example, let's search for all Atoms where the value is test.

GET https://example.com/tpf?value="test" HTTP/1.1
Content-Type: application/ad3-ndjson

This is the HTTP response:

HTTP/1.1 200 OK
Content-Type: application/ad3-ndjson
Connection: Closed

["https://example.com/myResource","https://example.com/properties/name","test"]

SPARQL

SPARQL is a powerful RDF query language. Since all Atomic Data is also valid RDF, it should be possible to query Atomic Data using SPARQL.

Atomic Paths

An Atomic Path is a string that consists of at least one URL, followed by one or more URLs or Shortnames. Every single value in an Atomic Resource can be targeted through such a Path. They can be used as identifiers for specific Values.

The simplest path is the URL of a resource, which represents the entire Resource with all its properties. If you want to target a specific atom, you can use an Atomic Path with a second URL. This second URL can be replaced by a Shortname, if the Resource is an instance of a class which has properties with that shortname (sounds more complicated than it is).

Example

Let's start with this simple graph:

["https://example.com/john", "https://example.com/lastName", "McLovin"]

Then the following Path targets the McLovin value:

https://example.com/john https://example.com/lastName => McLovin

If the resource is an instance of a Class (an atomic:isA property), you can use the Shortnames of the Properties that are referred to by that class. Since John is an instance of a Person, he might have a lastname Shortname which maps to https://example.com/lastName.

https://example.com/john lastname => McLovin

We can also traverse relationships:

["https://example.com/john", "https://atomicdata.dev/properties/isA", "https://example.com/Person"]
["https://example.com/john", "https://example.com/lastName", "McLovin"]
["https://example.com/john", "https://example.com/employer", "https://example.com/XCorp"]
["https://example.com/XCorp", "https://example.com/description", "The greatest company!"]

https://example.com/john employer description => The greatest company!

In the example above, the XCorp subject exists and is the source of the The greatest company! value. However, using paths, it's also possible to create nested resources without creating new URLs for all children.
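
Before moving on to nested resources: here's a minimal sketch of how plain Path traversal could work over an in-memory store. Shortname resolution and array indices are left out, and a real implementation would fetch unknown Subjects over the network:

use std::collections::HashMap;

// An in-memory store: Subject -> (Property -> Value).
type Store = HashMap<String, HashMap<String, String>>;

// Resolve a Path by starting at the Subject and following one link per segment.
fn resolve_path(store: &Store, path: &str) -> Option<String> {
    let mut segments = path.split(' ');
    let mut current = segments.next()?.to_string();
    for segment in segments {
        let resource = store.get(&current)?;
        current = resource.get(segment)?.clone();
    }
    Some(current)
}

fn main() {
    let mut store = Store::new();
    store.insert(
        "https://example.com/john".into(),
        HashMap::from([("https://example.com/employer".into(), "https://example.com/XCorp".into())]),
    );
    store.insert(
        "https://example.com/XCorp".into(),
        HashMap::from([("https://example.com/description".into(), "The greatest company!".into())]),
    );
    let path = "https://example.com/john https://example.com/employer https://example.com/description";
    assert_eq!(resolve_path(&store, path).as_deref(), Some("The greatest company!"));
}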

Nested Resources

All Atomic Data Resources that we've discussed so far have a URL as a subject. Unfortunately, creating unique and resolvable URLs can be a bother, and sometimes not necessary. If you've worked with RDF, this is what Blank Nodes are used for. In Atomic Data, we have something similar: Nested Resources.

Let's use a Nested Resource in the example from the previous section:

["https://example.com/john", "https://example.com/lastName", "McLovin"]
["https://example.com/john https://example.com/employer", "https://example.com/description", "The greatest company!"]

By combining two Subject URLs into a single string, we've created a nested resource. The Subject of the nested resource is https://example.com/john https://example.com/employer, including the space.

Note that the path from before still resolves:

https://example.com/john employer description => The greatest company!

Serialization formats are free to use nesting to denote paths - which means that it is not necessary to include these path strings explicitly in most serialization formats.

For example:

{
  "@id": "https://example.com/john",
  "@context": "https://example.com/person",
  "hasShoes": [
    {
      "name": "Mr. Boot",
    },
    {
      "name": "Sunny Sandals",
    }
  ]
}

The Path of Mr. Boot is:

https://example.com/john hasShoes 0 name

This Path is useful for storing the value in other serialization formats, such as .ad3:

["https://example.com/john https://example.com/hasShoes 0", "https://example.com/name", "Mr. Boot"]
["https://example.com/john https://example.com/hasShoes 1", "https://example.com/name", "Sunny Sandals"]

You can target an item in an array by using a number to indicate its position, starting with 0.

Notice how the Resource with the name: Mr. Boot does not have an explicit @id, but it does have a Path.

Atomic Schema

Atomic Schema is the proposed standard for specifying classes, properties and datatypes in Atomic Data. You can compare it to what XSD is for XML. Atomic Schema deals with the shape of data: which Properties a Class requires or recommends, and which Datatype a Value must conform to.

This section will define various Classes, Properties and Datatypes (discussed in Atomic Core: Concepts).

Design Goals

  • Typed: Every Atom of data has a clear datatype.
  • IDE-friendly: You should not have to type full URLs - the schema sets shortnames.
  • Self-documenting: When seeing a piece of data, simply following links will explain how the model is to be understood. This removes the need for (most) existing API documentation.
  • Performant: Datatypes can have a binary representation for optimal storage, communication, serialization and parsing efficiency.
  • Extensible: Anybody can create their own Datatypes, Properties and Classes.
  • Accessible: Support for languages, easily translatable. Useful for humans and machines.
  • Atomic: All the design goals of Atomic Data itself also apply here.
  • Self-describing: Atomic Schema is to be described as Atomic Data using Atomic Schema.

In short

In short, Atomic Schema works like this:

The Property field in an Atom links to a Property Resource. It is important that the URL to the Property Resource resolves. This Property does three things:

  1. it tells something about its semantic meaning (in its description).
  2. it links to a Datatype (and optionally a classtype), which indicates which Values are acceptable.
  3. it provides a Shortname, which is used for ORM.

DataTypes define the shape of the Value, e.g. a Number (124) or Boolean (true).

Classes are a special kind of Resource that describe an abstract class of things (such as "Person" or "Blog"). Classes can recommend or require a set of Properties. They behave as Models, similar to structs in C or interfaces in Typescript. A Resource can have one or more classes, which provide information about which Properties are expected or required.
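
For illustration, here's how a hypothetical Person Class (requiring lastName, recommending birthDate) might map to a Rust model. The names are assumptions, not part of the spec:

// A hypothetical mapping of a Person Class to a Rust struct.
struct Person {
    // a required Property maps to a plain field
    last_name: String,
    // a recommended Property maps to an optional field
    birth_date: Option<String>,
}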

Atomic Schema: Classes

How to read classes

Example:

  • description - (required, AtomicURL, TranslationBox) human readable explanation of what the Class represents.

Means:

This class has a required property with shortname description. This Property has a Datatype of AtomicURL, and these should point to TranslationBox instances.

Note: the URLs for properties are missing and will be added at a later time.

Property

URL: https://atomicdata.dev/classes/Property

The Property class. The thing that the Property field should link to. A Property is an abstract type of Resource that describes the relation between a Subject and a Value. A Property provides some semantic information about the relationship (in its description), it provides a shorthand (the shortname) and it links to a Datatype. Here's a list of useful Properties. You can constrain properties further by using SHACL Properties.

Properties of a Property instance:

  • shortname - (required, Slug) the shortname for the property, used in ORM-style dot syntax (thing.property.anotherproperty).
  • description - (optional, AtomicURL, TranslationBox) the semantic meaning of the Property.
  • datatype - (required, AtomicURL, Datatype) a URL to an Atomic Datatype, which defines what the Datatype of the Value must be.
  • classtype - (optional, AtomicURL, Class) if the datatype is an Atomic URL, the classtype defines which class(es) are acceptable.

Example:

["https://example.com/properties/createdAt","https://atomicdata.dev/property/shortname","createdAt"]
["https://example.com/properties/createdAt","https://atomicdata.dev/property/datatype","https://atomicdata.dev/datatype/datetime"]

Datatype

URL: https://atomicdata.dev/classes/Datatype

A Datatype specifies how a Value should be interpreted. Datatypes are concepts such as boolean, string, integer. Since Datatypes can be linked to, you can define your own. However, using non-standard datatypes limits how many applications will know what to do with the data.

Properties:

  • description - (required, AtomicURL, TranslationBox) how the datatype functions.
  • stringSerialization - (required, AtomicURL, TranslationBox) how the datatype should be parsed / serialized as a UTF-8 string.
  • stringExample - (required, string) an example stringSerialization that should be parsed correctly
  • binarySerialization - (optional, AtomicURL, TranslationBox) how the datatype should be parsed / serialized as a byte array.
  • binaryExample - (optional, string) an example binarySerialization that should be parsed correctly. Should have the same contents as the stringExample. Required if binarySerialization is present on the DataType.

Class

URL: https://atomicdata.dev/classes/Class

A Class is an abstract type of Resource, such as Person. It is convention to capitalize the Class name in its URL. Note that in Atomic Data, a Resource can have several Classes - not just a single one. If you need to set more complex constraints on your Classes (e.g. maximum string length, Properties that depend on each other), check out SHACL.

Properties:

  • shortname - (required, Slug) a short string shorthand.
  • description - (required, AtomicURL, TranslationBox) human readable explanation of what the Class represents.
  • requires - (optional, ResourceArray, Property) a list of Properties that are required. If absent, none are required. These SHOULD have unique shortnames.
  • recommends - (optional, ResourceArray, Property) a list of Properties that are recommended. These SHOULD have unique shortnames.
  • deprecatedProperties - (optional, ResourceArray, Property) - a list of Properties that should no longer be used.

A resource indicates it is an instance of that class by adding a https://atomicdata.dev/properties/isA Atom.

Example:

["https://example.com/classes/Person","https://atomicdata.dev/properties/isA","https://atomicdata.dev/classes/Class"]
["https://example.com/classes/Person","https://atomicdata.dev/properties/recommends","https://example.com/classes/Person/recommends"]
["https://example.com/classes/Person/recommends","https://atomicdata.dev/properties/isA","https://atomicdata.dev/dataTypes/ResourceArray"]

Atomic Schema: Datatypes

The Atomic Datatypes consist of some of the most commonly used Datatypes.

Slug

URL: https://atomicdata.dev/datatypes/slug

A string with a limited set of allowed characters, used in IDE / text editor contexts. Only lowercase letters, numbers and dashes are allowed.

Regex: ^[a-z0-9]+(?:-[a-z0-9]+)*$
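
A sketch of validating a Slug in Rust, using the regex crate and the pattern above:

fn is_valid_slug(input: &str) -> bool {
    // Mirrors the Slug regex from the spec.
    let re = regex::Regex::new(r"^[a-z0-9]+(?:-[a-z0-9]+)*$").unwrap();
    re.is_match(input)
}

fn main() {
    assert!(is_valid_slug("best-friend"));
    assert!(!is_valid_slug("Best Friend"));
}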

Atomic URL

URL: https://atomicdata.dev/datatypes/atomicURL

A URL that should resolve to an Atomic Resource.

URI

URL: https://atomicdata.dev/datatypes/URI

A Uniform Resource Identifier, preferably a URL (i.e. a URI that can be fetched). Could be HTTP, HTTPS, or any other type of scheme.

String

URL: https://atomicdata.dev/datatypes/string

UTF-8 String, no max character count. Newlines use backslash-escaped \n characters. Should not contain language-specific data; use a TranslationBox instead.

e.g. String time! \n Second line!

Markdown

URL: https://atomicdata.dev/datatypes/markdown

A markdown string, using the CommonMark syntax. UTF-8 formatted, no max character count, newlines are \n.

e.g.

# Heading

Paragraph with [link](https://example.com).

Integer

URL: https://atomicdata.dev/datatypes/integer

Signed Integer, max 64 bit. Max value: 9223372036854775807

e.g. -420

Float

URL: https://atomicdata.dev/datatypes/float

A number with a decimal point.

e.g. -420.5

Boolean

URL: https://atomicdata.dev/datatypes/boolean

True or false, one or zero.

String serialization

true or false.

Binary serialization

Use a single bit per boolean: 1 for true, 0 for false.

Date

URL: https://atomicdata.dev/datatypes/date

ISO 8601 date without time. Format: YYYY-MM-DD.

e.g. 1991-01-20

Timestamp

URL: https://atomicdata.dev/datatypes/timestamp

Similar to Unix Timestamp. Milliseconds since midnight UTC 1970 jan 01 (aka the Unix Epoch). Use this for most DateTime fields. Signed 64 bit integer (instead of 32 bit in Unix systems).

e.g. 1596798919000 (= 07 Aug 2020 11:15:19 UTC)
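
Generating a valid Timestamp in Rust takes only the standard library. A minimal sketch:

use std::time::{SystemTime, UNIX_EPOCH};

// Milliseconds since the Unix Epoch, as a signed 64-bit integer.
fn now_timestamp() -> i64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock is set before 1970")
        .as_millis() as i64
}

fn main() {
    println!("{}", now_timestamp());
}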

ResourceArray

URL: https://atomicdata.dev/datatypes/resourceArray

Sequential, ordered list of Atomic URIs. Serialized as a JSON array with strings. Note that other types of arrays are not included in this spec, but can be perfectly valid. (discussion)

  • e.g. ["https://example.com/1", "https://example.com/1"]

Atomic Translations

Dealing with translations can be hard. (See discussion on this subject here.)

TranslationBox

URL: https://atomicdata.dev/classes/TranslationBox

A TranslationBox is a collection of translated strings, used to provide multiple translations of the same text. It has a long list of optional properties, each corresponding to some language. Each possible language Property uses the following URL template: https://atomicdata.dev/languages/{languageTag}. Use a BCP 47 language tag, e.g. nl or en-US.

For example:

["https://example.com/john","https://example.com/lifestory","https://example.com/johns/lifestory"]
["https://example.com/johns/lifestory","https://atomicdata.dev/langs/en-US","Well, John was born and later he died."]
["https://example.com/johns/lifestory","https://atomicdata.dev/langs/nl","Tsja, John werd geboren en stierf later."]

Every property used for Translation strings is an instance of the Translation class.

A translation string uses the MDString datatype, which means it allows Markdown syntax.

Atomic Schema FAQ

How do I create a Property that supports multiple Datatypes?

A Property only has one single Datatype. However, feel free to create a new kind of Datatype that, in turn, refers to other Datatypes. Perhaps Generics or Option-like types should be part of the Atomic Base Datatypes.

How should a client deal with Shortname collisions?

Atomic Data guarantees Subject-Property uniqueness, which means that Valid Resources are guaranteed to have only one of each Property. Properties offer Shortnames, which are short strings. These strings SHOULD be unique inside Classes, but these are not guaranteed to be unique inside all Resources. Note that Resources can have multiple Classes, and through that, they can have colliding Shortnames. Resources are also free to include Properties from other Classes, and their Shortnames, too, might collide.

For example:

["https://example.com/people/123", "https://example.com/name", "John"]
["https://example.com/people/123", "https://somepage.example.com/name", "John"]

Let's assume that https://somepage.example.com/name and https://example.com/name are Properties that have the Shortname: name.

What if a client tries something such as people123.name? To consistently return a single value, we need some type of precedence:

  1. Use the earliest Class mentioned in the isA Property of the resource. Resources can have multiple classes, but these appear in an ordered ResourceArray. Classes internally SHOULD have no Shortname collisions between their required and recommended Properties, but such collisions can still occur; if they do, sort the colliding Properties alphabetically by URL.
  2. When the Properties are not part of any of the mentioned Classes, use alphabetical sorting of the Property URL.

When shortname collisions are possible, it's recommended to not use the shortname, but use the URL of the Property:

people123."https://example.com/name"

Atomic Data uses a lot of links. How do you deal with links that don't work?

  1. Use URI schemes that use content addressing, such as IPFS URIs.

What's a URI, and what's a URL?

URI stands for Uniform Resource Identifier. A URL (Uniform Resource Locator) is a URI that can be fetched. Atomic Data requires URLs for its Subjects and Properties.

How does Atomic Schema relate to SHACL / SheX / OWL / RDFS?

These RDF ontologies are extremely powerful, well-documented and versatile.

Atomic Schema does not aim to be a formal ontological semantic framework - it is way too simple for that. It's just a simple modeling tool.

Atomic Commits

Disclaimer: Work in progress, prone to change.

Atomic Commits is a proposed standard for communicating state changes (events / transactions / patches / deltas / mutations) of Atomic Data. It is the part of Atomic Data that is concerned with writing, editing, removing and updating information.

Design goals

  • Event sourced: Store and standardize changes, as well as the current state. This enables versioning, history playback, undo, audit logs, and more.
  • Traceable origin: Every change should be traceable to an actor and a point in time.
  • Verifiable: Have cryptographic proof for every change. Know when, and what was changed by whom.
  • Identifiable: A single commit has an identifier - it is a resource.
  • Decentralized: Commits can be shared in P2P networks from device to device, whilst maintaining verifiability.
  • Extensible: The methods inside a commit are not fixed. Use-case specific methods can be added by anyone.
  • Streamable: The commits could be used in streaming context.
  • Familiar: Introduces as little new stuff as possible (no new formats or language to learn)
  • Pub/Sub: Subscribe to changes and get notified on changes.
  • ACID-compliant: An Atomic commit will only occur if it results in a valid state.
  • Atomic: All the Atomic Data design goals also apply here.

Motivation

Although it's a good idea to keep data at the source as much as possible, we'll often need to synchronize two systems. For example when data has to be queried or indexed differently than its source can support. Doing this synchronization can be very difficult, since most of our software is designed to only maintain and share the current state of a system.

I noticed this mainly when working on OpenBesluitvorming.nl - an open data project where we aimed to fetch and standardize meeting data (votes, meeting minutes, documents) from 150+ local governments in the Netherlands. We wrote software that fetched data from various systems (which all had different models, serialization formats and APIs), transformed this data to a single standard, and shared it through an API and a fulltext search endpoint. One of the hard parts was keeping our data in sync with the sources. How could we know if something was changed upstream? We queried all these systems every night for all meetings from the next and previous month, and made deep comparisons to our own data.

This approach has a couple of issues:

  • It costs a lot of resources, both for us and for the data suppliers.
  • It's not real-time - we can only run this once every 24 hours (because of how costly it is).
  • It's very prone to errors. We've had issues during all phases of Extraction, Transformation and Loading (ETL) processing.
  • It causes privacy issues. When some data at the source is removed (because it contained faulty or privacy sensitive data), how do we learn about that?

Persisting and sharing state changes could solve these issues. In order for this to work, we need to standardize this for all data suppliers. We need a specification that is easy to understand for most developers.

Keeping track of where data comes from is essential to knowing whether you can trust it - whether you consider it to be true. When you want to persist data, that quickly becomes bothersome. Atomic Data and Atomic Commits aim to make this easier by using cryptography for ensuring data comes from some particular source, and is therefore trustworthy.

FAQ

Is Atomic Commits a Conflict-free Replicated Data Type (CRDT)?

Since Atomic Data always has a clear owner, all changes come from a single source of truth. This prevents a lot of the issues that CRDTs aim to solve, such as two people working on the same word at the same time in some text editor.

How does it compare to other delta formats?

See the comparison section below.

Atomic Commits: Concepts

Commit

A Commit describes how a Resource must be updated. The required fields are:

  • subject - The thing being changed. A Resource Subject URL that the Commit is providing information about.
  • author - Who's making the change. The Atomic URL of the Author's profile - which in turn must contain a publicKey.
  • signature - Cryptographic proof of the change. A hash of the JSON-serialized Commit (without the signature field), signed by the Author's privateKey. This proves that the Author is indeed the one who created this exact commit. The signature of the Commit is also used as the identifier of the commit.
  • createdAt - When the change was made. A UNIX timestamp number of when the commit was created.

The optional method fields describe how the data must be changed:

  • destroy - If true, the existing Resource will be removed.
  • remove - an array of Properties that need to be removed (including their values).
  • set - a Nested Resource which contains all the new or edited fields.

These commands are executed in the order above. This means that you can set destroy to true and include set, which empties the existing resource and sets new values.
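
As an illustration, applying these method fields in order might look like this. A sketch: the Resource and Commit types are simplified assumptions, not normative definitions:

use std::collections::HashMap;

// Simplified types for illustration only.
type Resource = HashMap<String, String>; // Property URL -> Value

struct Commit {
    destroy: bool,
    remove: Vec<String>,
    set: HashMap<String, String>,
}

// Execute the methods in the specified order: destroy, remove, set.
fn apply_commit(resource: &mut Resource, commit: &Commit) {
    if commit.destroy {
        // destroy: true empties the existing Resource...
        resource.clear();
    }
    for property in &commit.remove {
        // ...remove deletes individual Properties (and their Values)...
        resource.remove(property);
    }
    for (property, value) in &commit.set {
        // ...and set adds the new or edited fields.
        resource.insert(property.clone(), value.clone());
    }
}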

Posting commits using HTTP

Since Commits contain cryptographic proof of authorship, they can be accepted at a public endpoint. There is no need for authentication.

A commit should be sent (using an HTTPS POST request) to a /commit endpoint of an Atomic Server. The server then checks the signature and the author rights, and responds with a 2xx status code if it succeeded, or a 5xx error if something went wrong. The error will be a JSON object.

Serialization with JSON

Let's look at an example Commit:

{
  "subject": "http://examle.com/someResource",
  "createdAt": 1601239744,
  "author": "https://example.com/profile",
  "set": {
    "https://atomicdata.dev/properties/description": "my new resource description"
  },
  "remove": ["https://atomicdata.dev/properties/shortname"],
  "signature": "24c7d4b3c1b6b5f924243d67dbfc33fb680b5d3e2a77614cebe03c4a2840d29a"
}

This Commit can be sent to any Atomic Server. This server, in turn, should verify the signature and the author's rights before the server applies the Commit.

Calculating the signature

The signature is a base64 encoded Ed25519 signature of the deterministically serialized Commit. Calculating the signature is a delicate process that should be followed to the letter - even a single character in the wrong place will result in an incorrect signature, which makes the Commit invalid.

The first step is serializing the commit deterministically. This means that the process will always end in the exact same string.

  • Serialize the Commit as JSON.
  • Do not serialize the signature field.
  • Do not include empty objects or arrays.
  • If destroy is false, do not include it.
  • All keys are sorted alphabetically - both in the root object and in any nested objects.
  • The JSON is minified: no newlines, no spaces.
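
To make these rules concrete, here's a minimal sketch using the serde_json crate. A BTreeMap keeps the root keys sorted, serde_json's default map type keeps nested object keys sorted, and its default output is minified:

use std::collections::BTreeMap;
use serde_json::Value;

// A minimal sketch of deterministic Commit serialization, not the reference implementation.
fn serialize_deterministically(commit: &BTreeMap<String, Value>) -> String {
    let mut subset = commit.clone();
    // Never include the signature field in the signed payload.
    subset.remove("signature");
    // Do not include empty objects or arrays.
    subset.retain(|_, v| match v {
        Value::Object(map) => !map.is_empty(),
        Value::Array(arr) => !arr.is_empty(),
        _ => true,
    });
    // If destroy is false, do not include it.
    if subset.get("destroy") == Some(&Value::Bool(false)) {
        subset.remove("destroy");
    }
    serde_json::to_string(&subset).expect("JSON values always serialize")
}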

Here's an example implementation of this process written in Rust.

This will result in a string. The next step is to sign this string using the Ed25519 private key. This signature is a byte array, which should be encoded in base64. Make sure that the Author's URL resolves to a Resource that contains the linked public key.
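
The signing step could then look like this. A sketch using the ed25519_dalek (1.x) and base64 crates; exact APIs differ between versions:

use ed25519_dalek::{Keypair, Signer};

// Sign the deterministically serialized Commit and base64-encode the signature.
fn sign_commit(serialized: &str, keypair: &Keypair) -> String {
    let signature = keypair.sign(serialized.as_bytes());
    base64::encode(signature.to_bytes())
}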

Congratulations, you've just created a valid Commit!

Author

An Author is a person, organization, computer or other type of agent that can create and sign Commits. The most important property of an Author is its publicKey, which others use to verify the signatures of its Commits.

Atomic Commits compared to other (RDF) delta models

Let's compare the Atomic Commit approach with some existing protocols for communicating state changes / patches / mutations / deltas in linked data or JSON. First, I'll briefly discuss the existing examples (open a PR / issue if we're missing something!). After that, we'll discuss how Atomic Data differs from the existing ones.

RDF-Delta

https://afs.github.io/rdf-delta/

Describes changes (RDF Patches) in a specialized turtle-like serialization format.

TX .
PA "rdf" "http://www.w3.org/1999/02/22-rdf-syntax-ns#" .
PA "owl" "http://www.w3.org/2002/07/owl#" .
PA "rdfs" "http://www.w3.org/2000/01/rdf-schema#" .
A <http://example/SubClass> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Class> .
A <http://example/SubClass> <http://www.w3.org/2000/01/rdf-schema#subClassOf> <http://example/SUPER_CLASS> .
A <http://example/SubClass> <http://www.w3.org/2000/01/rdf-schema#label> "SubClass" .
TC .

Similar to Atomic Commits, these Deltas should have identifiers (URLs), which are denoted in a header.

Delta-LD

http://www.tara.tcd.ie/handle/2262/91407

PatchR

https://www.igi-global.com/article/patchr/135561

LD-Patch

https://www.w3.org/TR/ldpatch/

PATCH /timbl HTTP/1.1
Host: example.org
Content-Length: 478
Content-Type: text/ldpatch
If-Match: "abc123"

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix schema: <http://schema.org/> .
@prefix profile: <http://ogp.me/ns/profile#> .
@prefix ex: <http://example.org/vocab#> .

Delete { <#> profile:first_name "Tim" } .
Add {
  <#> profile:first_name "Timothy" ;
    profile:image <https://example.org/timbl.jpg> .
} .

Bind ?workLocation <#> / schema:workLocation .
Cut ?workLocation .

UpdateList <#> ex:preferredLanguages 1..2 ( "fr-CH" ) .

Bind ?event <#> / schema:performerIn [ / schema:url = <https://www.w3.org/2012/ldp/wiki/F2F5> ]  .
Add { ?event rdf:type schema:Event } .

Bind ?ted <http://conferences.ted.com/TED2009/> / ^schema:url ! .
Delete { ?ted schema:startDate "2009-02-04" } .
Add {
  ?ted schema:location [
    schema:name "Long Beach, California" ;
    schema:geo [
      schema:latitude "33.7817" ;
      schema:longitude "-118.2054"
    ]
  ]
} .

Linked-Delta

https://github.com/ontola/linked-delta

An N-Quads serialized delta format. Methods are URLs, which means they are extensible. Does not specify how to bundle lines. Used in production in a web app that we're working on. Designed with simplicity (no new serialization format, simple to parse) and performance in mind.

Initial state:

<http://example.org/resource> <http://example.org/predicate> "Old value 🙈" .

Linked-Delta:

<http://example.org/resource> <http://example.org/predicate> "New value 🐵" <http://purl.org/linked-delta/replace> .

New state:

<http://example.org/resource> <http://example.org/predicate> "New value 🐵" .

JSON-PATCH

http://jsonpatch.com/

A simple way to edit JSON objects:

The original document

{
  "baz": "qux",
  "foo": "bar"
}

The patch

[
  { "op": "replace", "path": "/baz", "value": "boo" },
  { "op": "add", "path": "/hello", "value": ["world"] },
  { "op": "remove", "path": "/foo" }
]

The result

{
  "baz": "boo",
  "hello": ["world"]
}

It uses the JSON-Pointer spec for denoting paths. It has quite a few implementations, in various languages.

JSON-LD-PATCH

https://github.com/digibib/ls.ext/wiki/JSON-LD-PATCH

A JSON denoted patch notation for RDF. Seems similar to the RDF/JSON serialization format. Uses string literals as operators / methods. Conceptually perhaps most similar to linked-delta.

[
  {
    "op": "add",
    "s": "http://example.org/my/resource",
    "p": "http://example.org/ontology#title",
    "o": {
      "value": "New Title",
      "type": "http://www.w3.org/2001/XMLSchema#string"
    }
  }
]

SPARQL UPDATE

https://www.w3.org/TR/sparql11-update/

SPARQL queries that change data.

PREFIX dc: <http://purl.org/dc/elements/1.1/>
INSERT DATA
{
  <http://example/book1> dc:title "A new book" ;
                         dc:creator "A.N.Other" .
}

Allows for very powerful queries, combined with updates. E.g. rename all persons named Bill to William:

PREFIX foaf:  <http://xmlns.com/foaf/0.1/>

WITH <http://example/addresses>
DELETE { ?person foaf:givenName 'Bill' }
INSERT { ?person foaf:givenName 'William' }
WHERE
  { ?person foaf:givenName 'Bill'
  }

SPARQL Update is the most powerful of the formats, but also perhaps the most difficult to implement and understand.

Atomic Commits

Let's talk about the differences between the concepts above and Atomic Commits.

For starters, Atomic Commits can only work with a specific subset of RDF, namely Atomic Data. RDF allows for blank nodes, does not have subject-predicate uniqueness and offers named graphs - which all make it hard to unambiguously select a single value. Most of the alternative patch / delta models described above had to support these concepts. Atomic Data is more strict and constrained than RDF. It does not support named graphs and blank nodes. This enables a simpler approach to describing state changes, but it also means that Atomic Commits will not work with most existing RDF data.

Secondly, individual Atomic Commits are tightly coupled to specific Resources. A single Commit cannot change multiple resources - whereas most of the models discussed above do enable this. This is a big constraint, and it does not allow for things like compact migrations in a database. However, this resource-bound constraint opens up some interesting possibilities:

  • it becomes easier to combine it with authorization (i.e. check if the person has the correct rights to edit some resource): simply check if the Author has the rights to edit the Subject.
  • it makes it easier to find all Commits for a Resource, which is useful when constructing a history / audit log / previous version.

Thirdly, Atomic Commits don't introduce a new serialization format. It's just JSON. This means that it will feel familiar for most developers, and will be supported by many existing environments.

Finally, Atomic Commits use cryptography (hashing and signing) to determine the authenticity of commits. This concept is borrowed from git commits, which also use signatures to prove authorship. As is the case with git, this also allows for verifiable P2P sharing of changes.

Interoperability: Relation to other technology

Atomic data is designed to be highly interoperable. It's also serializable to RDF, which includes Turtle, N-triples, RDF/XML and other serialization formats.

Data formats

  • JSON: Atomic Data is designed to be easily serializable to clean, idiomatic JSON. However, if you want to turn JSON into Atomic Data, you'll have to make sure that all keys in the JSON object are URLs that link to Atomic Properties, and the data itself also has to be available at its Subject URL.
  • RDF: Atomic Data is a strict subset of RDF, and can therefore be trivially serialized to all RDF formats (Turtle, N-triples, RDF/XML, JSON-LD, and others). The other way around is more difficult. Turning RDF into Atomic Data requires that all predicates are Atomic Properties, the values must match its properties datatype, the atoms must be available at the subject URL, and the subject-predicate combinations must be unique.

Protocols

  • IPFS: Content-based addressing to prevent 404s and centralization

How does Atomic Data relate to RDF?

RDF (the Resource Description Framework) is a W3C specification from 1999 that describes the original data model for linked data. It is the forerunner of Atomic Data, and is therefore highly similar in its model. Both heavily rely on using URLs, and both have a fundamentally simple and uniform model for data statements. Both view the web as a single, connected graph database. Because of that, Atomic Data is also highly compatible with RDF - all Atomic Data can be converted into valid RDF. Atomic Data can be thought of as a more constrained, type safe version of RDF. However, it does differ in some fundamental ways.

  • Atomic calls the three parts of a Triple subject, property and value, instead of subject, predicate, object.
  • Atomic does not support having multiple statements with the same <subject> <predicate>, every combination should be unique.
  • Atomic has no difference between literal, named node and blank node objects - these are all values, but with different datatypes.
  • Atomic uses paths instead of blank nodes.
  • Atomic requires URL (not URI) values in its subjects and predicates (properties), which means that they should be resolvable.
  • Atomic only allows those who control a resource's subject URL endpoint to edit the data. This means that you can't add triples about something that you don't control.
  • Atomic has no separate datatype field, but it requires that Properties (the resources that are shown when you follow a predicate value) specify a datatype.
  • Atomic has no separate language field, but it does support Translation Resources.
  • Atomic has a native Event (state changes) model (Atomic Mutations), which enables communication of state changes.
  • Atomic has a native Schema model (Atomic Schema), which helps developers to know what data types they can expect (string, integer, link, array).

Why these changes?

I love RDF, and have been working with it for quite some time now. Using URIs (and more so URLs, which are URIs that can be fetched) for everything is a great idea, since it helps with interoperability and enables truly decentralized knowledge graphs. However, some of the characteristics of RDF might have contributed to its relative lack of adoption.

It's too hard to select a specific value (object) in RDF

For example, let's say I want to render someone's birthday:

<example:joep> <schema:birthDate> "1991-01-20"^^xsd:date

Rendering this item might be as simple as fetching the subject URL, filtering by predicate URL, and parsing the object as a date.

However, this is also valid RDF:

<example:joep> <schema:birthDate> "1991-01-20"^^xsd:date <example:someNamedGraph>
<example:joep> <schema:birthDate> <example:birthDateObject> <example:someOtherNamedGraph>
<example:joep> <schema:birthDate> "20th of januari 1991"@en <example:someNamedGraph>
<example:joep> <schema:birthDate> "20 januari 1991"@nl <example:someNamedGraph>
<example:joep> <schema:birthDate> "2000-02-30"^^xsd:date <example:someNamedGraph>

Now things get more complicated if you just want to select the original birthdate value:

  1. Select the named graph. The triple containing that birthday may exist in some named graph different from the subject URL, which means that I first need to identify and fetch that graph.
  2. Select the subject.
  3. Select the predicate.
  4. Select the datatype. You probably need a specific datatype (in this case, a Date), so you need to filter the triples to match that specific datatype.
  5. Select the language. Same could be true for language, too, but that is not necessary in this birthdate example.
  6. Select the specific triple. Even after all our previous selectors, we still might have multiple values. How do I know which is the triple I'm supposed to use?

To be fair, with a lot of RDF data, only steps 2 and 3 are needed, since there are often no subject-predicate collisions. And if you control the data of the source, you can set any constraints that you like, including subject-predicate uniqueness. But if you're building a system that uses arbitrary RDF, that system also needs to deal with steps 1, 4, 5 and 6. That often means writing a lot of conditionals and other client-side logic to get the value that you need. It also means that serializing to a format like JSON becomes complicated - you can't just map predicates to keys - you might get collisions. Oh, and you can't use key-value stores for storing RDF, at least not in a trivial way. This complexity is the direct result of the lack of subject-predicate uniqueness.

As a developer who uses RDF data, I want to be able to do something like this:

// Fetches the resource
const joep = get("https://example.com/person/joep")

// Returns the value of the birthDate atom
console.log(joep.birthDate()) // => Date(1991-01-20)
// Fetches the employer relation at possibly some other domain, checks that resource for a property with the 'name' shortkey
console.log(joep.employer().name()) // => "Ontola.io"

Basically, I'd like to use all knowledge of the world as if it were a big JSON object. Being able to do that requires some things that are present in JSON, and some things that are present in RDF:

  • Traverse data on various domains (which is already possible with RDF)
  • Have unique subject-predicate combinations (which is default in JSON)
  • Map properties URLs to keys (which often requires local mapping with RDF, e.g. in JSON-LD)
  • Link properties to datatypes (which is possible with ontologies like SHACL / SHEX)

Less focus on semantics, more on usability

One of the core ideas of the semantic web, is that anyone should be able to say anything about anything, using semantic triples. This is one of the reasons why it can be so hard to select a specific value in RDF. When you want to make all graphs mergeable (which is a great idea), but also want to allow anyone to create any triples about any subject, you get subject-predicate non-uniqueness. For the Semantic Web, having semantic triples is great. For linked data, and connecting datasets, having atomic triples (with unique subject-predicate combinations) seems preferable. Atomic Data chooses a more constrained approach, which makes it easier to use the data, but at the cost of some expressiveness.

Changing the names

RDF's subject, predicate and object terminology can be confusing to newcomers, so Atomic Data uses subject, property, value. This more closely resembles common CS terminology. (discussion)

Subject + Predicate uniqueness

In RDF, it's very much possible for a graph to contain multiple statements that share both a subject and a predicate. One of the reasons this is possible, is because RDF graphs should always be mergeable. However, this introduces some extra complexity for data users. Whereas most languages and datatypes have key-value uniqueness that allows for unambiguous value selection, RDF clients have to deal with the possibility that multiple triples with the same subject-predicate combination might exist.

Atomic Data requires subject-property uniqueness, which means that this is no longer an issue for clients. However, in order to guarantee this and still retain graph mergeability, we also need to limit who creates statements about a subject:

Limiting subject usage

RDF allows anne.com to create and host statements about the subject john.com. In other words, domain A creates statements about domain B. It allows anyone to say anything about any subject, thus allowing for extending data that is not under your control.

For example, developers at both Ontola and Inrupt (two companies that work a lot with RDF) use this feature to extend the Schema.org ontology with translations. This means they can still use standards from Schema.org, and have their own translations of these concepts.

However, I think this is a flawed approach. In the example above, two companies are adding statements about a subject. In this case, both are adding translations. They're doing the same work twice. And as more and more people use that same resource, they will be forced to add the same translations, again and again.

I think one of the core perks of linked data is being able to make your information highly re-usable. When you've created statements about an external thing, these statements are hard to re-use.

This means that someone using RDF data about domain B cannot know whether domain B is actually the source of that data. Knowing where data comes from is one of the great things about URIs, but RDF does not guarantee that a subject resolves to the triples that describe it - many subjects in RDF don't resolve at all. The conceptual model would be much simpler if statements about a subject could only be made from the domain of the owner of that subject. When triples about a resource are created somewhere other than where the subject is hosted, those triples are hard to share.

The way RDF projects deal with this, is by using named graphs. As a consequence, all systems that use these triples should keep track of another field for every atom. To make things worse, it makes subject-predicate uniqueness impossible to guarantee. That's a high price to pay.

I've asked two colleagues working on RDF about this constraint, and both were critical.

No more literals / named nodes

In RDF, an object can either be a named node, blank node or literal. A literal has a value, a datatype and an optional language (if the literal is a string). Although RDF statements are often called triples, a single statement can consist of five fields: subject, predicate, object, language, datatype. Five fields is far more than most information systems use; usually we have just a key and a value. This difference leads to compatibility issues when using RDF in applications. In practice, clients have to run a lot of checks before they can use the data - which makes RDF harder to use than something like JSON in most contexts.

Atomic Data drops the named node / literal distinction. We just have values, and they are interpreted by looking at the datatype, which is defined in the property. When a value is a URL, we don't call it a named node, but we simply use a URL datatype.

Requiring URLs

RDF allows any type of URI for subjects and predicates, which means they can be URLs, but don't have to be. This means they don't always resolve, or even function as locators. When the links don't work, that restricts how useful the links are. Atomic Data takes a different approach: these links MUST resolve. Requiring Properties to resolve is part of what enables the type system of Atomic Schema - they provide the shortname and datatype.

Requiring URLs makes things easier for data users, at the cost of the data producer. With Atomic Data, the data producer MUST offer the triples at the URL of the subject. This is a challenge - especially with the current (lack of) tooling.

However, making sure that links actually work offers tremendous benefits for data consumers, and that advantage is often worth the extra trouble.

Replace blank nodes with paths

Blank nodes are resources with identifiers that exist only locally. They make life easier for data producers, who can easily create nested resources without having to mint all the URLs. In most data models, blank nodes are the default. For example, we nest JSON objects without thinking twice.

Unfortunately, blank nodes tend to make things harder for clients. These clients will now need to keep track of where these blank nodes came from, and they need to create internal identifiers that will not collide. Cache invalidation with blank nodes also becomes a challenge. To make this a bit easier, Atomic Data introduces a new way of dealing with names of things that you have not given a URL yet: Atomic Paths.

Since Atomic Data has subject-predicate uniqueness, we can use the path of triples as a unique identifier:

https://example.com/john https://schema.org/employer

So the way an Atomic Data store should store blank nodes, is simply as an atom with a Path as its URL. This prevents collisions and still makes it easy to point to a specific value.
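As an illustration, here's a minimal TypeScript sketch of resolving such a path against an in-memory store (the nested-map store layout is an assumption of this sketch, not something the spec prescribes):

// Hypothetical in-memory store: subject -> property -> value.
// Subject-property uniqueness is what makes this nested-map layout possible.
type Store = Map<string, Map<string, string>>;

// Resolves an Atomic Path: a subject URL followed by one or more property URLs.
function resolvePath(store: Store, path: string): string | undefined {
  const [subject, ...properties] = path.split(" ");
  let current = subject;
  for (const property of properties) {
    const value = store.get(current)?.get(property);
    if (value === undefined) return undefined; // the path cannot be resolved
    current = value; // follow the link; intermediate values must be URLs
  }
  return current;
}

// resolvePath(store, "https://example.com/john https://schema.org/employer")
// returns the value of john's employer Atom.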

Serialization formats are free to use nesting to denote paths - which means that it is not necessary to include these path strings explicitly in most serialization formats.

You can read more about Atomic Paths here.

Combining datatype and predicate

Having both a datatype and a predicate value can lead to confusing situations. For example, the schema:dateCreated Property requires an ISO DateTime string (according to the schema.org definition), but using a value true with an xsd:boolean datatype results in perfectly valid RDF. This means that client software using triples with a schema:dateCreated predicate cannot safely assume that its value will be a DateTime. So if the client wants to use schema:dateCreated values, it must also specify which type of data it expects, check the datatype field of every Atom, and provide logic for when these don't match. It is also important that combining datatype and predicate fits the mental model of most programmers and languages better - just look at how every single struct / model / class / shape is defined in programming languages: key: datatype. This is why Atomic Data requires that a predicate links to a Property which must have a Datatype.
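As a rough sketch of what this buys a client, consider the following TypeScript (the Property interface and the validation logic are illustrative, and the datatype URLs are assumptions modeled on the atomicdata.dev examples used elsewhere in this book):

// Hypothetical Property shape, following the Atomic Schema fields described in this book.
interface Property {
  subject: string;   // the URL of the Property itself
  shortname: string; // e.g. "dateCreated"
  datatype: string;  // e.g. "https://atomicdata.dev/datatypes/dateTime"
}

// Because the datatype lives in the Property, a client can validate a value
// without inspecting per-statement datatype fields:
function validate(property: Property, value: string): boolean {
  switch (property.datatype) {
    case "https://atomicdata.dev/datatypes/dateTime":
      return !Number.isNaN(Date.parse(value));
    case "https://atomicdata.dev/datatypes/boolean":
      return value === "true" || value === "false";
    default:
      return true; // this sketch does not validate other datatypes
  }
}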

Adding shortnames (slugs / keys) in Properties

Using full URI strings as keys (in RDF predicates) results in a relatively clunky Developer Experience. Consider the short strings that developers are used to in pretty much all languages and data formats (object.attribute). Adding a required / tightly integrated key mapping (from long URLs to short, simple strings) in Atomic Properties solves this issue, and provides developers a way to write code like this: someAtomicPerson.bestFriend.name => "Britta". Although the RDF ecosystem does have some solutions for this (@context objects in JSON-LD, @prefix mappings, the @ontologies library), these prefixes are not defined in Properties themselves and therefore are often defined locally or separate from the ontology, which means that developers have to manually map them most of the time. This is why Atomic Data introduces a shortname field in Properties, which forces modelers to choose a 'key' that can be used in ORM contexts.

Adding native arrays

RDF lacks a clear solution for dealing with ordered data, resulting in confusion when developers have to create lists of content. Adding an Array data type as a base data type helps solve this. (discussion)

Adding a native mutation standard

There is no integrated standard for communicating state changes. Although linked-delta and rdf-delta do exist, they aren't referred to by the RDF spec. I think developers need guidance when learning a new system such as RDF, and that's why Atomic Mutations is included in this book.

Adding a schema language

A schema language is necessary to constrain and validate instances of data. This is very useful when creating domain-specific standards, which can in turn be used to generate forms or language-specific types / interfaces. Shape validations are already possible in RDF using both SHACL and SHEX, and these are both very powerful and well designed.

However, with Atomic Data, I'm going for simplicity. This also means providing an all-inclusive documentation. I want people who read this book to have a decent grasp of creating, modeling, sharing, versioning and querying data. It should provide all information that most developers (new to linked data) will need to get started quickly. Simply linking to SHACL / SHEX documentation could be intimidating for new developers, who simply want to define a simple shape with a few keys and datatypes.

Also, SHACL requires named graphs (which are not specified in Atomic Data) and SHEX requires a new serialization format, which might limit adoption. Atomic Data has some unique constraints (such as subject-predicate uniqueness) which also might make things more complicated when using SHEX / SHACL.

However, it is not the intention of Atomic Data to create a modeling abstraction that is just as powerful as the ones mentioned above, so perhaps it is better to include a SHACL / SHEX tutorial and come up with a nice integration of both worlds.

A new name, with new docs

Besides the technical reasons described above, I think that there are social reasons to start with a new concept and give it a new name:

  • The RDF vocabulary is intimidating. When trying to understand RDF, you're likely to traverse many pages with new concepts: literal, named node, graph, predicate, named graph, blank node... The core specification provides a formal description of these concepts, but fails to do this in a way that results in quick understanding and workable intuitions. Even experienced RDF developers tend to be confused about the nuances of the core model.
  • There is a lack of learning resources that provide a clear, complete answer to the lifecycle of RDF data: modeling data, making data, hosting it, fetching it, updating it. Atomic Data aims to provide an opinionated answer to all of these steps. It feels more like a one-stop-shop for questions that developers are likely to encounter, whilst keeping the extendability.
  • All Core / Schema URLs should resolve to simple, clear explanations with both examples and machine readable definitions. Especially the Property and Class concepts.
  • The Semantic Web community has had a lot of academic attention from formal logic departments, resulting in a highly developed standard for knowledge modeling: the Web Ontology Language (OWL). While this is mostly great, its open-world philosophy and focus on reasoning abilities can confuse developers who are simply looking for a simple way to share models in RDF.

Convert RDF to Atomic Data

  • All the subject URLs MUST actually resolve, and return all triples about that subject. All blank nodes should be converted into URLs. Atomic Data tools might help to achieve this, for example by hosting the data.
  • All predicates SHOULD resolve to Atomic Properties, and these SHOULD have a datatype. You will probably need to change predicate URLs to Atomic Property URLs, or update the things that the predicate points to, so they include the required Atomic Property items (e.g. a Datatype and a Shortname). This also means that the datatype in the original RDF statement can be dropped.
  • Literals with a language tag are converted to TranslationBox resources, which also means their identifiers must be created. Keep in mind that Atomic Data does not allow for blank nodes, so the TranslationBox identifiers must be URLs.

Step by step, it entails:

  1. Set up some server to make sure the URLs will resolve.
  2. Create (or find and refer to) Atomic Properties for all the predicates. Make sure they have a DataType and a Shortname.
  3. If you have triples about a subject that you don't control, change the URL to one that you can control, and refer to that external resource.

Atomic Data will need tooling to facilitate in this process. This tooling should help to create URLs, Properties, and host everything on an easy to use server.
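As an illustration of the conversion itself, here's a small TypeScript sketch (the RdfStatement shape and the propertyMap lookup are assumptions of the sketch):

// The shape of an incoming RDF statement, for illustration.
interface RdfStatement {
  subject: string;
  predicate: string;
  object: string;
  datatype?: string; // can be dropped: the Atomic Property carries the datatype
  language?: string; // would become a TranslationBox resource instead
}

function toAtom(
  statement: RdfStatement,
  // hypothetical lookup from predicate URL to Atomic Property URL
  propertyMap: Map<string, string>
): [string, string, string] {
  const property = propertyMap.get(statement.predicate);
  if (property === undefined) {
    throw new Error(`No Atomic Property found for ${statement.predicate}`);
  }
  // The per-statement datatype is dropped; it is defined in the Property itself.
  return [statement.subject, property, statement.object];
}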

Convert Atomic Data to RDF

Since all Atomic Data is also valid RDF, it's trivial to convert / serialize Atoms to RDF. However, unlike Atomic Data, RDF has optional Language and Datatype elements in every statement. It is good practice to use these RDF concepts when serializing Atomic Data into Turtle, RDF/XML, or other RDF serialization formats.

  • Convert Atoms with linked TranslationBox Resources to Literals with an xsd:string datatype and the corresponding language in the tag.
  • Dereference the Datatype from each Atomic Property, and add the datatype URLs to the RDF statements.
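A minimal sketch of such a serializer in TypeScript (the function and its treatment of links are illustrative, not normative):

// Serializes one Atom to an N-Triples style statement. The datatype parameter
// comes from dereferencing the Atomic Property, as the second bullet describes.
function atomToRdf(
  subject: string,
  property: string,
  value: string,
  datatype: string
): string {
  const object =
    datatype === "https://atomicdata.dev/datatypes/atomicURI"
      ? `<${value}>` // links become named nodes
      : `"${value}"^^<${datatype}>`; // other values become typed literals
  return `<${subject}> <${property}> ${object} .`;
}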

How does Atomic Data relate to JSON?

Because JSON is so popular, Atomic Data is designed to be easily serializable to JSON.

Atomic Data is a strict subset of RDF, and the most popular serialization of RDF for JSON data is JSON-LD. All JSON-LD is perfectly valid JSON, but with a couple of handy features.

From JSON to Atomic Data

Atomic Data requires a bit more information about pieces of data than JSON tends to contain. Let's take a look at a regular JSON example:

{
  "name": "John",
  "birthDate": "1991-01-20"
}

We need more information to convert this JSON into Atomic Data. The following things are missing:

  • What is the Subject URL of the resource being described?
  • What are the Predicate URLs of the keys being used (name and birthDate)? And consequently, how should the values be parsed? What are their DataTypes?

We can add this data by adding some @context:

{
  "@context": {
    "name": "https://example.com/properties/name",
    "birthDate": "https://example.com/properties/birthDate"
  },
  "@id": "https://example.com/people/john",
  "name": "John",
  "birthDate": "1991-01-20"
}

The JSON above is called JSON-LD. It is still perfectly valid JSON, but it contains more information, and in turn can be converted into RDF formats.

From Atomic Data to JSON-LD

Since Atomic Schema requires the presence of a shortname in Properties, converting Atomic Data to JSON results in dev-friendly objects with nice shorthands.

["https://example.com/john","https://example.com/properties/lastname","Houdini"]
["https://example.com/john","https://example.com/properties/bestFriend","https://example.com/sarah"]

Can be automatically converted to:

{
  "@context": {
    "lastname": "https://example.com/properties/lastname",
    "bestFriend": "https://example.com/properties/bestFriend"
  },
  "@id": "https://example.com/john",
  "lastname": "Houdini",
  "bestFriend": {
    "@id": "https://example.com/sarah"
  }
}

The @context object provides a mapping to the original URLs. The @id keys show which resource is being described, and that a value should be interpreted as a link (a URI).

JSON-LD Requirements

  • Make sure the URLs used in the @context resolve to Atomic Properties.
  • Convert JSON-LD arrays into ResourceArrays.
  • Creating nested JSON objects is possible (by resolving the identifiers from @id relations), but it is up to the serializer to decide how deep this object nesting should happen.

Considerations

  • Whilst JSON-LD is great for traditional JSON usage (dot.syntax ORM style navigation of objects), it is not great for linked data usage.

Atomic Data and IPFS

What is IPFS

IPFS (the InterPlanetary File System) is a standard that enables decentralized file storage and retrieval using content-based identifiers. Instead of using an HTTP URL like http://example.com/helloworld, it uses the IPFS scheme, such as ipfs:QmX6j9DHcPhgBcBtZsuRkfmk2v7G5mzb11vU9ve9i8vDsL. IPFS identifies things based on their unique content hash (the long, seemingly random string) using a thing called a Merkle DAG (this great article explains it nicely). This hash is called a CID, or Content ID. This simple idea (plus some not so simple network protocols) allows for decentralized, tamper-proof storage of data. This fixes some issues with HTTP that are related to its centralized philosophy: no more 404s!

Why is IPFS especially interesting for Atomic Data

Atomic Data is highly dependent on the availability of Resources, especially Properties and Datatypes. These resources are meant to be re-used a lot, and serving them from a single HTTP host would make that host a bottleneck and a single point of failure. Content-addressed identifiers let heavily re-used resources be cached and served by anyone, which makes them cheap to distribute and hard to take offline.

Considerations using IPFS URLs

They are static: their contents can never change. This is great for some types of data, but horrible for others. If you're describing a time-dependent thing (such as a person's job), a static identifier is a poor fit, because you can't update the resource. If you're describing personal, private information, it's also a bad idea to use IPFS, because it's designed to be permanent. Also, IPFS is not as fast as HTTP - at least for now.

Example of Atomic Data on IPFS

Here's an example, serialized to Atomic-NDJSON:

https://ipfs.io/ipfs/QmX6j9DHcPhgBcBtZsuRkfmk2v7G5mzb11vU9ve9i8vDsL

["https://atomicdata.dev/helloworld","https://atomicdata.dev/properties/description","Hello world!"]

Atomic Data and IPLD

IPLD (not IPFS) stands for InterPlanetary Linked Data, but is not related to RDF. The scope seems fundamentally different from RDF, too, but I have to read more about this. TODO!

Atomic Graph Validations

A Graph is a set of Atoms. Since Atomic Data is designed to facilitate decentralized data storage, Graphs will often lack information or contain invalid data. In this section, we define some concepts for describing the state of a Graph.

  • A Valid Graph contains no mismatches between Datatypes from Properties and their usage in Atoms
  • A Closed Graph contains no unfetched outgoing links
  • A Verified Graph contains only Atoms from verified Authors
  • A Schema Complete Graph contains all used linked Properties
  • A Frozen Graph contains content-addressing identifiers (e.g. IPFS), all the way down

These concepts are important when creating an implementation of a Store.

You can validate AD3 at atomicdata.dev/validate

Valid Graphs

We refer to a Graph as Valid, if the following constraints are met:

  • The Datatypes are correctly used. The Graph does not contain Atoms where the Datatype of the Value does not match the Datatype of the Property of the Atom.
  • The links work. All URLs used in the Graph (Subject, Property, Value) resolve correctly to the required Datatype.
  • The Class Restrictions are met. If a Class sets required properties, these must be present in Resources that are instances of that Class.

Making sure Graphs are Valid is of great importance to anyone creating, sharing or using Atomic Data. Services should specify whether they check the validity of graphs.

Closed Graphs

A Graph is Closed, when the Resources of all URLs are present in the Graph. In other words, if you were to fetch and download every single URL in a Graph, you would not have any more Atoms than before. There are no more unfetched outgoing links.

Closed Graphs are rarely required in Atomic Data; it's often perfectly fine to have outgoing links that have not been fetched.
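A minimal TypeScript sketch of the Closed check (the Atom tuple type and the "starts with http" link heuristic are assumptions of the sketch):

type Atom = [subject: string, property: string, value: string];

// A Graph is Closed when every URL it mentions also occurs as a subject.
// This sketch naively treats any value starting with "http" as a link.
function isClosed(graph: Atom[]): boolean {
  const subjects = new Set(graph.map(([subject]) => subject));
  return graph.every(
    ([, property, value]) =>
      subjects.has(property) && (!value.startsWith("http") || subjects.has(value))
  );
}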

Verified Graphs

When you are given some Atomic Graph by someone, you initially don't know for sure whether the Atoms were actually created by the one controlling the subject URL. Someone may have tampered with the data, or fabricated it.

The process of Verification can be done in two ways:

  1. Request the subjects, and check if the atoms match.
  2. Verify the signatures of the Resources or Mutations

When one of these steps is taken, we say that the Graph is Verified.
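Here's a rough TypeScript sketch of the first method (the AD3-over-NDJSON response format and the Accept header value are assumptions, not part of the spec):

type Atom = [subject: string, property: string, value: string];

// Request each subject and check that the atoms we were given are among
// the atoms the source actually serves.
async function verifyBySource(graph: Atom[]): Promise<boolean> {
  for (const [subject, property, value] of graph) {
    const response = await fetch(subject, {
      headers: { Accept: "application/ad3-ndjson" }, // hypothetical MIME type
    });
    const served: Atom[] = (await response.text())
      .trim()
      .split("\n")
      .map((line) => JSON.parse(line));
    const match = served.some(
      ([s, p, v]) => s === subject && p === property && v === value
    );
    if (!match) return false; // the atom does not match what the source serves
  }
  return true;
}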

Schema Complete Graphs

When a Graph has a set of Atoms, it might not possess all the information that is required to determine the datatype of each Atom. When it does possess all that information, we say the Graph is Schema Complete.

Having a Schema Complete Graph is essential for determining what the Datatype is of a Value. Most implementations of Atomic Data will need Schema Completeness to create fitting views, or apply functional business logic.

Imagine some application (perhaps an app running inside a web-browser) that has only the following data:

["https://example.com/john","https://example.com/birthDate","1991-01-20"]

Now, by looking at this single Atom, we might assume that the Value is an ISO date, but this type information is not yet known to the application. This type information should be specified in the example:birthDate Property. It is the responsibility of the application to make sure it possesses the required Schema data.

We say a Graph is Schema Complete when it contains at least all the Properties that are used in the Property fields of its Atoms.

So let's add the missing Property: https://example.com/birthDate

["https://example.com/john","https://example.com/birthDate","1991-01-20"]
["https://example.com/birthDate","https://atomicdata.dev/datatypes/Datatype","https://atomicdata.dev/datatypes/dateTime"]

Now, since we've introduced yet another Property, we need to include that one as well:

["https://example.com/john","https://example.com/birthDate","1991-01-20"]
["https://example.com/birthDate","https://atomicdata.dev/datatypes/Datatype","https://atomicdata.dev/datatypes/dateTime"]
["https://atomicdata.dev/datatypes/Datatype","https://atomicdata.dev/datatypes/Datatype","https://atomicdata.dev/datatypes/atomicURI"]

Since all valid Atomic Data requires Property fields to resolve to Atomic Properties, which in turn are required to have an associated DataType, we can safely say that the last atom in the example above (the one describing https://atomicdata.dev/datatypes/Datatype) will have to be present in all Schema Complete Atomic Graphs.
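The Schema Completeness check itself is simple. A TypeScript sketch (the Atom tuple type is an assumption):

type Atom = [subject: string, property: string, value: string];

// A Graph is Schema Complete when every Property used in its Atoms
// is itself described in the Graph, so every Datatype can be determined.
function isSchemaComplete(graph: Atom[]): boolean {
  const subjects = new Set(graph.map(([subject]) => subject));
  return graph.every(([, property]) => subjects.has(property));
}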

Frozen Graphs

A Frozen Graph consists only of resources with content-addressing identifiers as Subjects. A content-addressable URL (such as an IPFS URL) refers to specific immutable content, that is absolutely certain not to change over time. Due to its static nature, we call it Frozen. As long as a graph contains links to HTTP Resources, it is not Frozen, since responses from that HTTP address might change over time.

Freezing a Graph, therefore, entails converting all Resources to IPFS (or another content-addressing scheme) Resources, and using only content-addressed URLs.

Freezing a Graph has performance benefits for clients, since clients can easily verify whether they already have (part of) the Graph locally, simply by comparing the URLs of Resources. It also helps to make sure the content can be shared peer-to-peer.

Note that Graphs with cyclical relations cannot be frozen, since every iteration that you'd try to freeze will change its references and therefore also its contents, and therefore also its content hash.
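Again as a sketch (same assumed Atom tuple type; for brevity, ipfs: is treated as the only content-addressing scheme):

type Atom = [subject: string, property: string, value: string];

// A Graph is Frozen when it contains no mutable (HTTP) URLs at all:
// subjects are content-addressed, and no field links to an HTTP Resource.
function isFrozen(graph: Atom[]): boolean {
  return graph.every(
    ([subject, property, value]) =>
      subject.startsWith("ipfs:") &&
      !property.startsWith("http") &&
      !value.startsWith("http")
  );
}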

Tooling for Atomic Data

Because Atomic Data is very young, little tooling for Atomic Data exists. Great tooling is required to make this a success.

Existing tooling

atomic-cli

A tool for generating / querying Atomic Data from the command line.

# Add a mapping, and store the Atomic Class locally
atomic map person https://example.com/person
# Create a new instance with that Class
atomic new person
name (required): John McLovin
age: 31
Created at: ipfs:Qwhp2fh3o8hfo8w7fhwo77w38ohw3o78fhw3ho78w3o837ho8fwh8o7fh37ho
# link to an Atomic Server where you can upload your stuff
# If you don't, your data exists locally and gets published to IPFS
atomic setup
# install ontologies and add their shortnames to bookmarks
atomic install https://atomicdata.dev/ontologies/meetings
# when no URL is given, use the Ontola repo's ontologies
atomic install meetings

MIT licensed repo here.

atomic-lib (Rust)

Library that contains:

  • An in-memory store
  • Parsing (AD3) / Serialization (AD3, JSON, more to come)
  • Path traversal
  • Basic validation

MIT licensed repo here.

atomic-server

Server for hosting Atomic Data. Uses atomic-lib.

  • Responds to requests for created Atomic Resources, makes atomic data available at their URL.
  • Manages data on disk.
  • Useful query options (e.g. Triple Pattern Fragments)
  • Browser-friendly HTML presentation, JSON serialization, AD3 serialization.

MIT licensed repo here.

Some ideas for tooling

This section contains a set of ideas that would help achieve that success.

ATOML / VSCode Extension

Extending the TOML format to map it to Atomic Classes. This will make editing .TOML files awesome by providing on-screen validation, autocompletion and documentation for fields.

Atomizer (data importer and conversion kit)

  • Import data from some data source (CSV / SQL / JSON / RDF), fill in the gaps (mapping / IRI creation / datatypes) and create new Atoms
  • Perhaps a CLI, library, GUI or a combination of all of these

Atomic Preview

  • A simple (JS) widget that can be embedded anywhere, which converts an Atomic Graph into an HTML view.
  • Would be useful for documentation, and as a default view for Atomic Data.

Atomic-js (Javascript / Typescript)

A JS-compatible library, accessible as an NPM package, is the most popular and developer-friendly way to get started.

Here's some pseudocode that indicates how it might be used:

import {createStore} from '@atomicdata';

const config = {
  // A URL to a TPF compatible endpoint where the data can be fetched
  tpfEndpoint: "https://example.com/tpf",
  // A URL to an Atomic Mutations endpoint where the client can subscribe to changes
  mutationsEndpoint: "https://example.com/mutations",
  // A URL to an Atomic Suggestions endpoint where the client can send suggested state changes
  suggestionsEndpoint: "https://example.com/suggestions",
};

const store = createStore(config); // Initializes the store

// The `classInitializer` function takes an Atomic Class URI as its argument
// fetches the Class, its Properties and the DataTypes
// and returns a function that lets you create instances of that class
const personBuilder = await store.classInitializer("https://example.com/classes/Person");

// Create an instance of the Person Class
// An Atomic Suggestion is sent to the suggestionsEndpoint
const alice = await personBuilder({
  // The Subject field is optional, but recommended if you want to control its URL.
  // Otherwise, the Server will pick something
  subject: "https://example.com/alice",
  // The IDE is aware of the existing keys and their acceptable values,
  // because a conversion from Atomic Classes and Properties
  // to typescript interfaces can be made automatically
  firstName: "Alice",
  lastName: "Anderson",
  bestFriend: "https://example.com/Bob",
  birthDate: new Date("1991-01-20"),
  // Since the URL in the key below can be fetched, and has a Property + Datatype, the IDE + the compiler can determine that 'true' is an acceptable type.
  "https://example.com/someOtherProperty": true,
})

console.log(alice.subject) //=> Should return the newly created identifier, https://example.com/alice

// Checks the store for the subject, and returns it.
// If the subject does not exist locally, it will fetch it first using the `tpfEndpoint`.
const fetchedAlice = await store.get("https://example.com/alice")

// Because of the keys in Atomic Properties, we can use this dot syntax to traverse the graph and get a value
console.log((await fetchedAlice.path("bestFriend.firstName")).value()); // => "Bob"
// What should happen here?
console.log(await fetchedAlice.bestFriend); // => {...}

// It's also possible to convert a resource to a native JS object.
// By specifying the depth, nested resources will be fetched as well.
const aliceJS = await fetchedAlice.toJS({ depth: 2 })

console.log(aliceJS.bestFriend) // => { name: "Bob", birthdate: Date("1991-01-20") }

I think a Developer Experience similar to the one above is essential for getting people to create linked data. It should be incredibly easy, and this is what enables that. However, realizing a library + IDE support as shown above is hard to say the least, perhaps even impossible. Theoretically, the information is accessible - but I'm not sure whether the IDE and the JS context (e.g. the Typescript compiler) can successfully see which shape is being returned by the classInitializer function.

Atomic Browser

A web-browser application that enables viewing, browsing, navigating Atomic Data.

Get involved

Atomic Data is an open specification, and that means that you're very welcome to share your thoughts and help make this standard as good as possible.

Things you can do:

Authors:

Special thanks to:

  • Thom van Kalkeren (who came up with many great ideas on how to work with RDF, such as HexTuples and linked-delta)
  • Tim Berners-Lee (for everything he did for linked data and the web)
  • Ruben Verborgh (for doing great work with RDF, such as the TPF spec)
  • All the other people who worked on the RDF specification
  • Pat McBennett (lots of valuable feedback on initial Atomic Data docs)