The Data Æther

George Svarovsky
Published in codeburst
8 min read · Dec 14, 2020

Let’s talk about data.

Photo by Markus Spiske from Pexels

Data is intrinsic to software. It’s the inputs and the outputs. It flows in through interfaces, being checked and piped and deconstructed to fine granules, which find their way into and through the tiniest of subroutines. Some of it is discarded and deleted or garbage collected, but some will be transformed and routed and merged into new wholes and delivered.

This flowing of data is a pragmatic metaphor, and it pervades the way we write software. APIs accept requests and produce response bodies. Functions take parameters and emit return values or exceptions. Reactive streams materialize flows as first-class meta-programming citizens. Data is material and malleable.

I’m going to try and look at data a different way:

Data isn’t really like matter. It’s like space.

But what’s the problem with data flowing around like matter?

Worthless Wrangling

The trouble is that we think about data flows too much, and not enough. Too much, because we’re constantly having to wrangle data, without even changing its content, without adding any real value. Not enough, because we also seem to get it wrong, all the time.

Let me suggest three categories of worthless wrangling.

WW One: syntax

Transforming rows into DTOs into objects into JSON into models. The syntax shouldn’t be hard, but it’s worth saying that we still spend an awful lot of time and effort wrangling it.
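To make that concrete, here is a minimal sketch of one such journey in Python. The hospital table, the `HospitalDTO` and `HospitalModel` classes and the `wrangle` function are all hypothetical names invented for illustration. Notice that the content never changes; only its shape does.

```python
import json
import sqlite3
from dataclasses import asdict, dataclass


@dataclass
class HospitalDTO:
    """Hypothetical transfer object for the persistence layer."""
    name: str
    beds: int


@dataclass
class HospitalModel:
    """Hypothetical model on the far side of an API."""
    name: str
    beds: int


def wrangle() -> HospitalModel:
    # 1. The data starts life as a row in a database.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE hospital (name TEXT, beds INTEGER)")
    db.execute("INSERT INTO hospital VALUES (?, ?)", ("St. Elsewhere", 120))
    row = db.execute("SELECT name, beds FROM hospital").fetchone()
    db.close()
    # 2. Row into DTO.
    dto = HospitalDTO(name=row[0], beds=row[1])
    # 3. DTO into JSON, to cross some boundary.
    body = json.dumps(asdict(dto))
    # 4. JSON into a model on the other side.
    fields = json.loads(body)
    return HospitalModel(name=fields["name"], beds=fields["beds"])


model = wrangle()
print(model)
```

Four representations and three conversions later, we hold exactly the name and the bed count we started with.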

WW Two: semantics

The data is statistics about hospitals, or an autonomous car’s world model, or a travel booking. But what I have in my code is a bunch of structs and lists and maps, cleverly named, maybe, but differently arranged and named depending on which part of the application I happen to be in, and usually with different rules.

WW Three: truth

Number three is the most insidious of all. No matter how clever we get, physics always wins. We can’t deliver data faster than the speed of light. The data we have in our code is out of date. It might no longer be true. Have we controlled for that? Sometimes. We’ve added transactions or locking or other clever tricks. Other times, we only do it when we’re fixing the really evil bug that happened because we didn’t control for it well enough in the first place.

Put these three categories together, and we all spend far too much time on wrangling, while trying to fit in the work that is actually valuable: making sure that the data is correct and available to the right people and, finally, the interesting operations that make our application do something useful with it.

So how are we going to address the wrangling problem?

Imagine a world where data wrangling doesn’t have to happen.

Abstraction X

Let’s say that when we look at some data in code, with the intention of doing something useful with it — whether it’s in a browser, or a script, or a tiny subroutine — it always looks the same.

To start with, its syntax is always the same. In fact, we don’t even see the syntax. It’s just data, natively in my programming language. The semantics are always visible. It’s hospital data or a world model, cleverly named, with meaningful structure and rules. And, because of physics, we know what we’re in for. We know how close to the truth this data is. When I change it or make some more, I know whether my new data is just as true, or if I need to wait for some validation or consensus (so that consumers know what I based my decisions on).

Imagine if we were able to capture all of this in an abstraction, X. Abstractions are always the answer! When writing code, I use X, so I don’t have to do any syntax wrangling, semantics wrangling, or truth wrangling. That’s handled by someone else, who is very clever and makes X available to me.

Great! We’re done.

We’re Not Done

There are some niggles that make us wonder how X can exist in the real world. One is that “native” data structures aren’t consistent. I might love Haskell or C# or Python or Prolog (I might!). While I can probably express most of X in my language, my mileage is going to vary. A class in Java is not the same as a class in JavaScript. (Can I add a property?) Even basic lists and maps and sets are not the same. (Equality? What about null and undefined and None and empty?)

Languages do not express data structures in the same or compatible ways.
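As a small illustration, here is how just one language, Python, answers the equality and null questions. Other languages answer differently: in JavaScript, for example, `[1, 2] === [1, 2]` is false because arrays compare by identity. The `middle_name` field is a made-up example.

```python
import json

# Equality: 1, 1.0 and True all compare equal in Python,
# so a set treats them as a single member.
mixed = {1, 1.0, True}
print(len(mixed))  # 1

# Lists compare by content, not identity.
print([1, 2] == [1, 2])  # True

# Null versus missing: JSON null becomes None, and a missing key
# also reads as None through dict.get, so the two look the same...
explicit_null = json.loads('{"middle_name": null}')
missing = json.loads('{}')
print(explicit_null.get("middle_name") is missing.get("middle_name"))  # True

# ...unless the code remembers to ask a different question.
print("middle_name" in explicit_null, "middle_name" in missing)  # True False
```

Any binding of X to Python would have to decide which of these behaviours it preserves and which it hides.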

When it comes to data semantics, most programming languages do have one thing in common: they suck at it. Most of the big, popular languages are imperative. That means they are great at saying ‘do this, do that’, but they have real trouble saying ‘this makes sense’. If I look at a class in Java, it’s really hard to see all the things that are always true, and the things that need to be true before some change is allowed, and the things that are true after those changes. If I’m lucky, someone has added some comments, or assertions, or written some test cases. (Or interface specifications, if I’m really lucky.)

But these things are in the code, and that means they’re re-coded everywhere the data appears. Or not, in which case the bug takes a little while to manifest when the data arrives at somewhere that notices (if it ever does). So, how do I know that a pre-condition was checked when all I get is the post state?

Custom semantics can be hard to express: invariants, allowed operations, inferences. And they must be re-expressed wherever the data appears, or risk later failures.
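Here is a rough sketch of that problem in Python. The `Booking` class and its capacity rule are hypothetical. The invariant lives in the class, so the moment the data leaves as plain JSON, the rule stays behind.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Booking:
    """Hypothetical travel booking with one rule attached."""
    seats: int
    capacity: int

    def __post_init__(self) -> None:
        # Invariant: never hold more seats than exist.
        if not 0 <= self.seats <= self.capacity:
            raise ValueError("seats must be between 0 and capacity")

    def add_seats(self, n: int) -> None:
        # Pre-condition, checked before the change is allowed.
        if n < 0 or self.seats + n > self.capacity:
            raise ValueError("not enough capacity")
        self.seats += n


booking = Booking(seats=2, capacity=4)
booking.add_seats(1)

# Once serialized, only the post state travels. The rules do not.
wire = json.loads(json.dumps(asdict(booking)))
wire["seats"] = 99  # nothing in a plain dict objects to this
print(wire)

# The bug only shows up where some code re-expresses the rule.
try:
    Booking(**wire)
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

Any consumer that works on the dictionary directly, rather than rebuilding a `Booking`, would never notice the broken rule.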

We want this stuff underneath abstraction X, so that my code is guaranteed not to break the rules of the data. Preferably that guarantee comes at compile-time, but any time is better than never.

When it comes to truth, things are even more complicated. Languages come with baggage. Mutability. Threading. “Volatile” is a thing. This is not going well. Looks like we have a choice. On the one hand, we could reinvent programming; stop using our favorite languages, and move everyone to a language that has abstraction X built-in. But that’s a concept for another article, I think.

So let’s look at the other hand, where we back off from this idea of everything working completely natively. After all, everyone already has to deal with mapping one abstraction onto another, all the time. Let’s just assert what we want about X, and let clever programmers bind X to their language with libraries.

Requirements for Binding X to Code

Once X is bound to a language with a library, we’ll have a syntax for that language. I’d say it’s table stakes that anything that can be expressed natively is; but really that’s up to the binding library author (as they are the expert on that language).

There’s one quality that stands out, because it’s not well-supported by languages in general, and that’s having a universal address space. If we’re truly going to abstract away the vagaries of media and protocols, then we need a way to refer to some data that does not depend on it being a file on a filesystem, or a row in a database, or a memory location.
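One way to picture a universal address space is a toy registry like the one below. This is only a sketch: the `bind` and `dereference` functions and the `urn:example:` identifiers are invented for illustration, and a real binding would need far more (discovery, security, failure handling). The point is that the calling code names the data and never learns where it lives.

```python
import sqlite3
from typing import Any, Callable, Dict

# Hypothetical registry: identifier -> a function that fetches the value.
_bindings: Dict[str, Callable[[], Any]] = {}


def bind(identifier: str, loader: Callable[[], Any]) -> None:
    """Done by whoever operates the data, not by the app developer."""
    _bindings[identifier] = loader


def dereference(identifier: str) -> Any:
    """All the application code ever needs to call."""
    return _bindings[identifier]()


# Backend one: a value that happens to live in memory.
cache = {"beds": 120}
bind("urn:example:hospital:1:beds", lambda: cache["beds"])

# Backend two: a value that happens to live in a database row.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE hospital (id INTEGER, beds INTEGER)")
db.execute("INSERT INTO hospital VALUES (2, 80)")
bind(
    "urn:example:hospital:2:beds",
    lambda: db.execute("SELECT beds FROM hospital WHERE id = 2").fetchone()[0],
)

# The application sums bed counts without knowing about either backend.
identifiers = ["urn:example:hospital:1:beds", "urn:example:hospital:2:beds"]
total = sum(dereference(i) for i in identifiers)
print(total)  # 200
```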

Beyond that, we need to be able to express and enforce the data structure and rules in X so that the code we write doesn’t break the data. Again, it would be nice if we could do this at compile-time, but it’s also important at runtime. Why? Because most applications have at least some variability in their data on a per-install, per-customer, or even per-session basis. As app developers, we often try to tuck this into carefully managed corners as ‘customization’ features. I have spent a lot of time in that corner.

Data variability at runtime should be a first-class concept.

Truth is the axis for which we don’t have prior art. I haven’t come across any language that addresses it consistently and head-on. We haven’t really got a canonical way to say this data is this far away, and this much out of date, and these will be the consequences of you editing it. There are some ideas — for example, you should check out the proposed Braid HTTP extension.
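To show the kind of thing I mean, here is a minimal sketch in Python of a value that carries its own distance from the truth. The `Observed` wrapper and its fields are assumptions made up for this example, not an existing library or the Braid proposal, and it only covers staleness, not the consequences of editing.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Generic, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Observed(Generic[T]):
    """Hypothetical wrapper: a value plus what we know about its truth."""
    value: T
    observed_at: datetime  # when the source last confirmed this value
    round_trip: timedelta  # how far away the source is, as latency

    def age(self, now: datetime) -> timedelta:
        return now - self.observed_at

    def is_fresh(self, now: datetime, tolerance: timedelta) -> bool:
        return self.age(now) <= tolerance


# A fixed clock keeps the example deterministic.
now = datetime(2020, 12, 14, 12, 0, 0, tzinfo=timezone.utc)
beds = Observed(
    value=120,
    observed_at=now - timedelta(seconds=30),
    round_trip=timedelta(milliseconds=40),
)

print(beds.age(now))                                # 0:00:30
print(beds.is_fresh(now, timedelta(minutes=1)))     # True
print(beds.is_fresh(now, timedelta(seconds=10)))    # False
```

With something like this, the decision about how stale is too stale becomes an explicit part of the code, instead of a bug report later.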

And naturally, for all of this, if things are going to change — and we know they will — we need a well-defined way to cope with that.

The Consequences of X

So let’s say we’ve done it, we have X and it’s available to our choice of programming language. What would we expect to be different?

The most important thing is that app developers will spend less time wrangling data through protocols and formats and layers. A programmer only needs to learn the correspondence between their platform and X.

But engineers who like doing those things can still get their kicks by developing more and better ways for X to get from A to B, knowing that their efforts will be recognized because the data itself advertises the result.

Physical protocols will only be relevant to those choosing or optimizing them.

And let’s notice one more thing. With X, data is not flowing anymore. I don’t have to think in terms of requesting and receiving and decoding and validating and locking and (briefly) adding value, and then encoding and responding and emitting and unlocking, and am-I-done-yet?

Instead, data is just there, bathing my code, and just a de-reference of an identifier away. Sometimes it’ll take a few milliseconds to arrive, but I already know how long and I can build those expectations into my software design. And tomorrow, when someone clever has noticed that I keep asking for such-and-such data, and makes it available to me before I ask, I can progressively enhance that design.

Final Thoughts

As the author of m-ld, I have my own ideas about how to achieve X, and the Data Æther. I’ll go into those next time. But what do you think? Can we get away from the endlessly expensive but worthless wrangling of syntax, semantics, and truth?

And what would we build if we could?

Update: Let me tell you about where this vision has led me, personally. I’m a practical type, and I’m only comfortable selling hopes and dreams if I can show that important parts of the dream work in a real life…

Wonderful photos by Robert Collins on Unsplash

Software engineer, architect, author, founder. Working on live information sharing at m-ld.org.