In this blog post we're going to introduce a new data serialization format. It focuses on representing objects as a series of incremental changes, which can be streamed and emitted/applied partially over time. It could be used in scenarios such as database replication or at the application level, both to send query/response data and to stream incremental updates.
If you're interested, a proof of concept (implemented in Rust) is available on GitHub. Now let's go over its motivation and features.
It's not only about speed & size
When discussing serialization formats, most people start to obsess over performance characteristics, such as payload size or serialization/deserialization speed. While these are non-negligible - in some web services serialization can account for as much as 70% of the CPU time spent processing a request - there are many more aspects to take into account.
One of them is backward compatibility: the ability for newer versions of an app to understand data produced by older ones. It's easy to forget, since most use cases nowadays boil down to some form of JSON flying between a web server and the browser, which always keeps its clients up to date.
The situation gets a little more complicated when we start talking about native apps, which can easily go out of date. Moreover, in that case forward compatibility (the ability for older versions of an app to understand data produced by newer ones) should be taken into account, at least to some extent. This is a scenario where JSON - while still pretty OK - should be approached more cautiously. It's also a place where formats such as Google Protocol Buffers can operate with good results.
Another aspect is safety: can malicious user input cause problems, such as buffer overflows or out-of-memory issues? This usually comes from the fact that a serialized message describes a single entity (of variable size) that has to be parsed fully before continuing to the next step. In practice this is often handled by safety checks on web servers or proxies, but what if our application didn't configure any? Many mature serializers put additional limits in place - e.g. a maximum recursion depth - to quickly deny potentially malicious content. But what if we could rule out that possibility at the binary format level?
Finally, there are a lot of less common features: the ability to read individual fields without decoding the entire message, to represent recursive links between entities, to represent abstract and concrete types, algebraic data types etc.
Prefixed Entry Object Notation
PEON is a proof of concept for a streaming object serialization format, aimed at providing a handy way to exchange tree-like, schema-less object data. In essence, it's a combination of the prefix-key encoding used in databases like RocksDB with a flattened key-value representation of an object tree.
First, what do we mean by "flattened representation"? Imagine the following JSON structure:
{
  "users": [
    { "name": "Alice", "age": 23 },
    { "name": "Bob", "age": 31 }
  ]
}
We could use JSON Path notation, describing the traversal from root to leaf values, to represent this object as a series of key-value pairs:
$.users[0].age = 23
$.users[0].name = "Alice"
$.users[1].age = 31
$.users[1].name = "Bob"
Here, we also sort these paths in lexical order, since that will come in handy later.
Now you might think: "this is very inefficient". We ended up repeating a lot of data. This is where key-prefix encoding comes into play. What we do is basically prefix each entry with a number describing how many leading bytes it shares with the previous entry. With that, we only need to write down the difference between them:
0|$.users[0].age = 23
11|name = "Alice"
8|1].age = 31
11|name = "Bob"
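As a rough illustration of that step (plain Rust, not the PoC's actual encoder), here's how the shared-prefix lengths above can be computed over a lexically sorted list of paths:

/// Number of leading bytes shared by two byte strings.
fn common_prefix_len(a: &[u8], b: &[u8]) -> usize {
    a.iter().zip(b).take_while(|(x, y)| x == y).count()
}

/// Turns lexically sorted (path, value) pairs into prefix-compressed entries:
/// (shared prefix length, remaining path suffix, value).
fn prefix_compress<'a>(entries: &[(&'a str, &'a str)]) -> Vec<(usize, &'a str, &'a str)> {
    let mut prev: &str = "";
    let mut out = Vec::new();
    for &(path, value) in entries {
        let shared = common_prefix_len(prev.as_bytes(), path.as_bytes());
        out.push((shared, &path[shared..], value));
        prev = path;
    }
    out
}

fn main() {
    let sorted = [
        ("$.users[0].age", "23"),
        ("$.users[0].name", "\"Alice\""),
        ("$.users[1].age", "31"),
        ("$.users[1].name", "\"Bob\""),
    ];
    for (shared, suffix, value) in prefix_compress(&sorted) {
        println!("{shared}|{suffix} = {value}");
    }
}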
In practice, the PEON format defines paths as a binary sequence of tag-prefixed segments (either keys or indexes).
Keys are UTF-8 byte strings. However, we take advantage of one specific property: the first 32 code points of the ASCII table (everything that fits in 5 bits) are reserved for non-printable control characters (like carriage return), which don't appear in 99.99% of object field names. We use these byte values as delimiters, so that we know where one path segment starts and another ends.
Indexes are variable-length encoded integers. However, unlike the varint encoding known from protobuf, we use a length-prefix encoding: the first byte describes how many of the following bytes are used to store our variable-length integer. While this is a little heavier than protobuf's varint for small values (and incidentally more compact beyond 32 bits), it maintains a few important properties: we know how many index bytes to read ahead of time, and our keys are still lexically sortable. Example:
Let's encode 100 and 10 000 using protobuf's zigzag varint encoding (as used for sint fields). The results are 0xc801 and 0xa09c01 respectively. If we store these byte strings in lexical order, 10 000 will appear before 100, which is bad for us. If we use the length-prefix encoding instead, the serialized results are 0x0164 and 0x022710, which maintain the lexical order thanks to the leading length byte.
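A minimal sketch of such a length-prefix index encoding (illustrative Rust; the actual tag layout in PEON may differ):

/// Encodes an index as a length byte followed by the value in big-endian order.
/// Big-endian bytes behind a leading length byte keep the lexical sort order.
fn encode_index(value: u64) -> Vec<u8> {
    let bytes = value.to_be_bytes();
    // Skip leading zero bytes; keep at least one byte even for 0.
    let skip = bytes.iter().take_while(|&&b| b == 0).count().min(7);
    let mut out = Vec::with_capacity(9);
    out.push((8 - skip) as u8); // number of value bytes that follow
    out.extend_from_slice(&bytes[skip..]);
    out
}

fn main() {
    assert_eq!(encode_index(100), vec![0x01, 0x64]);
    assert_eq!(encode_index(10_000), vec![0x02, 0x27, 0x10]);
    // The leading length byte guarantees 100 sorts before 10 000.
    assert!(encode_index(100) < encode_index(10_000));
}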
Since key and index path segments are delimited by different tags, it's possible for an object to contain both, e.g.:
$.div.class = "bg-gray-400 p-2 table"
$.div[0].div.class = "table-caption"
$.div[0].div[0] = "Description"
This may be useful for describing object trees that conform not only to the usual JSON-like shape, but also to others, like XML data structures.
Finally, since we're using a flattened representation, we can easily represent not only trees, but also entire graphs with recursive navigation properties: the key/path of one entry can be used as the value of another, e.g.:
$.users['Alice'].friends[0] = $.users['Bob']
Header format
With some basic ideas laid out, we can describe how the format works. Each entry starts with a 6-byte header, which consists of 3 fields:
- u15 - length of the key
- u15 - length of the common prefix between the current and previous key
- u16 - length of the value
We reserve the highest bit of the key and prefix fields for special purposes in the extended format, since 32KiB for a property path should be enough for any reasonable document (xD): if it's not, then 64KiB probably won't be enough either.
As you may notice, the length of a single value is limited to 64KiB. This is by design - we can guarantee that no payload will require more than a ~96KiB buffer, preventing a potentially malicious payload from overflowing our deserializer. We'll cover how to deal with values greater than 64KiB later.
Future-proofing
As mentioned, we took one bit from each of the key and prefix fields of the header to use for special purposes.
The highest key bit is used to mark an extension entry. It's not used in the current format definition, but one could imagine it being used for special purposes like snapshot checkpoints.
The highest common-prefix bit is used to mark whether an entry is optional or mandatory. Optional entries are free to be skipped over by older versions of serializers without errors. The purpose of this is to keep our format forward compatible. We don't know what the future holds, but this way we can introduce changes and features while still allowing different versions of PEON serializers to talk to each other.
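To make the layout more concrete, here's a minimal sketch of parsing such a header in Rust; the field order, endianness and exact bit placement are my assumptions for illustration, not something the PoC prescribes:

/// Parsed PEON entry header (6 bytes on the wire).
struct EntryHeader {
    key_len: u16,    // 15-bit key length
    prefix_len: u16, // 15-bit shared-prefix length
    value_len: u16,  // 16-bit value length
    extension: bool, // highest key bit: extension entry
    optional: bool,  // highest prefix bit: entry may be skipped by older readers
}

fn parse_header(bytes: [u8; 6]) -> EntryHeader {
    // Assumption: three little-endian u16 fields, in key | prefix | value order.
    let key = u16::from_le_bytes([bytes[0], bytes[1]]);
    let prefix = u16::from_le_bytes([bytes[2], bytes[3]]);
    let value = u16::from_le_bytes([bytes[4], bytes[5]]);
    EntryHeader {
        key_len: key & 0x7FFF,
        prefix_len: prefix & 0x7FFF,
        value_len: value,
        extension: key & 0x8000 != 0,
        optional: prefix & 0x8000 != 0,
    }
}

fn main() {
    let h = parse_header([0x0E, 0x00, 0x0A, 0x80, 0x20, 0x00]);
    assert_eq!((h.key_len, h.prefix_len, h.value_len), (14, 10, 32));
    assert!(h.optional && !h.extension);
}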
Chunking
As mentioned above, we only allow individual entries to carry values up to 64KiB in size. However, PEON allows bigger frames to be spread over multiple entries. Such entries use keys finalised with continuation tags:
$.video.attributes.title = "Old Boy"
$.video.stream[0:] = <..64KiB frame..>
$.video.stream[65536:] = <..64KiB frame..>
$.video.stream[131072:] = <..64KiB frame..>
(You can also notice that this format can be used to encode both binary media streams and their metadata attributes.)
A continuation tag informs us that the value field doesn't hold a single element, but rather a range of bytes, while the final index in the entry path tells us the start position of that range.
This way we not only prevent potential overflow attacks, but can also implement:
- suspend/resume download semantics known from protocols like BitTorrent
- equivalents of HTTP Range requests
- ability to query or update individual parts of paged files, such as the ones used by databases
- emit multiple file streams in an interleaved fashion
While the PEON format only limits the maximal entry size, you can still chunk values into pieces small enough that individual entries fit within the limits of the transmission medium (e.g. AWS API Gateway uses a 32KiB limit, while WebRTC cross-browser data channel messages can be limited to as little as 16KiB).
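As a rough sketch of the producer side (hypothetical helper code, not the PoC's API), chunking a large value down to a transport-friendly size could look like this:

/// One logical chunk of a large value: the byte offset it starts at plus its payload.
struct Chunk<'a> {
    start: usize,
    data: &'a [u8],
}

/// Splits a large value into continuation chunks no bigger than `chunk_size`
/// (which must stay within PEON's 64KiB per-entry value limit).
fn chunk_value(value: &[u8], chunk_size: usize) -> Vec<Chunk<'_>> {
    value
        .chunks(chunk_size)
        .enumerate()
        .map(|(i, data)| Chunk { start: i * chunk_size, data })
        .collect()
}

fn main() {
    let frame = vec![0u8; 200_000];
    // 16KiB chunks, e.g. to stay below WebRTC data-channel limits.
    for chunk in chunk_value(&frame, 16 * 1024) {
        // Each chunk would become one entry, e.g. $.video.stream[<start>:] = <data>.
        println!("$.video.stream[{}:] = <{} bytes>", chunk.start, chunk.data.len());
    }
}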
Streamed serialization
You've probably already noticed that we don't treat serialized objects as one big blob of data that must be read as a whole. Since we effectively operate on a stream of key-value entries, we can process it as such.
One of the advantages of such an approach is that we can use PEON as an update format streamed continuously over e.g. a WebSocket, a WebRTC data channel, a gRPC stream etc.
You don't need to limit yourself to reading/writing full objects; you can also stream incremental changes. From the PEON format's point of view there's no difference between creating an element with one entry and updating that element with another entry (or clearing/deleting its value) by repeating the path of the modified field later on in the stream.
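As an illustration, a consumer could keep its materialised view as a simple sorted map of flattened paths and fold entries into it as they arrive: a repeated path overwrites the previous value, and an absent value acts as a deletion (a sketch of the idea, not the PoC's actual API):

use std::collections::BTreeMap;

/// Applies a stream of (path, value) entries to a document state.
/// A repeated path overwrites the previous value; `None` deletes it.
fn apply_updates<'a>(
    doc: &mut BTreeMap<String, Vec<u8>>,
    entries: impl Iterator<Item = (&'a str, Option<&'a [u8]>)>,
) {
    for (path, value) in entries {
        match value {
            Some(v) => { doc.insert(path.to_string(), v.to_vec()); }
            None => { doc.remove(path); }
        }
    }
}

fn main() {
    let mut doc = BTreeMap::new();
    let stream: Vec<(&str, Option<&[u8]>)> = vec![
        ("$.users[0].name", Some(b"Alice".as_slice())),
        ("$.users[0].age", Some(b"23".as_slice())),
        ("$.users[0].age", Some(b"24".as_slice())), // incremental update
        ("$.users[0].name", None),                  // deletion
    ];
    apply_updates(&mut doc, stream.into_iter());
    assert_eq!(doc.get("$.users[0].age").map(|v| v.as_slice()), Some(b"24".as_slice()));
    assert!(doc.get("$.users[0].name").is_none());
}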
Filtering
As we operate on a stream of deltas, which we use to incrementally construct the state of our object, we can just as well selectively filter which of them should be incorporated into the final document.
Just as we've used JSON Path to present entry paths of the document, we can use the same query language to evaluate and filter objects from our stream. Even better: since we know the maximum path length (32KiB) and operate on flattened paths, implementing the evaluation is easier, faster and less error prone than building a regular JavaScript JSON Path evaluator.
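In its simplest form - a plain path selector without wildcards - such filtering boils down to a byte-prefix check over the flattened paths, without ever materialising a document tree. A hedged sketch (a real evaluator would also support wildcards, slices etc.):

/// Keeps only entries whose flattened path starts with the given selector prefix,
/// e.g. "$.users[0]" to extract a single user from a bigger stream.
fn filter_by_prefix<'a>(
    entries: impl Iterator<Item = (&'a str, &'a str)> + 'a,
    selector: &'a str,
) -> impl Iterator<Item = (&'a str, &'a str)> + 'a {
    entries.filter(move |(path, _)| path.starts_with(selector))
}

fn main() {
    let stream = vec![
        ("$.users[0].age", "23"),
        ("$.users[0].name", "\"Alice\""),
        ("$.users[1].age", "31"),
    ];
    let alice: Vec<_> = filter_by_prefix(stream.into_iter(), "$.users[0]").collect();
    assert_eq!(alice.len(), 2);
}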
Stream filtering is a pretty universal operation, which we can use for partial object construction. This means that we're not limited by the overall size of our object. It could potentially be bigger than our memory. It could even be bigger than our database: we could reconstruct it dynamically on demand, in a similar way to how GraphQL describes an entire system domain starting from one root object. Nothing prevents you from representing your entire application data as a single object.
Another use case is filtering for the purpose of partial replication: as we can stream changes over the wire to other machines, we can decide which parts of our object should be kept public (replicated), and which should stay private.
Using PEON to represent schemas
For our proof of concept we picked a more straightforward approach, where we don't define any kind of schema. However, this format could technically serve as a binary interface for a schema-oriented serializer. As we can define members as keys or indices, we could use the latter to represent the tag identifiers that many serializers use, e.g. a Thrift-like schema:
struct Root {
  1: list<User> users
}
struct User {
  1: string name,
  2: i32 age
}
could be represented as:
1.0.1 = "Alice" // [1:users].[0:index].[1:name]
1.0.2 = 23 // [1:users].[0:index].[2:age]
1.1.1 = "Bob" // [1:users].[1:index].[1:name]
1.1.2 = 31 // [1:users].[1:index].[2:age]
This could further reduce the size of our entries. Since PEON itself doesn't require indices to arrive in order - or to arrive at all (which is useful for skipping fields or using default values) - it's really up to us how to interpret the index data.
Summary
This is just a proof of concept. PEON is not a typical serializer specification (e.g. we didn't specify how to encode values); it can be understood more as a binary interface that could be adapted to specific needs.
The demo present in the repository shows a schema-less variant (which basically flattens the JSON structure). The size of the serialized example can be a bit bigger or a bit smaller than the equivalent JSON - the deeper and more complex the JSON document's tree structure, the smaller the PEON binary is compared to the original JSON. But that's not the point.
Our goal was to make a format that can be operated on quickly, without the need to fully deserialize it into a final object representation. Thanks to that we don't need to do any of the allocations that usually accompany building object graphs, or worry about object size. With that in mind, I could deserialize 4.5MB of data split into 79200 entries in 855.917µs, using a basic proof of concept implementation.
While this hasn't been implemented in the PoC yet, there's potential to speed up path filtering even more. Since in its raw form each path starts with the number of bytes it has in common with the previous path, we could use that to our advantage by caching how many bytes of the path matched the filter last time. Then, when checking the next path, we could continue from the last matching byte position instead of comparing the full path against the filter query every time.
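A rough sketch of that idea (again, not part of the PoC): keep the number of bytes of the previous path that matched the filter prefix, plus whether the previous path definitely diverged from it, and resume the comparison from that point for every new entry:

/// Incremental filter state: how far the previous path matched the filter
/// prefix, and whether it definitely diverged from it.
struct PrefixFilter<'a> {
    filter: &'a [u8],
    matched: usize, // leading bytes of the previous path equal to the filter
    diverged: bool, // previous path differed from the filter at byte `matched`
}

impl<'a> PrefixFilter<'a> {
    fn new(filter: &'a [u8]) -> Self {
        Self { filter, matched: 0, diverged: false }
    }

    /// `shared` is the prefix length the entry shares with the previous one
    /// (read straight from the header), `suffix` is only the changed tail.
    fn matches(&mut self, shared: usize, suffix: &[u8]) -> bool {
        if self.diverged && shared > self.matched {
            // The new path repeats the byte that already failed the filter.
            return false;
        }
        // Bytes 0..min(matched, shared) are known to equal the filter;
        // resume the comparison there instead of re-scanning the whole path.
        self.matched = self.matched.min(shared);
        self.diverged = false;
        for &b in suffix {
            if self.matched == self.filter.len() {
                break; // whole filter prefix already matched
            }
            if b != self.filter[self.matched] {
                self.diverged = true;
                return false;
            }
            self.matched += 1;
        }
        self.matched == self.filter.len()
    }
}

fn main() {
    let mut f = PrefixFilter::new(b"$.users[0]");
    assert!(f.matches(0, b"$.users[0].age"));
    assert!(f.matches(11, b"name"));   // filter already fully matched, nothing re-compared
    assert!(!f.matches(8, b"1].age")); // diverges at byte 8: '1' vs '0'
}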