Efficient JSON-compatible binary format: "Smile"

"Smile" is codename for an efficient JSON-compatible binary data format, initially developed by Jackson JSON processor project team. Its logical data model is same as that of JSON, so it can be considered a "Binary JSON" format.

For the design of "Project Smile", which implements support for this format, see the Smile Format design goals page.

For usage with the Jackson processor, see the Jackson SMILE usage page.

This page covers the current data format specification, which is planned to eventually be standardized through a formal process (most likely as an IETF RFC).

Document version

Version history

Update history

External Considerations

MIME Type

There is no formal or official MIME type registered for Smile content, but the current best practice (as of July 2011) is to use:

application/x-jackson-smile

since this is used by multiple existing projects.

High-level format

At a high level, content encoded using this format consists of a simple sequence of sections, each of which consists of:

Header consists of:

The first 2 bytes form a simple smiley and the 3rd byte is a (Unix) linefeed. This makes identification with command-line tools simple; the choice of bytes is not significant beyond visual appearance. The fourth byte contains a minimal versioning marker and additional configuration bits.
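The header check described above can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes the smiley is the ASCII pair ":)" (0x3A 0x29), which this page does not spell out, and the helper name split_smile_header is hypothetical. The fourth byte is returned as-is, since its bit layout is not interpreted here.

```python
# Assumption: the two-byte "smiley" is ASCII ":)" followed by a Unix linefeed.
SMILE_MAGIC = b":)\n"


def split_smile_header(data: bytes):
    """Return the fourth header byte if `data` starts with a Smile header.

    Returns None when the header is absent or the input is too short.
    The returned byte carries the versioning marker and configuration
    bits; this sketch does not decode it further.
    """
    if len(data) < 4 or data[:3] != SMILE_MAGIC:
        return None
    return data[3]
```

Because the first three bytes are printable text plus a linefeed, the same check also works visually when the start of a file is dumped on the command line.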

Low-level Format

Each section described above consists of a set of tokens that form a properly nested JSON value. Tokens are used in two basic modes: value mode (in which tokens are "value tokens") and property-name mode (in which tokens are "key tokens"). Property-name mode is used within JSON Object values to denote property names, with name tokens alternating with value tokens.

Token lengths vary from a single byte (most common) to 9 bytes. In each case, the first byte determines the type, and additional bytes are used if and as indicated by the type byte. Type byte value ranges overlap between value and key tokens, but not all type bytes are legal in both modes.

Use of certain byte values is limited:

Tokens: general

Some general notes on tokens:

Tokens: value mode

Value mode is the default mode for tokens in the main-level ("root") output context and in JSON Array context. It is also used between JSON Object property name tokens (see next section).

Conceptually, tokens are divided into 8 classes, with the class defined by the 3 MSB of the first byte:

These token classes are described below.
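As an illustration of the split by the three most significant bits, the sketch below masks the first byte with 0xE0 and looks up the class. The class names are taken from the section headings on this page; the function and table names are hypothetical, and the code does not check which individual byte values within a class are actually in use.

```python
# Class prefix (3 MSB of the first byte) mapped to the class names used
# in the section headings of this page.
VALUE_TOKEN_CLASSES = {
    0x00: "Short Shared Value String reference",
    0x20: "Simple literals, numbers",
    0x40: "Tiny ASCII",
    0x60: "Small ASCII",
    0x80: "Tiny Unicode",
    0xA0: "Small Unicode",
    0xC0: "Small integers",
    0xE0: "Misc; binary / text / structure markers",
}


def value_token_class(first_byte: int) -> str:
    """Return the value-mode token class for the first byte of a token."""
    if not 0 <= first_byte <= 0xFF:
        raise ValueError("not a byte value")
    if first_byte == 0x00:
        # 0x00 is not used as a value type token.
        raise ValueError("0x00 is not a valid value token")
    return VALUE_TOKEN_CLASSES[first_byte & 0xE0]
```

A table lookup keyed on the masked byte keeps the dispatch to a single operation, which matches the intent of putting the type information in the first byte.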

Token class: Short Shared Value String reference

Prefix: 0x00; covers byte values 0x01 - 0x1F (0x00 not used as value type token)

Token class: Simple literals, numbers

Prefix: 0x20; covers byte values 0x20 - 0x3F, although not all values are used

Rest of the possible values are reserved for future use and not used currently.

Token classes: Tiny ASCII, Small ASCII

Prefixes: 0x40 / 0x60; covers all byte values between 0x40 and 0x7F.

Token classes: Tiny Unicode, Small Unicode

Prefixes: 0x80 / 0xA0; covers all byte values between 0x80 and 0xBF; except that 0x80 is not encodable (since there is no 1 byte long multi-byte-character String)

Token class: Small integers

Prefix: 0xC0; covers byte values 0xC0 - 0xDF, all values used.

Token class: Misc; binary / text / structure markers

Prefix: 0xE0; covers byte values 0xE0 - 0xEF, 0xF8 - 0xFF: 0xF8 - 0xFF not used with this format version (reserved for future use)

Note, too, that value 0x36 could be viewed as "real" END_OBJECT; but is not included here since it is only encountered in "key mode" (where you either get a key name, or END_OBJECT marker)

This class is further divided into 8 sub-sections, using the value of bits #2, #3 and #4 (0x1C) as follows:

Tokens: key mode

Key mode tokens are only used within JSON Object values; there, they alternate with value tokens (first a key token, followed by either a single-token value or a multi-token JSON Object/Array value). A single token denotes the end of a JSON Object value; all other tokens are used for expressing JSON Object property names.

Most tokens are single byte: exceptions are 2-byte "long shared String" token, and variable-length "long Unicode String" tokens.

Byte ranges are divided into 4 main sections (64 byte values each):

Resolved Shared String references

Shared Strings refer to already encoded/decoded key names or value strings. The method used for indicating which of "already seen" String values to use is designed to allow for:

The mechanism for resolving value string references differs from that used for key name references, so the two are explained separately below.

Shared value Strings

Support for shared value Strings is optional, in that the generator can choose to either check for shareable value Strings or omit the checks. The format header indicates which option the generator chose: if the header is missing, the default value of "false" (no checks done for shared value Strings; no back-references exist in encoded content) must be assumed.

One basic limitation is that the encoded byte length of a String value that can be referenced is 64 bytes or less. Longer Strings cannot be referenced. This is done as a performance optimization, as longer Strings are less likely to be shareable, and also because equality checks for longer Strings are more costly. As a result, the parser should only keep references for eligible Strings during parsing.

The reference length allowed by the format is 10 bits, which means that an encoder can refer back to at most 1024 potentially shareable (referenceable) value Strings at a time.

For both encoding (writing) and decoding (parsing), the same basic sliding-window algorithm is used: when a potentially eligible String value is to be written, the generator can check whether it has already written such a String and has retained a reference. If so, the reference value (between 0 and 1023) can be written instead of the String value. If no such String has been written (as per the generator's knowledge; it is not required to even check this), the value itself is written. If its encoded length indicates that it is indeed shareable (which cannot be known before writing, as the check is based on byte length, not character length!), the generator is to add the value into its shareable String buffer, as long as the buffer does not already hold 1024 values. If it already has 1024 values, it MUST clear out the buffer and start from the first entry. This means that reference values are NOT relative back references, but rather offsets from the beginning of the reference buffer.

Similarly, the parser has to keep track of decoded short (byte length <= 64 bytes) Strings seen so far, and have a buffer of up to 1024 such values; clearing out the buffer when it fills is done the same way as during content generation. Any shared string value references are resolved against this buffer.

Note: when a shared String is written or parsed, no entry is added to the shared value buffer (since one must already be in it)
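The buffer handling described above might be sketched as follows. This is a simplified model under stated assumptions, not the Jackson implementation: the class and method names are hypothetical, byte length is measured as UTF-8 length, and any special cases not described on this page are ignored.

```python
MAX_SHARED_ENTRIES = 1024      # references are 10 bits wide: 0..1023
MAX_SHARED_BYTE_LENGTH = 64    # longer Strings are never referenceable


class SharedStringBuffer:
    """Model of the shared value String buffer kept by encoder and decoder."""

    def __init__(self):
        self._values = []   # position in this list is the reference value
        self._index = {}    # String -> most recent position (encoder side)

    def find(self, text):
        """Encoder side: return an existing reference for `text`, or None."""
        return self._index.get(text)

    def record(self, text):
        """Call after a String was written or parsed in full (not as a reference).

        Returns the reference assigned to it, or None if it is not shareable.
        """
        # Eligibility is based on encoded byte length, not character count.
        if len(text.encode("utf-8")) > MAX_SHARED_BYTE_LENGTH:
            return None
        if len(self._values) == MAX_SHARED_ENTRIES:
            # Buffer is full: clear it and start again from the first entry.
            self._values.clear()
            self._index.clear()
        self._values.append(text)
        reference = len(self._values) - 1
        self._index[text] = reference
        return reference

    def resolve(self, reference):
        """Decoder side: look up a reference as an offset from the buffer start."""
        return self._values[reference]
```

Both sides call record() for every literal String they write or read, and never for a reference, so the encoder's and decoder's buffers stay in step without any extra signalling.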

Shared key name Strings

Support for shared property names is optional, in that the generator can choose to either check for shareable property names or omit the checks. The format header indicates which option the generator chose: if the header is missing, the default value of "true" (checking for shared property names is done, and encoded content MAY contain back-references to shared names) must be assumed.

Shared key resolution is done the same way as shared String value resolution, but the buffers used are separate. Buffer sizes are the same, 1024.


Future improvement ideas

NOTE: version 1.0 will NOT support any of features presented in this section; they are documented as ideas for future work.

In-frame compression?

Although there were initial plans to allow in-frame (in-content) compression for individual values, it was decided that support would not be added for initial version, mostly since it was felt that compression of the whole document typically yields better results. For some use cases this may not be true, however; especially when semi-random access is desired.

Since enough type bits were left reserved for binary and long-text types, support may be added for future versions.

Longer length-prefixed data?

Given that encoders may be able to determine byte-length for value strings longer than 64 bytes (current limit for "short" strings), it might make sense to add value types with a 2-byte prefix (or maybe just a 1-byte prefix and additional length information after the first fixed 64 bytes, since that allows output at a constant location). Performance measurements should be made to ensure that such an option would improve performance, as that would be the main expected benefit.

Pre-defined shared values (back-refs)

For messages with little redundancy, but small set of always used names (from schema), it would be possible to do something similar to what deflate/gzip allows: defining "pre-amble", to allow back-references to pre-defined set of names and text values. For example, it would be possible to specify 64 names and/or shared string values for both serializer and deserializer to allow back-references to this pre-defined set of names and/or string values. This would both improve performance and reduce size of content.

Filler value(s)

It might make sense to allocate a "no-op" value or values to allow for padding of messages. This would be useful for things like:

This would be a simple addition.

Chunked values

(note: inspiration for this came from CBOR format)

As an alternative to requiring either full content length (binary data) or an end marker (long Strings, Objects, arrays), and specifically to allow better buffering during encoding, it might make sense to allow "chunked" variants wherein long content is encoded in chunks, the size of each being individually indicated with a length prefix, but whose total size need not be calculated. This would work well for including large data incrementally, and it could also allow for more efficient and flexible decoding.


Appendix: External definitions

ZigZag encoding for VInts

Smile uses ZigZag encoding (defined for protobuf format, see this example), which is a variant of generic VInts (Variable-length INTegers).

Encoding is done logically as a two-step process:

  1. Use ZigZag encoding to convert signed values to unsigned values: essentially this will "move the sign bit" to the LSB.

  2. Encode remaining bits of the unsigned integral number, starting with the most significant bits: the last byte is indicated by setting the sign bit (the most significant bit of the byte); all the other bytes have the sign bit clear.
    • Last byte has only 6 data bits; second-highest bit MUST be clear (to ensure that value 0xFF is never used for encoding; values 0xC0 - 0xFF are not used for the last byte).
    • Other bytes have 7 data bits.
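The two steps can be sketched as below. This is an illustrative sketch rather than the reference implementation: the function names are hypothetical, and Python's unbounded integers are used, so no 32-bit or 64-bit width limit is enforced.

```python
def zigzag_encode(value: int) -> int:
    """Step 1: map a signed integer to an unsigned one (sign goes to the LSB)."""
    return (value << 1) if value >= 0 else ((-value) << 1) - 1


def zigzag_decode(unsigned: int) -> int:
    """Inverse of zigzag_encode."""
    return (unsigned >> 1) ^ -(unsigned & 1)


def vint_encode(unsigned: int) -> bytes:
    """Step 2: emit most significant bits first; the last byte has its MSB set."""
    if unsigned < 0:
        raise ValueError("expected an unsigned value")
    # Last byte: MSB set, second-highest bit clear, 6 data bits.
    out = [0x80 | (unsigned & 0x3F)]
    unsigned >>= 6
    # Preceding bytes: MSB clear, 7 data bits each.
    while unsigned > 0:
        out.append(unsigned & 0x7F)
        unsigned >>= 7
    out.reverse()
    return bytes(out)


def vint_decode(data: bytes) -> int:
    """Decode one VInt from the start of `data` and return the unsigned value."""
    value = 0
    for byte in data:
        if byte & 0x80:
            if byte & 0x40:
                raise ValueError("0xC0 - 0xFF is not valid as the last byte")
            return (value << 6) | (byte & 0x3F)
        value = (value << 7) | byte
    raise ValueError("missing terminating byte")
```

For example, vint_encode(zigzag_encode(-1)) yields the single byte 0x81, and a value needing more than 6 bits after ZigZag conversion, such as 32 (ZigZag value 64), takes two bytes: 0x01 0x80.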



SmileFormatSpec (last edited 2014-05-21 20:02:45 by TatuSaloranta)

Copyright ©2009 FasterXML, LLC