I want to log the position of the pointer as it moves across the screen and persist this to storage in a space-efficient manner.
This can be represented by a series of datapoints, each consisting of a timestamp (milliseconds since the start of the recording) and a coordinate (the x and y screen position of the pointer). For simplicity, let’s consider each number an integer.
{
timestamp: 188,
coord: {
x: 201,
y: 543
}
}
How much space does this take? In JavaScript, every number uses 8 bytes and since I have three numbers, every datapoint has 3 × 8 = 24 bytes.
Wait a minute, why does a number take up so much space? Because each one is a double-precision (64-bit) floating-point number in IEEE 754 format. There are many resources explaining this standard, so I won’t go into the details here, but here is the number 188 in its full 64-bit glory:
    0100000001100111100000000000000000000000000000000000000000000000
    │└────┬────┘└───────────────────────┬──────────────────────────┘
    │  exponent                    significand
    │
    sign
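To see those bits yourself, here is a quick sketch using a `DataView` (the variable names are my own):

```javascript
// Write 188 into an 8-byte buffer as a float64, then read the raw bits.
const buf = new ArrayBuffer(8);
const view = new DataView(buf);
view.setFloat64(0, 188); // big-endian by default

let bits = "";
for (let i = 0; i < 8; i++) {
  bits += view.getUint8(i).toString(2).padStart(8, "0");
}
console.log(bits);
// → 0100000001100111100000000000000000000000000000000000000000000000
```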
JSON: Persisting#
Oh, what’s this handy JSON.stringify method? I can use this to serialize my array of datapoints and be done with it. Here are 10 datapoints:
[{"timestamp":320,"coord":{"x":55,"y":43}},{"timestamp":362,"coord":{"x":55,"y":44}},{"timestamp":476,"coord":{"x":55,"y":45}},{"timestamp":526,"coord":{"x":55,"y":44}},{"timestamp":576,"coord":{"x":56,"y":44}},{"timestamp":627,"coord":{"x":56,"y":45}},{"timestamp":768,"coord":{"x":56,"y":45}},{"timestamp":852,"coord":{"x":58,"y":45}},{"timestamp":986,"coord":{"x":59,"y":45}},{"timestamp":1254,"coord":{"x":62,"y":46}}]
Great! This encoding produced a string of length 422, which is 422 bytes. How much space did I save compared to the 24 bytes per datapoint I started with?
    compression_ratio = 1 − (compressed / uncompressed)
                      = 1 − (422 / (24 × 10))
                      ≈ −0.758
Oh no, it’s negative. I used more space than the input. Let’s try to do better.
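For the record, here is that calculation as a tiny helper function (the naming is my own):

```javascript
// compression_ratio = 1 − (compressed / uncompressed)
const compressionRatio = (compressed, uncompressed) =>
  1 - compressed / uncompressed;

console.log(compressionRatio(422, 24 * 10).toFixed(3)); // → -0.758
```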
String: Who Needs Readability?#
My first serialization attempt is quite readable. I can easily tell where a datapoint starts and ends, and what each number corresponds to. But computers don’t need readability. I can pack my data into a blob, and as long as I can tell the computer how to unpack (deserialize) it back, I’m happy.
I’ll store the useful data in a string as a series of the three numbers in the format timestamp0 x0 y0 timestamp1 x1 y1 .... I still need the spaces in between the numbers so I know where numbers start and end.
To decode, I read three numbers at a time, delimited by spaces to reconstruct a datapoint.
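Here’s a sketch of that scheme (the function names are my own invention):

```javascript
// Serialize datapoints as "t0 x0 y0 t1 x1 y1 ...", space-delimited.
function encodeString(datapoints) {
  return datapoints
    .map((d) => `${d.timestamp} ${d.coord.x} ${d.coord.y}`)
    .join(" ");
}

// Read back three numbers at a time to reconstruct each datapoint.
function decodeString(blob) {
  const nums = blob.split(" ").map(Number);
  const datapoints = [];
  for (let i = 0; i < nums.length; i += 3) {
    datapoints.push({
      timestamp: nums[i],
      coord: { x: nums[i + 1], y: nums[i + 2] },
    });
  }
  return datapoints;
}
```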
Let’s introduce a modest 250 datapoints as a test dataset to keep track of my enhancements. Encoding this dataset using the JSON and String encodings, here’s what I have so far:
| Encoding | Bytes | Compression Ratio |
|---|---|---|
| (input) | 6000 | 0% |
| JSON | 11213 | -86.9% |
| String | 3212 | 46.5% |
From using a lot more space to saving 46.5%, that’s a fantastic improvement. Can I do better?
DeltaString: What’s the Difference?#
While timestamps and coordinates can easily go into the thousands, the difference between two consecutive datapoints seems to remain pretty small. This makes sense. The computer is trying to capture pointer movements very quickly, so there shouldn’t be a big difference between timestamps. Also, the pointer shouldn’t have moved a great deal within such a short duration.
To take advantage of that, I can store the difference in values between consecutive datapoints. Instead of encoding 200 100 50 and then 216 101 55, I can store 200 100 50 (I need to know the starting point) and then 16 1 5. The second datapoint is now a string of length 6 instead of 10, already a saving of 4 bytes.
When I decode the blob, I take the first datapoint as is, and then construct the next datapoint by adding the next set of deltas. Easy.
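A sketch of the delta variant, built the same way (names are mine):

```javascript
// Store the first datapoint as-is, then only the differences.
function encodeDeltaString(datapoints) {
  const parts = [];
  let prev = { timestamp: 0, coord: { x: 0, y: 0 } };
  for (const d of datapoints) {
    parts.push(
      d.timestamp - prev.timestamp,
      d.coord.x - prev.coord.x,
      d.coord.y - prev.coord.y
    );
    prev = d;
  }
  return parts.join(" ");
}

// Accumulate deltas to reconstruct the original datapoints.
function decodeDeltaString(blob) {
  const nums = blob.split(" ").map(Number);
  const datapoints = [];
  let t = 0, x = 0, y = 0;
  for (let i = 0; i < nums.length; i += 3) {
    t += nums[i];
    x += nums[i + 1];
    y += nums[i + 2];
    datapoints.push({ timestamp: t, coord: { x, y } });
  }
  return datapoints;
}
```

With the example from above, `200 100 50` followed by `216 101 55` encodes to `"200 100 50 16 1 5"`.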
| Encoding | Bytes | Compression Ratio |
|---|---|---|
| (input) | 6000 | 0% |
| JSON | 11213 | -86.9% |
| String | 3212 | 46.5% |
| DeltaString | 1755 | 70.8% |
Vint: What’s in a Byte?#
Whenever I need to store a two-digit number, I need to use up two bytes. To understand that, let’s see what the string “16” looks like.
| char | 1 | 6 |
|---|---|---|
| char code | 49 | 54 |
| binary | 00110001 | 00110110 |
The character '1' corresponds to a character code of 49, which is then stored in a byte with the binary 00110001. Hold on a second, I’m storing 49, a two-digit number, in one byte! That’s good, but with one byte, I only have 8 bits to use, so I can only store up to the number 2^8 − 1 = 255. I need a way to store bigger numbers if they arise.
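A quick check in the console:

```javascript
// Character codes of the string "16" and their binary forms.
const codes = "16".split("").map((c) => c.charCodeAt(0));
console.log(codes); // → [ 49, 54 ]

const binary = codes.map((n) => n.toString(2).padStart(8, "0"));
console.log(binary); // → [ '00110001', '00110110' ]
```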
Instead of using up all 8 bits in a byte for the number I want to store, I can special-case the first bit to indicate whether the next byte is also part of the current number. This way, I can represent arbitrarily large numbers. This is a variable-length integer encoding (vint), because instead of every number taking 8 bytes (as with double-precision numbers), each number takes a variable number of bytes, depending on how large it is.
How do I represent the number 1721, represented in binary by the 11 bits 11010111001?
    10001101 00111001
    │   └┬─┘ │└──┬──┘
    │  first │  last
    │ 4 bits │ 7 bits
    │        │
    signifies  signifies
    more bytes  last byte
In the first byte 10001101, I set the first bit to 1, meaning that the next byte is also part of this number. The next byte 00111001 starts with a 0, so this is the last byte of this number. Here, it is important to “right-align” the bits, otherwise I wouldn’t know which bit is the last one.
I will need to be able to store negative numbers since I’m storing coordinate deltas. One common way to achieve this would be to use two’s complement representation, meaning that the first bit of the number’s representation signifies its sign — 0 for positive, 1 for negative. This creates some edge conditions I need to be careful of. For example, if I want to encode 65 (1000001 in binary), the vint-encoded byte 01000001 would be incorrect.
    01000001
    ││
    │└ first bit of data is 1, therefore negative
    └ signifies last byte
Instead, this series of bits in vint-encoding actually represents the negative number -63. To properly encode 65, I need to make sure the first bit of data is 0. There is no space left in this byte, so I need to use another byte for that. The correct encoding for 65 is therefore:
    10000000 01000001
    ││       │
    ││       └ signifies last byte
    │└ first bit of data is 0, therefore positive
    └ signifies more bytes
To reconstruct a number encoded in this way:
- Note the first bit in the byte to know if I need to look at more bytes.
- Append the next seven bits as data.
- Repeat (1) and (2) until the last byte is processed.
- Interpret the resultant data bits as two’s complement representation.
Going back to the assignment, I can encode datapoints with the same idea as DeltaString, except instead of storing the numbers in a string, I will vint-encode them and append the byte(s) to a blob.
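Here’s a sketch of this vint scheme, operating on plain arrays of byte values (the names and the array-of-bytes representation are my own; a real implementation would likely write into a Uint8Array):

```javascript
// Encode a (possibly negative) integer as big-endian 7-bit groups,
// two's complement, with the high bit of each byte as a continuation flag.
function vintEncode(n) {
  // Find how many 7-bit groups we need for a two's-complement fit.
  let bits = 7;
  while (n < -(2 ** (bits - 1)) || n > 2 ** (bits - 1) - 1) bits += 7;
  // Two's-complement value over `bits` bits.
  let value = n < 0 ? 2 ** bits + n : n;
  const bytes = [];
  for (let i = bits - 7; i >= 0; i -= 7) {
    const group = Math.floor(value / 2 ** i) % 128;
    // Set the continuation bit on every byte except the last.
    bytes.push(i > 0 ? 0b10000000 | group : group);
  }
  return bytes;
}

// Decode one vint starting at `offset`; return the value and next offset.
function vintDecode(bytes, offset = 0) {
  let value = 0;
  let bits = 0;
  let i = offset;
  while (true) {
    const b = bytes[i++];
    value = value * 128 + (b & 0b01111111); // append 7 data bits
    bits += 7;
    if ((b & 0b10000000) === 0) break; // last byte of this number
  }
  // Interpret the accumulated data bits as two's complement.
  if (value >= 2 ** (bits - 1)) value -= 2 ** bits;
  return { value, next: i };
}
```

`vintEncode(65)` yields the two bytes `10000000 01000001` from the diagram above, and `vintDecode` reverses it.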
| Encoding | Bytes | Compression Ratio |
|---|---|---|
| (input) | 6000 | 0% |
| JSON | 11213 | -86.9% |
| String | 3212 | 46.5% |
| DeltaString | 1755 | 70.8% |
| Vint | 756 | 87.4% |
Packed: Unused Padding#
I have an encoding that supports larger, outlying numbers, but these numbers are rare. Looking at the numbers, I still see that the majority of them can be represented using four or five bits. This means that for most numbers, there are two or three leading bits that I can try to pack data into instead of being padding.
What if I append each number’s binary directly next to each other? Let’s see what the bytes would look like if I encoded 13 (1101) and 10 (1010). Note the leading 0 to signify that they are positive numbers.
    01101010 10000000
    └─┬─┘└──┬─┘└──┬─┘
     13    10     0?
I see a few problems here:
- How do I know that there isn’t a 0 represented in the second byte? Perhaps during serialization, I can vint-encode the number of numbers represented in the blob.
- When I see these two bytes, how do I know where the beginning and end of each number is? Perhaps I can also vint-encode, in this case, that there are five bits to a number.
- It’s convenient that the numbers 13 and 10 both require the same number of bits. Had I wanted to encode the numbers 13 and 1 instead, I can’t append `01` right after the binary of 13. If I encode that there are five bits to a number, I would have to left-pad, encoding `00001` instead.
Alright, that seems workable, but how many numbers should I encode inside one “packed” blob?
- Too many numbers: the more numbers I put into a blob, the more likely I will encounter an outlier that requires many bits, causing the rest of the numbers in the blob to have more padding.
- Too few numbers: if there are too few numbers, the metadata (how many bits to a number) I must store will be a big percentage of the blob, dragging the compression ratio down.
I can choose to encode more or fewer numbers per blob based on heuristics, but then I would also need to encode how many numbers are in each blob, which could get expensive… I’ll just hard code the number of numbers per block.
Through testing, I found that encoding 20 numbers per block was a good middle ground (this may differ across behavior patterns). Also through testing, I found that timestamp deltas tend to contain a higher percentage of large numbers compared to coordinate deltas. This makes sense, since any momentary pause in pointer movement can easily create differences in timestamps in the thousands (of milliseconds). I can take advantage of that by encoding the timestamps first, and then the x and y coordinates after. This will more likely create packed blocks using a smaller bit size.
Here’s what the beginning of a packed block will look like:
    00000101 01101010 10000010 ..
    └──┬───┘ └─┬─┘└──┬─┘└─┬─┘└─..
       │      13    10   1
       │          (0-padded)
    bit size (5, vint-encoded)
To decode:
- Vint-decode one number: this is how many bits are used to represent each of the next 20 numbers (20 is what I decided earlier as a hardcoded number).
- Decode the next 20 numbers, reading the same number of bits for each one, as specified by (1).
- Repeat (1) and (2) until I’ve decoded the expected number of datapoints.
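The packing step for a single block might be sketched like this, assuming the bit size has already been chosen and the numbers already carry their leading sign 0 within that bit size (the vint-encoded bit-size header and negative numbers are left out):

```javascript
// Pack numbers at a fixed bit width, back to back, 0-padded to a byte.
function packBits(numbers, bitSize) {
  const bits = numbers
    .map((n) => n.toString(2).padStart(bitSize, "0"))
    .join("");
  // Pad the tail with zeros so the blob ends on a byte boundary.
  const padded = bits.padEnd(Math.ceil(bits.length / 8) * 8, "0");
  const bytes = [];
  for (let i = 0; i < padded.length; i += 8) {
    bytes.push(parseInt(padded.slice(i, i + 8), 2));
  }
  return bytes;
}
```

Packing 13 and 10 at five bits each reproduces the two bytes `01101010 10000000` from the earlier diagram.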
| Encoding | Bytes | Compression Ratio |
|---|---|---|
| (input) | 6000 | 0% |
| JSON | 11213 | -86.9% |
| String | 3212 | 46.5% |
| DeltaString | 1755 | 70.8% |
| Vint | 756 | 87.4% |
| Packed | 451 | 92.5% |
Huffman: Many Duplicate Numbers#
Looking at the (delta) datapoints some more, there are some numbers that are repeated significantly more frequently than others. Yes, smaller numbers in general occur more than larger numbers, but timestamp deltas of 16 and 17 (milliseconds) are also extremely prevalent (probably due to the system trying to capture events 60 times a second).
It would be great if instead of encoding 16 or 17 in binary each time, I could assign a shorter code to them and encode that instead. By code, I mean to have a mapping where the bit 0 means the number 16, for example.
Generalizing this idea, it would be optimal if I counted the occurrences of each number, then assigned the shortest codes to the most frequently occurring numbers. This is the idea behind Huffman coding.
To get the Huffman codes, I first need to construct a Huffman tree, a full binary tree where only the leaf nodes hold our data. There are a few ways to go about doing this. Here’s one way:
- Get the frequencies of all numbers.
- Put the numbers in a min heap according to their frequency.
- Pop two nodes C1 and C2 from the heap.
- Create a new node P with its frequency being the sum of the frequencies of C1 and C2. Make C1 and C2 its children.
- Put P back in the heap.
- Repeat steps (3) to (5) until there is only one node in the heap. This is the root of the Huffman tree.
Next, I need to do a full traversal of the tree, remembering the path I take to get to each leaf. If I go left, I add 0 to the code, otherwise I add a 1 to the code.
        R            1: 0
      ╭─┴─╮         16: 10
      1   ε          2: 110
        ╭─┴─╮      285: 111
       16   ε
          ╭─┴─╮
          2   285
The diagram above shows an example Huffman tree with the numbers (1, 2, 16, 285) and their associated Huffman codes. R is the root node, and all other non-leaf nodes are labeled ε. Going through two examples: 1 has the code 0 because starting from R, I go left, and reach the 1 node; 2 has the code 110 because starting from R, I go right, right, left to reach 2.
Not only are these codes short, they also don’t require any delimiter. Reading bits from left to right, I can follow the tree starting from the root and go left or right until I get to a leaf, yielding a number. Then I know that the bit immediately following that is the start of the next left/right instructions (starting back from the root) to get me to my next number.
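The tree construction and code assignment above can be sketched as follows (I re-sort the node list each round instead of maintaining a proper min heap, just to keep the sketch short):

```javascript
// Build a Huffman tree from the numbers' frequencies.
function buildHuffmanTree(numbers) {
  const freq = new Map();
  for (const n of numbers) freq.set(n, (freq.get(n) ?? 0) + 1);
  let nodes = [...freq].map(([value, f]) => ({ value, freq: f }));
  while (nodes.length > 1) {
    nodes.sort((a, b) => a.freq - b.freq);
    const [c1, c2] = nodes.splice(0, 2); // two least frequent nodes
    nodes.push({ freq: c1.freq + c2.freq, left: c1, right: c2 });
  }
  return nodes[0]; // the root
}

// Traverse the tree; going left appends "0", going right appends "1".
function assignCodes(node, prefix = "", codes = new Map()) {
  if (node.left === undefined) {
    codes.set(node.value, prefix); // leaf: record the accumulated path
  } else {
    assignCodes(node.left, prefix + "0", codes);
    assignCodes(node.right, prefix + "1", codes);
  }
  return codes;
}
```

Feeding it input dominated by one value gives that value a one-bit code, with rarer values pushed to longer codes.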
So am I done? After I have the Huffman codes for every number I’ve seen, I simply append the code for each number I want to encode one after the other and I’m finished! Hold on, when I want to decode a blob full of these codes, how will I know what code corresponds to what number? I need to find a way to encode that information as well. I can either encode the Huffman tree somehow and reconstruct the code, or encode the number-to-code mapping directly.
Let’s try encoding the tree. One way to serialize a tree is to do a depth-first, preorder traversal, encoding each node when a “visit” is made. Typically, a special identifier is used to indicate when a child is null, meaning that every leaf would need to store two additional nulls. However, in this case, I can take advantage of the fact that Huffman trees are always full binary trees. This means that I only have two node types:
- A leaf node that has a number: this means I can just vint-encode the number, and the presence of a valid number is in itself an indication that it is a leaf node (and therefore no children).
- An internal node with no number, having exactly two non-null children: I will encode the byte `11000000` to signify this node. Why `11000000`? If this were the start of a vint-encoded number, the largest number it could represent is the vint-encoded bytes `11000000 01111111`, which is -8065 (`10000001111111` in two’s complement). A number this low or lower will never appear in a datapoint — timestamp deltas are only positive, and the highest resolution screens are around 8K, with a width of less than 8000 pixels across.
Using this idea, how would I encode the example Huffman tree from earlier?
        R
      ╭─┴─╮
      1   A
        ╭─┴─╮
       16   B
          ╭─┴─╮
          2   285
- Pre-order traversal: `R,1,A,16,B,2,285`
- Use `11000000` for non-numbers; vint-encode everything else:

        11000000: R
        00000001: 1
        11000000: A
        00010000: 16
        11000000: B
        00000010: 2
        10000010 00011101: 285
Serializing the Huffman tree using eight bytes seems pretty good, potentially better than serializing the mapping itself.
With that, I have the complete encoding. Serialize the Huffman tree, then serialize all numbers using the Huffman codes that it generates, and I am done.
| Encoding | Bytes | Compression Ratio |
|---|---|---|
| (input) | 6000 | 0% |
| JSON | 11213 | -86.9% |
| String | 3212 | 46.5% |
| DeltaString | 1755 | 70.8% |
| Vint | 756 | 87.4% |
| Packed | 451 | 92.5% |
| Huffman | 277 | 95.4% |
Coda#
And so concludes my journey into compressing pointer positions over time. Starting with 6000 bytes of data, I got it down to 277 bytes, a compression ratio of 95.4%. Not bad.
It’s important to note that Huffman, for example, is not always better than vint or packed encoding. Clicking and dragging around in this playground, depending on behavior, vint, packed, or Huffman encodings could wind up with the best compression.
There are lots of potential extensions to this project, including how to handle floating point numbers, streaming compression, more encoding schemes, and extracting ideas from research in time series databases. But maybe later.