未加星标

Variable size integer encoding

字体大小 | |
[数据库(综合) 所属分类 数据库(综合) | 发布者 店小二03 | 时间 2017 | 作者 红领巾 ] 0人收藏点击收藏

It is a known fact of nature that most integers are small. When transferring data over the network, or to/from disk, or pretty much anywhere the IO is slow, keeping data compact is a plus. Obviously, compression is an option, but it is fairly CPU heavy. Taking advantage of the fact that, in practice, most integer values are small is an option to effectively reduce the amount of data for a modest cost in term of CPU.

LEB128

LEB128 is a variable length encoding used in the file format DWARF, used for debug data and exception unwinding. Variation of LEB128 are used in many software, including git, protocol buffer and many others. The technique involve splitting the integer in groups of 7bits. Each bytes in the encoding contains 7 bits of the integer, plus one bit indicating if there is a next group of 7 bits coming.

Range Encoding 0 .. 2^7 1 0xxxxxxx 2^7 .. 2^14 1 1xxxxxxx 0xxxxxxx 2^14 .. 2^21 1 1xxxxxxx 1xxxxxxx 0xxxxxxx 2^21 .. 2^28 1 1xxxxxxx 1xxxxxxx 1xxxxxxx 0xxxxxxx 2^28 .. 2^35 1 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 0xxxxxxx UTF-8

UTF-8 was made with a few constraints in mind. First, it had to be compatible with ASCII, which means all value bellow 128 must be encoded as this. In addition, none of the remaining bytes in the multibyte encoding must be in the 0 127 range, so legacy software wouldn’t confuse them for ASCII characters. As text processing is often performance critical, it needed to be decodable quickly. Finally, decoding would happen forward, but also backward.

UTF-8 ‘s first byte is made of a variable numbers of 1, followed by a 0, and then actual bits from the data. The number of ones indicate the number of bytes this value uses. For one byte values, the number of ones would be 0, and, as a result, integer in the 0 127 range can be encoded as themselves (they all start with a 0 bit). This means that, contrary to the LEB style of encoding, the number of bytes is known up-front and this can be leveraged to increase ILP when forward decoding.

Following bytes are started by 10, and then 6 buts of payload. This is necessary so that none of theses bytes ends up in the 0 127 range, but also to help backward decoding. Because 1 bytes integers start with 0 and 2 bytes ones starts with 110, a bytes starting with 10 is a sure sign that it is within a multi-byte encoding.

Range Encoding 0 .. 2^7 1 0xxxxxxx 2^7 .. 2^11 1 110xxxxx 10xxxxxx 2^11 .. 2^16 1 1110xxxx 10xxxxxx 10xxxxxx 2^16 .. 2^21 1 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx 2^21 .. 2^26 1 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The UTF-8 encoding is not as efficient in term of space as the previously mentioned encoding, however, it is remarkable that it can achieve such a level of compatibility with raw ASCII, while remaining fairly compact and efficient.

Bitcoin’s var_int

The bitcoin protocol uses a form of variable size encoding for integer, mostly for expressing array like structures. The first byte of the integer is either 0xFD for 3 bytes encoding, 0xFE for 5 bytes encoding, 0xFF for 8 bytes encoding and is considered to be a 1 bytes integer for anything less than 0xFD .

this scheme is quite primitive but is very easy to understand. It encode integer on 1 bytes for values up to 252, but pessimize everything past that point because it require a 1 byte prefix.

Range Encoding 0 .. 252 (1 byte integer) 253 .. 2^16 1 0xFD (2 byte integer) 2^16 .. 2^32 1 0xFE (4 byte integer) 2^32 .. 2^64 1 0xFF (8 byte integer) Dlugosz’s Variable-Length Integer Encoding

This encoding is used in the ZIP format. It strive to encode small integers efficiently, but also try to have efficient encoding for larger yet common integer formats, such as 128 bits UUID, arbitrarily large integers and has extension points for future encoding. The format uses a header as does UTF-8 , but the header is somewhat more complex and subsequent bytes do not need to start with 10.

Follow that link to learn more about Dlugosz’ Variable-Length Integer Encoding .

Putting it all together

All these formats strike different trade offs. However, most do have some inherent inefficiencies. For instance, most of these formats offer redundant ways to encode small numbers as large number encoding can also be used to store small numbers. The range that large numbers can encode could by offsetting it by the largest value the previous range can encode. This comes at the cost of one addition per encoding/decoding, which is relatively cheap. While the range that one would win isn’t that great, it is sometime a problem to not have a canonical representation of numbers, and, in these case, a range check needs to be performed anyway, and that’s no less expensive than the addition, but do not extend the range.

Negative integers are represented as 2’s complement in computers. This effectively makes negative number looks like very large number and defeat most of these encoding schemes. In that case, when the number is negative, one will flip all the bits of the number, putting it back in the small number range. Then a bit is stolen somewhere to indicate if the bits of the final result needs to be flipped or not. If this bit is the chosen bit is the least significant one, encoding and decoding can be sone as such:

uint encode(int x) { auto base = x * 2; return (x >= 0) ? base : ~base; } int decode(uint x) { auto base = x / 2; return (x & 0x01) ? base : ~base; } Rolling our own variable length encoding format

In this exercise, we will create an encoding format for unsigned numbers. It matter to us that there is a canonical way to encode numbers, and we want the encoding and decoding process to be fast, so we’d like to know the number of bytes we are dealing with up-front. We don’t care much about backward decoding or accommodating other special number formats such as Dlugosz’s does.

We will chose to encode out first byte just as UTF-8 does, but remove the constraint that subsequent bytes needs to start with 10. This will allow to use an extra 2 bits per byte and will reduce the work needed to be done to clean these 2 bits when decoding. Because we want a canonical encoding, we will offset the ranges encoded.

Because we do not care about number greater than 2^64 in our use case, we can simply chose that if the first byte is full of ones, then the following 8 bytes are the integer we are encoding.

Range Encoding 0 .. 0x7F 0xxxxxxx 0x80 .. 0x407F 10xxxxxx xxxxxxxx 0x4080 .. 0x20407F 110xxxxx xxxxxxxx xxxxxxxx 0x204080 .. 0x1020407F 1110xxxx xxxxxxxx xxxxxxxx xxxxxxxx 0x10204080 .. 0x081020407F 11110xxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx Because we ar

本文数据库(综合)相关术语:系统安全软件

主题: CPUUT
分页:12
转载请注明
本文标题:Variable size integer encoding
本站链接:http://www.codesec.net/view/522245.html
分享请点击:


1.凡CodeSecTeam转载的文章,均出自其它媒体或其他官网介绍,目的在于传递更多的信息,并不代表本站赞同其观点和其真实性负责;
2.转载的文章仅代表原创作者观点,与本站无关。其原创性以及文中陈述文字和内容未经本站证实,本站对该文以及其中全部或者部分内容、文字的真实性、完整性、及时性,不作出任何保证或承若;
3.如本站转载稿涉及版权等问题,请作者及时联系本站,我们会及时处理。
登录后可拥有收藏文章、关注作者等权限...
技术大类 技术大类 | 数据库(综合) | 评论(0) | 阅读(63)