
LMDB (and RocksDB) Benchmark on Intel NVMe Optane SSD
Intel Optane SSD Microbenchmark

Symas Corp., August 2018

Following on our previous LMDB benchmarks, we recently had an opportunity to test on some Intel Optane hardware. In this set of tests we're using LMDB and RocksDB. (Other variants of LevelDB don't support ACID transactions, so they're not in the same class of DB engine anyway.) Also, instead of reusing the LevelDB benchmark code that we ported to other DB engines, we're now using our own functionally equivalent rewrite of the benchmarks in plain C.

Since the point of these tests is to explore the performance of the Optane SSDs, the tests are configured much like the previous on-disk benchmark, using a database approximately 5x larger than RAM, to minimize the impact of caching in RAM and force the storage devices to be exercised. However, there are some twists to this as well: the Optane SSDs on NVMe can also be operated as if they were system RAM. The Optane technology still has higher latency than DRAM, but as we'll see, there's still a performance benefit to using this mode.

The hardware for these tests was graciously provided by our friends at Packet, and system support was provided by Intel. The machine was based on an Intel S2600WFT motherboard with a pair of 16-core/32-thread Intel Xeon Gold 6142 processors and 192GB DDR4-2666 DRAM. Storage being tested included a 4TB DC P4500 TLC NAND-Flash SSD and three 750GB DC P4800X Optane SSDs. The machine had Ubuntu 16.04 installed, with a 4.13.0-41-generic kernel. The software versions being used are LMDB 0.9.70 and RocksDB 5.7.3, both compiled from their respective git repos. (Note that LMDB 0.9.70 is the revision in the mdb.master branch, not an officially released version. The main difference is the addition of support for raw devices.)

Prior tests have already illustrated how performance varies with record sizes. In these tests we're strictly interested in the relative performance across the different storage types so we're only testing with a single record size. We're using the ext4 filesystem in these tests, configured once with journaling enabled and once with journaling disabled. Each test begins by loading the data onto a freshly formatted filesystem. We use a 750GB partition on the 4TB Flash SSD, to ensure that the filesystem metadata overhead is identical on the Flash and Optane filesystems. Additionally, we test LMDB on raw block devices, with no filesystem at all, to explore how much overhead the filesystems impose. RocksDB doesn't support running on raw block devices, so it is omitted from those tests.
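For reference, opening an LMDB environment is the same in both configurations; the only difference is whether the path handed to mdb_env_open() is a data file on ext4 or the block device itself (the latter requiring the raw-device support in mdb.master). Below is a minimal sketch, not taken from the benchmark driver; the paths, the MDB_NOSUBDIR flag choice, and the 750GB mapsize are illustrative assumptions:

```c
/* Minimal env-open sketch. Build against the mdb.master branch:
 *   cc open_env.c -llmdb
 * Paths/device names below are placeholders. */
#include <stdio.h>
#include <stdlib.h>
#include "lmdb.h"

int main(int argc, char **argv)
{
    /* e.g. "/mnt/ext4/db" for the filesystem runs,
     * "/dev/nvme1n1p1" for the raw-device runs (placeholders). */
    const char *path = argc > 1 ? argv[1] : "/mnt/ext4/db";
    MDB_env *env;
    int rc;

    if ((rc = mdb_env_create(&env)) != MDB_SUCCESS) {
        fprintf(stderr, "mdb_env_create: %s\n", mdb_strerror(rc));
        return EXIT_FAILURE;
    }
    /* Map the whole 750GB partition; must be at least the expected DB size. */
    mdb_env_set_mapsize(env, 750ULL * 1024 * 1024 * 1024);

    /* MDB_NOSUBDIR: treat the path as the data file (or device) itself,
     * rather than a directory containing data.mdb/lock.mdb. */
    rc = mdb_env_open(env, path, MDB_NOSUBDIR, 0664);
    if (rc != MDB_SUCCESS) {
        fprintf(stderr, "mdb_env_open(%s): %s\n", path, mdb_strerror(rc));
        mdb_env_close(env);
        return EXIT_FAILURE;
    }
    printf("opened %s\n", path);
    mdb_env_close(env);
    return EXIT_SUCCESS;
}
```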

The test is run using 80 million records with 16 byte keys and 4000 byte values, for a target DB size of around 300GB. The system is set so that only 64GB RAM is available during the test run. After the data is loaded, a readwhilewriting test is run multiple times in succession. The number of reader threads is set to 1, 2, 4, 8, 16, 32, and 64 threads for each successive run. (There is always only a single writer.) All of the threads operate on randomly selected records in the database. The writer performs updates to existing records; no records are added or deleted, so the DB size should not change much during the test. The results are detailed in the following sections.
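The actual driver isn't reproduced here, but the shape of the readwhilewriting phase is straightforward: one writer thread keeps overwriting randomly chosen existing records while N reader threads do random point lookups until the run ends. Here is a condensed sketch of that loop against the LMDB API; it is not the benchmark code itself, and the key format, seeds, env path, and fixed run duration are assumptions:

```c
/* Condensed readwhilewriting sketch: one writer, N readers, random keys.
 * Not the actual benchmark driver; key format, seeds, env path, and the
 * fixed run duration are assumptions made for illustration. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include "lmdb.h"

#define NUM_RECORDS 80000000UL  /* 80M records, as created by the load phase */
#define VALUE_SIZE  4000
#define NUM_READERS 16          /* the tests sweep 1 through 64 */

static MDB_env *env;
static MDB_dbi  dbi;
static volatile int done;

static void make_key(char *buf, unsigned long n)
{
    snprintf(buf, 17, "%016lu", n); /* 16-byte zero-padded key (assumed format) */
}

static void *reader(void *arg)
{
    unsigned seed = (unsigned)(size_t)arg;
    char kbuf[17];
    while (!done) {
        MDB_txn *txn; MDB_val key, data;
        make_key(kbuf, rand_r(&seed) % NUM_RECORDS);
        key.mv_data = kbuf; key.mv_size = 16;
        mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);
        mdb_get(txn, dbi, &key, &data);  /* random point lookup */
        mdb_txn_abort(txn);              /* read-only txn: nothing to commit */
    }
    return NULL;
}

static void *writer(void *arg)
{
    unsigned seed = 1;
    char kbuf[17], vbuf[VALUE_SIZE];
    (void)arg;
    memset(vbuf, 'x', sizeof vbuf);
    while (!done) {
        MDB_txn *txn; MDB_val key, data;
        make_key(kbuf, rand_r(&seed) % NUM_RECORDS);
        key.mv_data = kbuf;  key.mv_size = 16;
        data.mv_data = vbuf; data.mv_size = VALUE_SIZE;
        mdb_txn_begin(env, NULL, 0, &txn);
        mdb_put(txn, dbi, &key, &data, 0); /* overwrite an existing record */
        mdb_txn_commit(txn);               /* synchronous commit per update */
    }
    return NULL;
}

int main(void)
{
    pthread_t w, r[NUM_READERS];
    MDB_txn *txn;
    int i;

    mdb_env_create(&env);
    mdb_env_set_mapsize(env, 400ULL << 30);        /* comfortably > 300GB DB */
    mdb_env_set_maxreaders(env, NUM_READERS + 2);
    mdb_env_open(env, "/mnt/ext4/db", MDB_NOSUBDIR, 0664); /* loaded earlier */
    mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);
    mdb_dbi_open(txn, NULL, 0, &dbi);              /* default (unnamed) DB */
    mdb_txn_commit(txn);

    pthread_create(&w, NULL, writer, NULL);
    for (i = 0; i < NUM_READERS; i++)
        pthread_create(&r[i], NULL, reader, (void *)(size_t)(i + 1));
    sleep(60);                                     /* fixed-duration run */
    done = 1;
    pthread_join(w, NULL);
    for (i = 0; i < NUM_READERS; i++)
        pthread_join(r[i], NULL);
    mdb_env_close(env);
    return 0;
}
```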

Here are the stats collected from initially loading the DB for the various storage configurations. (Wall, User, and Sys are elapsed times; DB Size is in KB; Vol/Invol are voluntary/involuntary context switches; In/Out are filesystem input/output operations.)

LMDB

Storage      | Journal | Wall       | User     | Sys      | CPU % | DB Size (KB) | Vol      | Invol | In        | Out        | Write Amp
Flash/Ext4   | Y       | 11:50.91   | 01:15.70 | 09:40.36 |    92 |    322683976 |  5910595 |  1303 |      2640 |  840839736 | 10.5104967
Flash/Ext4   | N       | 13:21.04   | 01:16.69 | 11:01.86 |    92 |    322683976 |  8086767 |  1241 |      3696 |  946659568 | 11.8332446
Flash (raw)  | N       | 17:25.23   | 03:29.26 | 04:11.36 |    44 |            - | 80669411 |  1346 | 645369800 |  645487344 |  8.0685918
Optane/Ext4  | Y       | 14:20.99   | 01:12.78 | 12:09.88 |    93 |    322683976 |  9991458 |  1170 |       552 |  928896808 | 11.6112101
Optane/Ext4  | N       | 15:11.10   | 01:16.72 | 12:49.09 |    92 |    322683976 | 10487638 |  1377 |      1080 | 1029364408 | 12.8670551
Optane (raw) | N       | 20:26.19   | 03:30.62 | 03:55.97 |    36 |            - | 80670953 |  1305 | 645367344 |  645547472 |  8.0693434

RocksDB

Storage      | Journal | Wall       | User     | Sys      | CPU % | DB Size (KB) | Vol      | Invol | In        | Out        | Write Amp
Flash/Ext4   | Y       | 15:00.44   | 13:01.27 | 11:45.63 |   165 |    318790584 |   231768 |  3184 |     11400 | 1265319232 | 15.8164904
Flash/Ext4   | N       | 14:30.45   | 12:53.43 | 10:46.62 |   163 |    318790584 |   215318 |  2786 |     11016 | 1265362424 | 15.8170303
Optane/Ext4  | Y       | 02:13:40.00| 13:51.74 | 11:14.07 |    18 |    318790328 |   339737 |  7549 |     11088 | 1265319000 | 15.8164875
Optane/Ext4  | N       | 02:13:40.00| 13:47.29 | 10:49.81 |    18 |    318790328 |   337922 |  7598 |     11256 | 1265364360 | 15.8170545

The "Wall" time is the total wall-clock time taken to run the loading process. Obviously shorter times are faster/better. The actual CPU time used is shown for both User mode and System mode. User mode represents time spent in actual application code; time spent in System mode shows operating system overhead where the OS must do something on behalf of the application, but not actual application work. In a pure RAM workload where no I/O occurs, ideally the computer should be spending 100% of its time in User mode, processing the actual work of the application. Since this workload is 5x larger than RAM, it's expected that a significant amount of time is spent in System mode performing actual I/O.

The "CPU" column is the ratio of adding the User and System time together, then dividing by the Wall time, expressed as a percentage. This shows how much work of the DB load occurred in background threads. Ideally this value should be 100, all foreground and no background work. If the value is greater than 100 then a significant portion of work was done in the background. If the value is less than 100 then a significant portion of time was spent waiting for I/O. When a DB engine relies heavily on background processing to achieve its throughput, it will bog down more noticeably when the system gets busy. I.e., if the system is already busy doing work on behalf of users, there will not be any idle system resources available for background processing.

The "Context Switches" columns show the number of Voluntary and Involuntary context switches that occurred during the load. Voluntary context switches are those which occur when a program calls a function that can block - system calls, mutexes and other synchronization primitives, etc. Involuntary context switches occur e.g. when a CPU must handle an interrupt, or when the running thread's time slice has been fully consumed. LMDB issues write() system calls whenever it commits a transaction, so there are a lot of voluntary context switches here. However, not every write() results in a context switch - this depends largely on the behavior of the OS filesystem cache. RocksDB is configured with a large cache (32GB, one half of available RAM) as well as a large write buffer (256MB) so it has far fewer voluntary context switches. But since this workload is dominated by I/O, the CPU overhead of LMDB"s context switches has little impact on the overall runtime.

The "FS Ops" columns show the number of actual I/O operations performed, which is usually different from the number of DB operations performed. Since the loading task is "write-only" we would expect few, if any, input operations. However, since the DB is much larger than RAM, it's normal for some amount of metadata to need to be re-read during the course of the run, as the written data pushes other information out of the filesystem cache. The number of outputs is more revealing, as it directly shows the degree of write amplification occurring. There are only 80 million DB writes being performed, but there are far more than 80 million actual writes occurring in each run. The results with the raw block device shows that the filesystem adds 25% more writes than the DB itself.


[Chart: DB load times for each storage configuration]

There are a few unexpected results here. The LMDB loads actually ran slower with the filesystem journal turned off, and the LMDB loads on the raw block device ran slower than with a filesystem. The I/O statistics imply that the block device wasn't caching any of the device reads. RocksDB has a serious performance issue on the Optane filesystems, taking over 2 hours to load the data. There's no explanation for that yet.

Here are the load times plotted again, without the 2-hour outliers.


[Chart: DB load times, excluding the 2-hour RocksDB outliers]

With LMDB on the raw block device, each write of a record results in an immediate write to the device, which always causes a context switch. So for 80 million records there are at least 80 million voluntary context switches. In general, even though this is a purely sequential workload, RocksDB performs more filesystem writes per database write than LMDB, and usually more filesystem reads. The latter is somewhat surprising, because LSM-based designs are supposed to support "blind writes" - i.e., writing a new record shouldn't require reading any existing data - which is touted as one of the features that makes them "write-optimized." That LSM advantage is not in evidence here.

Overall, the specs for the Optane P4800X show 11x higher random write IOPS and lower latency than the Flash P4500 SSD, yet all of the load results here are slower for the P4800X than for the Flash SSD. Again, we have no explanation for why the results aren't more reflective of the drive specs. At a guess, it may be due to wear on the SSDs from previous users. It was hoped that doing a fresh mkfs before each run, which also explicitly performs a Discard Blocks step on the device, would avoid wear-related performance issues, but that seems to have had no effect.
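For the filesystem runs, that discard happened as part of mkfs.ext4. For raw-device runs there is no mkfs, but an equivalent whole-device discard can be issued with the Linux BLKDISCARD ioctl; a minimal sketch (not from the test scripts; the default device name is a placeholder):

```c
/* Discard (TRIM) an entire block device, equivalent to the "Discard Blocks"
 * step mkfs performs. The default device name is a placeholder. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* BLKGETSIZE64, BLKDISCARD */

int main(int argc, char **argv)
{
    const char *dev = argc > 1 ? argv[1] : "/dev/nvme1n1";  /* placeholder */
    int fd = open(dev, O_WRONLY);
    if (fd < 0) { perror(dev); return 1; }

    uint64_t size;
    if (ioctl(fd, BLKGETSIZE64, &size) < 0) { perror("BLKGETSIZE64"); return 1; }

    uint64_t range[2] = { 0, size };   /* start offset, length in bytes */
    if (ioctl(fd, BLKDISCARD, range) < 0) { perror("BLKDISCARD"); return 1; }

    printf("discarded %llu bytes on %s\n", (unsigned long long)size, dev);
    close(fd);
    return 0;
}
```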

Throughput

The results for running the actual readwhilewriting test with varying numbers of readers are shown here.


[Chart: readwhilewriting random write throughput vs. number of reader threads]

Write throughput for RocksDB is uniformly slow, regardless of whether using the Flash or Optane SSD. In contrast, LMDB shows the performance difference that Optane offers, quite dramatically, with peak random write throughputs up to 3.5x faster on Optane than on Flash. Using the raw block device also yields a slightly higher write throughput than using the ext4 filesystem.


[Chart: readwhilewriting random read throughput vs. number of reader threads]

The difference in read throughput between Flash and Optane isn't so great at the peak workload of 64 reader threads, but the differences are more obvious at lower thread counts. With LMDB on Flash, doubling the number of reader threads essentially doubles throughput, except at 64 readers, where the increase is much smaller. The way the results bunch up at thread counts of 8 or more for LMDB on Optane implies that the I/O subsystem gets bottlenecked, and there's no headroom for further doubling. RocksDB's peak is still about the same (or slightly slower) on Optane as on Flash, and still slower than LMDB.

Conclusion

The LMDB engine will never be the bottleneck in your workloads. When you move to faster storage technologies, LMDB will let you utilize the full potential of that hardware. Inferior technologies like LSM designs won't.

PS: We mentioned using the Optane SSD as RAM in the introduction. Those test results will be shown in an upcoming post.

Files

The files used to perform these tests are all available for download.

data.tgz (90,318,154 bytes, Jul 27 01:28) - command scripts, output, atop records

A LibreOffice spreadsheet with the tabulated results is available here. The source code for the benchmark drivers is all on GitHub. We invite you to run these tests yourself and report your results back to us.
