One of my most often repeated pieces of performance advice when using CouchDB or Cloudant is to avoid using include_docs=true in view queries. When you look at the work CouchDB needs to do, the reason for the recommendation becomes obvious.

During a normal view query, CouchDB only needs to read a single file on disk, the view index, and it streams results directly from that file. The index is stored as a b-tree. Therefore, if you are reading the entire index or doing a range query with startkey and endkey, CouchDB can find the appropriate starting point in the file and then read until it reaches the end of the index or the endkey. Not much data needs to be held in memory, as it can go straight onto the wire. Because any data emitted by the map function is stored inline in the b-tree, it's very fast to stream this as part of the response.
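To make the range query concrete, here is a minimal sketch that builds the URL for a view request with startkey and endkey. The host, database name and key values are placeholders of mine rather than anything from the benchmark; the detail worth noting is that CouchDB expects both parameters to be JSON-encoded, so string keys keep their quotes.

```javascript
// Build a CouchDB view URL for a range query.
// `base`, `db`, `ddoc` and `view` are whatever your setup uses; the values
// in the usage line below are placeholders, not taken from the benchmark.
function viewRangeUrl(base, db, ddoc, view, startkey, endkey) {
  const params = new URLSearchParams();
  // Keys are JSON values, so a string key must keep its quotes.
  params.set("startkey", JSON.stringify(startkey));
  params.set("endkey", JSON.stringify(endkey));
  return `${base}/${db}/_design/${ddoc}/_view/${view}?${params}`;
}

console.log(
  viewRangeUrl("http://localhost:5984", "testdb", "test", "emitnull", "a", "m")
);
```

URLSearchParams takes care of percent-encoding the quotes, which is easy to get wrong when assembling the query string by hand.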

When you use include_docs=true, CouchDB has a lot more work to do. In addition to streaming out the view row data as above, CouchDB has to read each and every document referenced by the view rows it returns. Briefly, this involves:

1. Loading up the database's primary data file.
2. Using the document ID index contained in that file to find the offset within that file where the document data resides.
3. Finally, reading the document data itself before returning the row to the client.

Given the view is ordered by some field in the document rather than by doc ID, this is essentially a lot of random document lookups within the main data file. That's a lot of extra tree traversals and data read from disk.
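The extra work can be pictured with a toy in-memory model. This is only a sketch of the access pattern, not CouchDB's storage engine: a sorted array stands in for the view index, a Map stands in for the document ID index, and all the names are mine.

```javascript
// Toy in-memory model, NOT CouchDB's real storage engine.
// viewIndex: rows sorted by key, as a view index would be.
// byId: stands in for the doc ID index in the main database file.
function queryView(viewIndex, byId, includeDocs) {
  let docLookups = 0;
  const rows = viewIndex.map((row) => {
    if (!includeDocs) return { ...row }; // stream the row as stored
    docLookups += 1; // one extra lookup per row
    return { ...row, doc: byId.get(row.id) };
  });
  return { rows, docLookups };
}

const docs = [
  { _id: "c", value1: "x" },
  { _id: "a", value1: "z" },
  { _id: "b", value1: "y" },
];
const byId = new Map(docs.map((d) => [d._id, d]));
// View keyed on value1, so row order differs from doc ID order.
const viewIndex = docs
  .map((d) => ({ id: d._id, key: d.value1, value: null }))
  .sort((r1, r2) => (r1.key < r2.key ? -1 : r1.key > r2.key ? 1 : 0));

console.log(queryView(viewIndex, byId, false).docLookups);
console.log(queryView(viewIndex, byId, true).docLookups);
console.log(queryView(viewIndex, byId, true).rows.map((r) => r.id));
```

Without include_docs the model never touches byId. With it, every row costs one lookup, and the lookups arrive in key order (c, b, a) rather than ID order, which is the random-access pattern described above.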

While in theory this is going to be much slower (and many people I trust have told me so), I'd not done a simple benchmark to get a feel for the difference myself. So I finally got around to doing a quick experiment to see what kind of effect this has. It was just on my MacBook Air (Mid-2012, 2GHz i7, 8GB RAM, SSD), using CouchDB 1.6.1 in a single node instance, so the specific values are fairly meaningless. The process:

1. I uploaded 100,000 identical tiny documents to the CouchDB instance. The tiny document hopefully minimises the actual document data read time vs. the lookups involved in reading data.
2. I created two views, one which saved the document data into the index and one which emitted null instead.
3. I pre-warmed the views by retrieving each to make sure that CouchDB had built them.
4. I did a few timed runs of retrieving every row in each view in a single call. For the null view, I timed both include_docs=true and include_docs=false.
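The post doesn't record how the documents were uploaded. One plausible way, sketched here under that assumption, is CouchDB's _bulk_docs endpoint, sending the documents in batches. The database URL, the batch size and the choice to let CouchDB assign the _id values are all mine.

```javascript
// Build a _bulk_docs request body containing `count` tiny documents.
// The field name `value1` mirrors the sample document in the post; letting
// CouchDB assign the _id values is an assumption of this sketch.
function makeBulkPayload(count, value) {
  const docs = Array.from({ length: count }, () => ({ value1: value }));
  return { docs };
}

// Upload in batches so no single request body gets too large.
// `dbUrl` is a placeholder such as "http://localhost:5984/testdb".
// Uses the global fetch available in Node 18 and later.
async function uploadDocs(dbUrl, total, batchSize) {
  for (let sent = 0; sent < total; sent += batchSize) {
    const count = Math.min(batchSize, total - sent);
    const res = await fetch(`${dbUrl}/_bulk_docs`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(makeBulkPayload(count, "sjntwpacttftohms")),
    });
    if (!res.ok) throw new Error(`bulk upload failed: ${res.status}`);
  }
}
```

Calling uploadDocs("http://localhost:5984/testdb", 100000, 10000) against a running instance would send ten requests of 10,000 documents each.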

The view was simply:

{
  "_id": "_design/test",
  "language": "javascript",
  "views": {
    "emitdoc": {
      "map": "function(doc) {\n emit(doc.id, doc);\n}"
    },
    "emitnull": {
      "map": "function(doc) {\n emit(doc.id, null);\n}"
    }
  }
}

And each document looked like:

{ "_id": "0d469cdd8a7c054bf5eed0c954000ba4", "value1": "sjntwpacttftohms" }

I then called each view and read through the whole thing, all 100,000 rows. I timed the calls using curl. It's not very statistically rigorous, but I don't think it needs to be for this magnitude of difference. For kicks, I also eyeballed the CPU usage in top during each call and guessed an average.
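The timings in this post came from curl. For anyone who would rather script the measurement, a rough JavaScript equivalent might look like the sketch below. It is not what was used for the numbers here, and the URL in the commented-out usage is a placeholder.

```javascript
// Time an async operation and return elapsed wall-clock seconds.
async function timeIt(task) {
  const start = performance.now();
  await task();
  return (performance.now() - start) / 1000;
}

// Fetch a whole view response and discard it, similar in spirit to
// sending curl's output to /dev/null. Uses the global fetch (Node 18+).
async function readView(url) {
  const res = await fetch(url);
  if (!res.ok) throw new Error(`request failed: ${res.status}`);
  await res.arrayBuffer(); // force the full body to be read
}

// Example (needs a running CouchDB, so it is left commented out):
// timeIt(() => readView("http://localhost:5984/testdb/_design/test/_view/emitnull"))
//   .then((s) => console.log(s.toFixed(3), "seconds"));
```

Reading the full body matters: the response is streamed, so stopping at the headers would measure only the time to the first byte.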

| Test | Time, seconds | Eye-balled CPU |
| --- | --- | --- |
| emitdoc | 5.821 | 105% |
| emitnull | 4.502 | 99% |
| emitnull?include_docs=true | 48.492 | 140% |

The headline result is that reading the document from the view index itself (emitdoc) was just over 8x faster than using include_docs. It also used noticeably less CPU. There's also a difference between reading emitnull and emitdoc, though it is far less pronounced.
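For anyone wanting to check the arithmetic, the ratios fall straight out of the three timings. The per-row figures are an extra derived from the same numbers and the 100,000 row count; the helper names are mine.

```javascript
// Timings in seconds, copied from the results above.
const timings = {
  emitdoc: 5.821,
  emitnull: 4.502,
  includeDocs: 48.492,
};
const ROWS = 100000;

const ratio = (slow, fast) => (slow / fast).toFixed(2);
const msPerRow = (seconds) => ((seconds * 1000) / ROWS).toFixed(3);

console.log(ratio(timings.includeDocs, timings.emitdoc));  // 8.33
console.log(ratio(timings.includeDocs, timings.emitnull)); // 10.77
console.log(ratio(timings.emitdoc, timings.emitnull));     // 1.29
console.log(msPerRow(timings.includeDocs));                // 0.485
console.log(msPerRow(timings.emitdoc));                    // 0.058
```

So include_docs cost roughly half a millisecond per row on this machine, against about 0.06 ms when the document was read from the index.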

This was done on CouchDB 1.6.1 on my laptop. So while it wasn't a Cloudant or CouchDB cluster, given clustered query processing and clustered read behaviour, I would expect the results there to be similar or worse.

While this is a read of 100,000 documents, which you might say is unusual, over the many calls an application will make for smaller numbers of documents this kind of difference will add up. In addition, it adds a lot of load to your CouchDB instance, and likely screws around with disk caches and the like.

So, broadly, it seems pretty sound advice to avoid include_docs=true in practice as well as in theory.

As a bonus, here's how to time things using curl.
