
Measuring Capacity Through Utilization


One of my favorite concepts when thinking about instrumenting a system to understand capacity is what I call “time utilization”. This is the question of, for a given thread in some system, what fraction of its time is spent doing what kind of work?

Let’s consider this notion via the concrete example of a job that consumes request logs off of a queue and updates hit counts in a database:

while True:
    req = queue.pop()
    route = compute_route(req)
    db.execute('''
        UPDATE hitcount SET hits = hits + 1
        WHERE route = ?
    ''', (route, ))
    req.finish()

How should we instrument this? How can we monitor how it is performing, understand why it’s getting slow if it ever does get too slow, and make an informed decision about how many such processes we need to run in order to comfortably handle our load?

We’re making several network calls (to our queue and to our database) and doing some local computation (compute_route). A good first instinct is to instrument the network calls, since those are likely the slow parts:

# ...
with statsd.timer('hitcount.update'):
    db.execute('''
        UPDATE hitcount SET hits = hits + 1
        WHERE route = ?
    ''', (route, ))
with statsd.timer('hitcount.finish'):
    req.finish()
# ...

With most statsd implementations, this will give you histogram statistics on the latencies of your database call; perhaps it will report the median, p90, p95, and p99.

We can definitely infer something about this process’s behavior from those statistics; if the p50 on hitcount.update is too high, our database is likely overloaded and we need to do something about that. But how should we interpret a slow p95 or p99? This is an asynchronous background job, so we don’t much care about the occasional slow iteration, as long as the job keeps up in general. And how do we tell whether we’re running near capacity? If the job makes multiple calls, how do we aggregate the histograms into a useful summary?

We can get a much cleaner picture if we realize that we don’t care about the absolute time spent in each component so much as we care about the fraction of all time spent in each phase of the component’s lifecycle. We can instrument that by counting the total time spent in each phase, and then configuring our metrics frontend to normalize it to a per-second rate. That would look something like this (I’m assuming dogstatsd here, and a tag-based scheme):

import time
from contextlib import contextmanager

@contextmanager
def op(what):
    start = time.time()
    yield
    statsd.increment('hitcount.total_s',
                     value=(time.time() - start),
                     tags=["op:" + what])

while True:
    with op('receive'):
        req = queue.pop()
    with op('compute_route'):
        route = compute_route(req)
    with op('update'):
        db.execute('''
            UPDATE hitcount SET hits = hits + 1
            WHERE route = ?
        ''', (route, ))
    with op('finish'):
        req.finish()
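As a quick sanity check, here is a minimal sketch of exercising the op() helper against a stub client; the RecordingStatsd class is hypothetical test scaffolding, not part of any statsd library, and it works because op() looks up statsd as a module-level global:

import time

class RecordingStatsd:
    # Stand-in for the real client: just records every increment() call.
    def __init__(self):
        self.samples = []

    def increment(self, metric, value=1, tags=None):
        self.samples.append((metric, value, tags))

statsd = RecordingStatsd()

with op('update'):
    time.sleep(0.05)  # stand-in for the database call

metric, seconds, tags = statsd.samples[0]
print(metric, round(seconds, 2), tags)  # hitcount.total_s 0.05 ['op:update']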

We can now tell our metrics system to construct a stacked graph of this metric, grouped by the op tag and normalized to show the count per second:
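With Datadog, for instance (the natural backend for dogstatsd), the query behind such a graph might look roughly like the line below, rendered as stacked areas; the exact syntax is backend-specific, so treat this as a sketch rather than a recipe:

per_second(sum:hitcount.total_s{*} by {op})

Any backend that can sum a counter by tag and normalize it to a per-second rate will do.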


[Figure: stacked graph of hitcount.total_s per second, broken out by op tag]

The total across the stack tells you the number of workers, since each worker performs one second of work per second (above, I’m running four workers, so the total is close to four, but we’re losing a bit of time somewhere).

We can also see immediately that we’re spending most of our time talking to the database: about 2.7 seconds of work per second, or (dividing by our 4 workers) roughly 68% of our time.

The image above shows a system running at capacity. If we’re not at capacity, the graph gives us an easy way to eyeball how much spare capacity we have:


[Figure: the same stacked graph for a system running with spare capacity]
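To make that eyeballing concrete, here is a small sketch of the arithmetic with made-up per-op readings for a system that has headroom; the four workers come from the setup above, and everything else is illustrative:

# Hypothetical seconds of work accumulated per wall-clock second, per op,
# as read off a stacked graph like the one above (illustrative values only).
per_second = {'receive': 0.3, 'compute_route': 0.2, 'update': 1.3, 'finish': 0.2}
workers = 4  # each worker can do at most one second of work per second

busy = sum(per_second.values())  # 2.0 seconds of work per second
utilization = busy / workers     # 0.5 -> running at about 50% of capacity
headroom = workers / busy        # ~2x current throughput, assuming the
                                 # per-request cost stays roughly constant
print(f"utilization: {utilization:.0%}, headroom: ~{headroom:.1f}x")

If that headroom ratio drifts toward 1x, it’s time to add workers (or make the database call cheaper) before the queue starts backing up.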
