
Osquery: Under the Hood

[Category: Databases (General) | Posted by: 店小二05 | Date: 2018 | Author: 红领巾]

Four years, 243 contributors, and 4,573 commits (and counting!) have gone into the development of osquery. It is a complex project, with performance and reliability guarantees that have enabled its deployment on millions of hosts across a variety of top companies. Want to learn more about the architecture of the system?

This look under the hood is intended for users who want to step up their osquery game, developers interested in contributing to osquery, or anyone who would like to learn from the architecture of a successful open source project.

For those new to osquery, it may be useful to start with Monitoring macOS hosts with osquery, which provides an introduction to how the project is actually used.


Data flows within osquery

Query Engine

The promise of osquery is to serve up instrumentation data in a consistent fashion, enabling ordinary users to perform sophisticated analysis with a familiar SQL dialect. Osquery doesn’t just use SQLite syntax; the query engine is SQLite. Osquery gets all of the query parsing, optimization and execution functionality from SQLite, enabling the project to focus on finding the most relevant sources for instrumentation data.


It’s important to mention that, while osquery uses the SQLite query engine, it does not actually use SQLite for data storage. Most data is generated on-the-fly at query execution time through a concept we call “Virtual Tables”. Osquery does need to store some data on the host, and for this it uses an embedded RocksDB database (discussed later).
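To make the division of labor concrete, here is a minimal sketch of the kind of join SQLite can evaluate over osquery tables: root-owned processes with sockets open to non-local hosts. It is an illustration, not necessarily the query from the original figure. The `processes` and `process_open_sockets` names follow osquery's tables, but here they are ordinary in-memory SQLite tables filled with made-up rows so that the example runs without osquery installed.

```python
import sqlite3

# Stand-in for osquery: ordinary in-memory tables with made-up rows.
# In osquery these would be virtual tables generated at query time.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE processes (pid INTEGER, name TEXT, uid INTEGER);
    CREATE TABLE process_open_sockets (
        pid INTEGER, remote_address TEXT, remote_port INTEGER);
    INSERT INTO processes VALUES
        (1, 'launchd', 0), (200, 'sshd', 0), (300, 'curl', 501);
    INSERT INTO process_open_sockets VALUES
        (1, '127.0.0.1', 0),
        (200, '203.0.113.7', 51514),
        (300, '198.51.100.9', 443);
""")

# Root-owned processes (uid 0) whose sockets point at a non-local address.
ROOT_REMOTE_SOCKETS = """
    SELECT DISTINCT p.pid, p.name, s.remote_address, s.remote_port
    FROM processes AS p
    JOIN process_open_sockets AS s ON p.pid = s.pid
    WHERE p.uid = 0
      AND s.remote_address NOT IN ('', '0.0.0.0', '127.0.0.1', '::', '::1')
"""

rows = conn.execute(ROOT_REMOTE_SOCKETS).fetchall()
print(rows)
```

The join, the filter, and the DISTINCT are all handled by SQLite; the only thing a data source has to do is produce rows.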


A complex osquery query: find root processes with socket connections open to non-local hosts.

Virtual Tables

Virtual Tables are the meat of osquery. They gather all the data that we serve up for analytics. Most virtual tables generate their data at query time, by parsing a file or calling a system API.

Tables are defined via a DSL implemented in Python. The osquery build system will read the table definition files, utilizing the directory hierarchy to determine which platforms they support, and then hook up all the plumbing for SQLite to dynamically retrieve data from the table.

At query time, the SQLite query engine will request the virtual table to generate data. The osquery code translates the SQLite table constraints into a form that the virtual table implementation can use to optimize (or entirely determine) which APIs/files it accesses to generate the data.

For example, take a simple virtual table like etc_hosts. This simply parses the /etc/hosts file and outputs each entry as a separate row. There’s little need for the virtual table implementation to receive the query parameters, as it will read the entire file in any case. After the virtual table generates the data, the SQLite engine performs any filtering provided in the WHERE clause.

A table like users can take advantage of the query context. The users table will check if uid or username are specified in the constraints, and use that to only load the metadata for the relevant users rather than doing a full enumeration of users. It would be fine for this table to ignore the constraints and simply allow SQLite to do the filtering, but we gain a slight performance advantage by only generating the requested data. In other instances, this performance difference could be much more extreme.

A final type of table must look at the query constraints to do any work at all. Take the hash table, which calculates hashes of the referenced files. Without any constraint this table will not know which files to operate on (because it would be disastrous to try to hash every file on the system), and so will return no results.
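The three behaviors described above can be sketched side by side. This is a simplified model, not osquery's C++ implementation: the function names, the `constraints` dictionary (column name mapped to a list of requested values), and the `user_db` argument are all hypothetical stand-ins for the query context that osquery passes to a table.

```python
import hashlib

def gen_etc_hosts(constraints, hosts_text):
    """Ignores constraints: parses every line and leaves filtering to SQLite."""
    rows = []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        address, *hostnames = line.split()
        rows.append({"address": address, "hostnames": " ".join(hostnames)})
    return rows

def gen_users(constraints, user_db):
    """Uses a uid constraint, when present, to avoid a full enumeration."""
    wanted = constraints.get("uid")
    if wanted is not None:
        return [user_db[uid] for uid in wanted if uid in user_db]
    return list(user_db.values())  # no constraint: enumerate everyone

def gen_hash(constraints):
    """Requires a path constraint; without one it returns no rows."""
    rows = []
    for path in constraints.get("path", []):
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        rows.append({"path": path, "sha256": digest})
    return rows

if __name__ == "__main__":
    users = {0: {"uid": 0, "username": "root"},
             501: {"uid": 501, "username": "alice"}}
    print(gen_etc_hosts({}, "127.0.0.1 localhost  # loopback\n"))
    print(gen_users({"uid": [0]}, users))
    print(gen_hash({}))  # no path constraint, so nothing is hashed
```

In every case SQLite would still apply the WHERE clause to whatever rows come back, so a table that over-generates is slower but not wrong.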

The osquery developers have put a great deal of effort into making virtual table creation easy for community contributors. Create a simple spec file (using a custom DSL built in Python) and implement the table in C++ (or C/Objective-C as necessary). The build system will automatically hook things up so that the new table has full interoperability with all of the existing tables in the osquery ecosystem.

Schema file for etc_hosts table
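A spec for a table such as etc_hosts looks roughly like the sketch below. The lower half approximates the table definition and may differ in detail from the file in the osquery repository. The upper half is not part of osquery at all: it is a set of hypothetical stand-in definitions for the DSL names (which the osquery build tooling normally supplies) so that the file can be run on its own.

```python
# Hypothetical stand-ins for the spec DSL. They only record what the spec
# declares, so that this file is runnable outside the osquery build.
TEXT = "TEXT"
spec = {}

def table_name(name, aliases=()):
    spec["name"] = name
    spec["aliases"] = list(aliases)

def description(text):
    spec["description"] = text

def Column(name, col_type, text=""):
    return {"name": name, "type": col_type, "description": text}

def schema(columns):
    spec["columns"] = columns

def implementation(target):
    # "file@function": names the C++ function that generates the rows.
    spec["implementation"] = target

# The spec itself: table name, description, columns, and implementation.
table_name("etc_hosts", aliases=["hosts"])
description("Line-parsed /etc/hosts.")
schema([
    Column("address", TEXT, "IP address mapping"),
    Column("hostnames", TEXT, "Raw hosts mapping"),
])
implementation("etc_hosts@genEtcHosts")

print(spec["name"], [c["name"] for c in spec["columns"]])
```

The spec is purely declarative: it states what the table looks like and which function produces its rows, and leaves the SQLite plumbing to generated code.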

Event System

Not all of the data exposed by osquery fits well into the model of generating on-the-fly when the table is queried. Take for example the common problem of file integrity monitoring (FIM). If we schedule a query to run every 5 minutes to capture the hash of important files on the system, we might miss an interval where an attacker changed that file and then reverted the change before our next scan. We need continuous visibility.

To solve problems like this, osquery has an event publisher/subscriber system that can generate, filter and store data to be exposed when the appropriate virtual table is queried. Event publishers run in their own thread and can use whatever APIs they need to create a stream of events to publish. For FIM on Linux, the publisher generates events through inotify. It then publishes the events to one or more subscribers, which can filter and store the data (in RocksDB) as they see fit. Finally, when a user queries an event-based table, the relevant data is pulled from the store and run through the same SQLite filtering system as any other table results.
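The publish, filter, store, and query steps can be modeled in a few lines. This is a toy sketch under stated assumptions: the class and function names are hypothetical, a plain list of dictionaries stands in for the inotify event stream, and a Python dict stands in for RocksDB.

```python
import threading

class FileEventSubscriber:
    """Keeps only events under a watched prefix and buffers them in a store."""

    def __init__(self, watched_prefix, store):
        self.watched_prefix = watched_prefix
        self.store = store  # a dict standing in for RocksDB
        self._lock = threading.Lock()

    def on_event(self, event):
        if not event["path"].startswith(self.watched_prefix):
            return  # filtered out: not a path this subscriber cares about
        with self._lock:
            self.store.setdefault("file_events", []).append(event)

class Publisher(threading.Thread):
    """Runs in its own thread and fans each raw event out to subscribers."""

    def __init__(self, raw_events, subscribers):
        super().__init__(daemon=True)
        self.raw_events = raw_events  # stand-in for a live inotify stream
        self.subscribers = subscribers

    def run(self):
        for event in self.raw_events:
            for subscriber in self.subscribers:
                subscriber.on_event(event)

def query_file_events(store):
    """What an event-based table would hand to SQLite when queried."""
    return list(store.get("file_events", []))

store = {}
subscriber = FileEventSubscriber("/etc/", store)
publisher = Publisher(
    [
        {"path": "/etc/passwd", "action": "UPDATED", "time": 1},
        {"path": "/tmp/scratch", "action": "CREATED", "time": 2},
        {"path": "/etc/passwd", "action": "UPDATED", "time": 3},
    ],
    [subscriber],
)
publisher.start()
publisher.join()
print(query_file_events(store))
```

Both changes to /etc/passwd are retained, including a change that is later reverted, which is exactly the case a periodic hash query could miss.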

Scheduler

Some very careful design considerations went into the osquery scheduler. Consider deploying osquery on a massive scale, such as the more than 1 million production hosts in Facebook’s fleet. It could be a huge problem if each of these hosts ran the same query at exactly the same time and caused a simultaneous spike in resource usage. So the scheduler provides a randomized “splay”, allowing queries to run on an approximate rather than exact interval. This simple design prevents resource spikes across the fleet.
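A splay can be as simple as shifting the configured interval by a bounded random amount. The sketch below assumes a 10 percent bound and a hypothetical helper name; it illustrates the idea rather than reproducing osquery's implementation.

```python
import random

def splayed_interval(interval, splay_percent=10, rng=random):
    """Return `interval` shifted randomly by at most splay_percent percent."""
    max_shift = interval * splay_percent // 100
    if max_shift <= 0:
        return interval  # interval too small (or splay disabled) to shift
    return interval + rng.randint(-max_shift, max_shift)

# Each simulated host draws its own value, so a "300 second" query ends up
# somewhere between 270 and 330 seconds depending on the host.
fleet = [splayed_interval(300, rng=random.Random(host_id)) for host_id in range(5)]
print(fleet)
```

Because each host draws independently, the fleet's executions spread out over a window instead of piling up on the same second.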

It is also important to note that the scheduler doesn’t operate on clock time, but rather ticks from the running osquery process. On a server (that is never in sleep mode), this will effectively be clock time. On a laptop (often sleeping when the user closes the lid), osquery will only tick while the computer is active, and therefore scheduler time will not correspond well with clock time.
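The difference between tick time and clock time shows up clearly in a small simulation. The function below is a hypothetical model, not osquery code: each element of `awake_by_second` represents one wall-clock second, and the scheduler only advances its tick counter during seconds when the process is running.

```python
def simulate_schedule(queries, awake_by_second):
    """queries maps a query name to its interval in ticks.

    Returns (wall_clock_second, query_name) pairs for every execution.
    """
    tick = 0
    runs = []
    for wall_second, awake in enumerate(awake_by_second, start=1):
        if not awake:
            continue  # host is asleep: wall clock advances, ticks do not
        tick += 1
        for name, interval in queries.items():
            if tick % interval == 0:
                runs.append((wall_second, name))
    return runs

always_on = simulate_schedule({"q": 3}, [True] * 9)
laptop = simulate_schedule(
    {"q": 3}, [True, True, False, False, False, True, True, True, True])
print(always_on)
print(laptop)
```

On the always-on host the query runs at seconds 3, 6 and 9. On the sleeping host the same three-tick interval first fires at second 6, because three seconds of sleep contributed no ticks.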

Diff Engine

In order to optimize for large scale and bubble up the most relevant data, osquery provides facilities for outputting differential query results. Each time a query runs, the results of that query are stored in the internal RocksDB store. When logs are output, the results of the current run are compared with the stored results of the previous run, and a log of the added/removed rows can be provided.

This is optional: queries can also be run in “snapshot” mode, in which the results are not stored and the entire set of query results is output on each scheduled run of the query.
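A differential result boils down to a set difference between the stored rows and the current rows. The sketch below is a minimal model with hypothetical function names, a dict standing in for the RocksDB store, and made-up rows; it assumes every column value is a string.

```python
def diff_results(previous, current):
    """Compare two result sets (lists of row dicts) by whole-row equality."""
    prev = {tuple(sorted(row.items())) for row in previous}
    curr = {tuple(sorted(row.items())) for row in current}
    return {
        "added": [dict(row) for row in sorted(curr - prev)],
        "removed": [dict(row) for row in sorted(prev - curr)],
    }

def run_differential(store, query_name, current):
    """Diff against the last stored results, then store the new results."""
    previous = store.get(query_name, [])
    store[query_name] = current
    return diff_results(previous, current)

store = {}  # stand-in for the RocksDB backing store
first = run_differential(store, "listening_ports", [
    {"port": "22", "name": "sshd"},
    {"port": "80", "name": "nginx"},
])
second = run_differential(store, "listening_ports", [
    {"port": "22", "name": "sshd"},
    {"port": "4444", "name": "nc"},
])
print(second)
```

The first run reports every row as added, since nothing was stored yet. Later runs report only what changed, which is what keeps log volume manageable across a large fleet. Snapshot mode corresponds to skipping the store and emitting `current` in full each time.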

RocksDB

Though much of the data that osquery presents is dynamically generated from the system state at query time, there are a myriad of contexts in which the agent stores data. For example, the events system needs a backing store in which to buffer events between runs of the scheduled queries.

To achieve this, osquery utilizes another Facebook open
