
Why top and free inside containers don't show the correct container memory


Hey,

Something that is very common to get wrong when starting with Linux containers is to expect that free and other tools like top will report the container's memory limits.



Here you’ll not only go through why that happens and how to get it right, but also take a look at where the Kernel looks for information when you ask it for memory statistics.

Also, if you’re curious about how the code that keeps track of per-cgroup page counters looks, stick around until the end!

This is the third article in a series of 30 articles around procfs: A Month of /proc.

If you’d like to keep up to date with it, make sure you join the mailing list !

Running top within a container
How the top and free tools gather memory statistics
Setting container limits
Memory limits set by cgroups are not namespaced
Who’s controlling the allocation of memory?
Tracing a cgroup running out of memory

Running top within a container

To get a testbed for the rest of the article, consider the case of running a single container with a memory limit of 10MB in a system that has 2GB of RAM available:

# Check the amount of memory available
# outside the container (i.e., in the host)
free -h
              total        used        free   available
Mem:           1.9G        312M        385M        1.5G

# Define the total number of bytes that
# will dictate the memory limit of the
# container.
MEM_MAX="$((1024 * 1024 * 10))"

# Run a container using the ubuntu image
# as its base image, with the memory limit
# set to 10MB, and a tty as well as interactive
# support.
docker run \
    --interactive \
    --tty \
    --memory $MEM_MAX \
    ubuntu

With the container running, we can now check the results of executing top in there:

top -bn1
Tasks:   2 total,   1 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.2 us,  0.1 sy,  0.0 ni, 99.7 id,  0.1 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  2040940 total,   117612 free,   651204 used,  1272124 buff/cache   <-- not really what we
KiB Swap:        0 total,        0 free,        0 used.  1196972 avail Mem        expect: that is 2GB!!

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    1 root      20   0   18508   3432   3016 S   0.0  0.2   0:00.02 bash
   12 root      20   0   36484   3104   2748 R   0.0  0.2   0:00.00 top

As outlined before, this is not what one would typically expect: top reports the total memory as seen by the host, without showing the 10MB limit at all.

What about free ? Same thing:

free -h
              total        used        free   available
Mem:           1.9G        612M        131M        1.2G
Swap:            0B          0B          0B
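
By the way, if what you are after is the container’s actual limit and usage, a more reliable place to look from inside the container is the cgroup filesystem itself. Here is a minimal sketch, assuming cgroup v1 with the memory controller mounted at /sys/fs/cgroup/memory (the setup used throughout this article):

# From inside the container: the limit we set via `--memory`
# and the current usage, as accounted by the memory cgroup.
cat /sys/fs/cgroup/memory/memory.limit_in_bytes    # should print 10485760 (our 10MB)
cat /sys/fs/cgroup/memory/memory.usage_in_bytes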

How the top and free tools gather memory statistics

If we inspect the syscalls used by both top and free, we can see that they’re making use of plain open(2) and read(2) calls:

# Check what are the syscalls being
# used by `free`
strace -f free
...
openat(AT_FDCWD, "/proc/meminfo", O_RDONLY) = 3
read(3, "MemTotal:        2040940 kB\nMemF"..., 8191) = 1307   <-- that is 2GB!
...

# Check what are the syscalls being used
# by `top`
strace -f top -p 19282 -bn1
...
openat(AT_FDCWD, "/proc/meminfo", O_RDONLY) = 5
lseek(5, 0, SEEK_SET) = 0
read(5, "MemTotal:        2040940 kB\nMemF"..., 8191) = 1307   <-- 2GB again
...

Looking at those return values (what was read), we can spot that the “problem” is coming from /proc/meminfo, which free and top are just blindly trusting.

Before we go check what the Kernel is doing when reporting those values, let’s quickly remember how a container gets memory limits set.

Setting container limits

The way that Docker (ok, runc ) ends up setting the container limits is via the use of cgroups .

As very well documented in the man page (see man 7 cgroups):

Control groups, usually referred to as cgroups, are a Linux kernel feature which allows processes to be organized into hierarchical groups whose usage of various types of resources can then be limited and monitored.
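
In the Docker case from our testbed, assuming cgroup v1 and Docker’s default cgroupfs driver (paths differ under the systemd driver or cgroup v2), the value passed to --memory ends up written to a file under the container’s memory cgroup on the host, along these lines:

# On the host: the per-container memory cgroup created by Docker
# (<container-id> stands for the full container ID).
cat /sys/fs/cgroup/memory/docker/<container-id>/memory.limit_in_bytes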

To see that in action, consider the following program that allocates memory in chunks of 1MB:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MEGABYTE (1 << 20)
#define ALLOCATIONS 20

/**
 * alloc - a "leaky" program that just allocates
 *         a predefined amount of memory and then
 *         exits.
 */
int
main(int argc, char** argv)
{
    printf("allocating: %dMB\n", ALLOCATIONS);

    void* p;
    int   i = ALLOCATIONS;

    while (i-- > 0) {
        // Allocate 1MB (not initializing it
        // though).
        p = malloc(MEGABYTE);
        if (p == NULL) {
            perror("malloc");
            return 1;
        }

        // Explicitly initialize the area that
        // has been allocated.
        memset(p, 65, MEGABYTE);
        printf("remaining\t%d\n", i);
    }

    return 0;
}
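
Assuming the program above is saved as alloc.c, it can be compiled with something like:

# Build the sample allocator (the file name is an assumption).
gcc -Wall -o alloc.out alloc.c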

We can see that, without any limits in place, we can keep allocating until the 20MB mark gets reached without any problems.

# Keep allocating memory until the 20MB
# mark gets reached.
./alloc.out
allocating: 20MB
remaining   19
remaining   18
...
remaining   1
remaining   0

That changes after we put our process under a cgroup with memory limits set:

# Create our custom cgroup
mkdir /sys/fs/cgroup/memory/custom-group

# Configure the maximum amount of memory
# that all of the processes in such cgroup
# will be able to allocate
echo "$((1024 * 1024 * 10))" > \
    /sys/fs/cgroup/memory/custom-group/memory.limit_in_bytes

# Put the current process tree under such
# cgroup
echo $$ > \
    /sys/fs/cgroup/memory/custom-group/tasks

# Try to allocate the 20MB
./alloc.out
allocating: 20MB
remaining   19
remaining   18
remaining   17
remaining   16
remaining   15
remaining   14
remaining   13
remaining   12
Killed
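
Aside from the limit, the v1 memory controller keeps its accounting right next to it, which is worth a peek before we turn to dmesg (paths follow the custom-group created above):

# Current usage, the peak usage reached, and how many times a
# charge failed against the limit.
cat /sys/fs/cgroup/memory/custom-group/memory.usage_in_bytes
cat /sys/fs/cgroup/memory/custom-group/memory.max_usage_in_bytes
cat /sys/fs/cgroup/memory/custom-group/memory.failcnt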

Looking at the results from dmesg , we can see what happened:

[181346.109904] alloc.out invoked oom-killer:        <-- our thing getting killed!
[181346.109906] alloc.out cpuset=/ mems_allowed=0
[181346.109911] CPU: 0 PID: 22074 Comm: alloc.out Not tainted 4.15.0-36-generic #39-Ubuntu
[181346.109911] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[181346.109912] Call Trace:
[181346.109918]  dump_stack+0x63/0x8b
[181346.109920]  dump_header+0x71/0x285
[181346.109923]  oom_kill_process+0x220/0x440
[181346.109924]  out_of_memory+0x2d1/0x4f0
[181346.109926]  mem_cgroup_out_of_memory+0x4b/0x80
[181346.109928]  mem_cgroup_oom_synchronize+0x2e8/0x320
[181346.109930]  ? mem_cgroup_css_online+0x40/0x40
[181346.109931]  pagefault_out_of_memory+0x36/0x7b
[181346.109934]  mm_fault_error+0x90/0x180
[181346.109935]  __do_page_fault+0x4a5/0x4d0
[181346.109937]  do_page_fault+0x2e/0xe0
[181346.109940]  ? page_fault+0x2f/0x50
[181346.109941]  page_fault+0x45/0x50
...
[181346.109950] Task in /custom-group killed as a result of limit of /custom-group    <-- Killed!
[181346.109954] memory: usage 10240kB, limit 10240kB, failcnt 56
[181346.109954] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
[181346.109955] kmem: usage 940kB, limit 9007199254740988kB, failcnt 0
[181346.109955] Memory cgroup stats for /custom-group: cache:0KB rss:9300KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:9248KB inactive_file:0KB active_file:0KB unevictable:0KB
[181346.109965] [ pid ]   uid  tgid  total_vm      rss  pgtables_bytes  swapents  oom_score_adj  name
[181346.110005] [21530]     0  21530      5837     1381           90112         0              0  bash
[181346.110011] [22074]     0  22074      3440     2594           69632         0              0  alloc.out
[181346.110012] Memory cgroup out of memory: Kill process 22074 (alloc.out) score 989 or sacrifice child
[181346.318942] Killed process 22074 (alloc.out) total-vm:13760kB, anon-rss:8988kB, file-rss:1388kB, shmem-rss:0kB
[181346.322003] oom_reaper: reaped process 22074 (alloc.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

So we can see pretty well that limits are being enforced.

Again, why is /proc telling us that we have 2GB of memory?

Memory limits set by cgroups are not namespaced

The reason is that the memory information retrieved from /proc/meminfo is not namespaced.

Unlike other things, such as listing PIDs from /proc, when the file_operations that procfs implements reach the point of gathering memory information, they don’t acquire a namespaced view of it.

For instance, let’s compare the way contents are gathered when listing /proc/ (the directory entries) with the way they’re gathered for /proc/meminfo.

In the case of listing /proc (see How is /proc able to list process IDs? ), we can see procfs taking the namespace reference and using it:

int proc_pid_readdir(struct file *file, struct dir_context *ctx)
{
    // Takes the namespace as seen by the file
    // provided.
    struct pid_namespace *ns = file_inode(file)->i_sb->s_fs_info;

    // ...

    // Iterates through the next available tasks
    // (processes) as seen by the namespace that
    // we are within.
    for (iter = next_tgid(ns, iter);
         iter.task;
         iter.tgid += 1, iter = next_tgid(ns, iter)) {
        // ...
    }

    // ...
}
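
That namespaced view is easy to confirm from userspace: inside the testbed container, listing /proc only turns up the container’s own processes (with the container’s bash showing up as PID 1), while on the host the same listing shows everything:

# Inside the container, this only lists the container's own
# processes; on the host, the very same command lists every
# PID on the system.
ls -d /proc/[0-9]*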

Meanwhile, in the case of reading /proc/meminfo , that doesn’t happen at all (well, as expected, it’s not about namespaces! It’s about cgroups):

static int meminfo_proc_show(struct seq_file *m, void *v)
{
    struct sysinfo i;

    // ...

    // Populate the sysinfo struct with memory-related
    // stuff
    si_meminfo(&i);

    // Add swap information
    si_swapinfo(&i);

    // ... start displaying
    show_val_kb(m, "MemTotal:       ", i.totalram);
    show_val_kb(m, "MemFree:        ", i.freeram);

    // ...
}

As expected, no single reference to namespaces (or cgroups).

Also, si_meminfo, the method that fills the sysinfo struct with the global values that end up in /proc/meminfo, has no idea about cgroups either:

/**
 * The struct that holds part of the memory information
 * that ends up being displayed in the end.
 */
struct sysinfo {
    __kernel_long_t  uptime;      /* Seconds since boot */
    __kernel_ulong_t loads[3];    /* 1, 5, and 15 minute load averages */
    __kernel_ulong_t totalram;    /* Total usable main memory size */
    __kernel_ulong_t freeram;     /* Available memory size */
    __kernel_ulong_t sharedram;   /* Amount of shared memory */
    __kernel_ulong_t bufferram;   /* Memory used by buffers */
    __kernel_ulong_t totalswap;   /* Total swap space size */
    __kernel_ulong_t freeswap;    /* swap space still available */
    __u16 procs;                  /* Number of current processes */
    __u16 pad;                    /* Explicit padding for m68k */
    __kernel_ulong_t totalhigh;   /* Total high memory size */
    __kernel_ulong_t freehigh;    /* Available high memory size */
    __u32 mem_unit;               /* Memory unit size in bytes */
    char _f[20-2*sizeof(__kernel_ulong_t)-sizeof(__u32)];  /* Padding: libc5 uses this.. */
};

/**
 * Fills the `sysinfo` struct passed as a pointer
 * with values collected from the system (globally
 * set).
 */
void si_meminfo(struct sysinfo *val)
{
    val->totalram  = totalram_pages;
    val->sharedram = global_node_page_state(NR_SHMEM);
    val->freeram   = global_zone_page_state(NR_FREE_PAGES);
    val->bufferram = nr_blockdev_pages();
    val->totalhigh = totalhigh_pages;
    val->freehigh  = nr_free_highpages();
    val->mem_unit  = PAGE_SIZE;
}

Interesting fact: totalram_pages (reported from MemTotal ) can change - see this StackOverflow question: Why does MemTotal in /proc/meminfo change? .
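
Those same global values also back the sysinfo(2) syscall, so you don’t even need /proc to see the un-namespaced totals. A minimal userspace sketch (run it inside our memory-limited container and it still reports the host’s ~2GB):

#include <stdio.h>
#include <sys/sysinfo.h>

int main(void)
{
    struct sysinfo info;

    // sysinfo(2) is filled from the same global counters that
    // back /proc/meminfo - no cgroup awareness whatsoever.
    if (sysinfo(&info) != 0) {
        perror("sysinfo");
        return 1;
    }

    // mem_unit is the size, in bytes, of each memory unit reported.
    printf("total RAM: %lu MB\n",
           (info.totalram * info.mem_unit) / (1024 * 1024));
    printf("free  RAM: %lu MB\n",
           (info.freeram * info.mem_unit) / (1024 * 1024));

    return 0;
}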

Who’s controlling the allocation of memory?

If you’re now wondering where we end up reaching that limit that we set in the cgroup, we need to look at the path that a memory allocation takes.

alloc.out (our process)
 |
 *--> task_struct (process descriptor)
       |
       *--> mm_struct (memory descriptor)
             |
             *--> mem_cgroup
                   |
                   +--> page_counter memory
                   |     |
                   |     *--> { atomic_long_t count, unsigned long limit }
                   |
                   *--> page_counter swap

Within the Kernel, each process created (in our case, alloc.out ) is referenced internally via a process descriptor task_struct :

struct task_struct {
    struct thread_info thread_info;

    // ...

    unsigned int cpu;

    struct mm_struct *mm;

    // ...
};

Such process descriptor references a memory descriptor mm defined as mm_struct :

struct mm_struct {
    struct vm_area_struct *mmap;    /* list of VMAs */
    unsigned long mmap_base;        /* base of mmap area */
    unsigned long task_size;        /* size of task vm space */

    // ...

#ifdef CONFIG_MEMCG
    struct mem_cgroup *mem_cgroup;
#endif
};

Such memory descriptor references a mem_cgroup , a data structure that keeps track of the cgroup semantics for memory limiting and accounting:

struct mem_cgroup {
    struct cgroup_subsys_state css;

    /* Private memcg ID. Used to ID objects that outlive the cgroup */
    struct mem_cgroup_id id;

    /* Accounted resources */
    struct page_counter memory;
    struct page_counter swap;

    // ...
};

Such cgroup data structure then references some page counters ( memory and swap , for instance) defined via the page_counter struct , which are responsible for keeping track of usage and providing the limiting functionality when someone tries to acquire a page:

struct page_counter {
    atomic_long_t count;
    unsigned long limit;

    // The parent cgroup (remember, cgroups are
    // hierarchical!)
    struct page_counter *parent;

    // ...
};

Whenever a process needs some pages assigned to it, page_counter_try_charge walks the cgroup memory hierarchy, trying to charge a given number of pages against each counter: in case of success (the new value stays below the limit), it updates the counts; otherwise, it triggers the OOM behavior.
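
To make that charging logic a bit more concrete, here is a toy, single-threaded model of the idea - not the kernel code (the real page_counter_try_charge uses atomics, reports which counter in the hierarchy failed, and is considerably more careful), just a sketch of the walk-up-and-charge-or-roll-back behavior:

#include <stdbool.h>
#include <stdio.h>

// A toy counter: current count, limit, and a pointer to the parent
// (remember, cgroups are hierarchical).
struct counter {
    long count;
    long limit;
    struct counter *parent;
};

// Try to charge `nr_pages` against `c` and every one of its ancestors.
// If any level would go over its limit, undo the levels already
// charged and report failure (which, in the kernel, is what can end
// up triggering the cgroup OOM path).
bool try_charge(struct counter *c, long nr_pages)
{
    for (struct counter *it = c; it != NULL; it = it->parent) {
        if (it->count + nr_pages > it->limit) {
            for (struct counter *undo = c; undo != it; undo = undo->parent)
                undo->count -= nr_pages;
            return false;
        }
        it->count += nr_pages;
    }

    return true;
}

int main(void)
{
    struct counter root  = { .count = 0, .limit = 1 << 20, .parent = NULL };
    struct counter child = { .count = 0, .limit = 2560,    .parent = &root };

    // Mirrors what we'll see in the trace below: charging works until
    // the child's 2560-page limit is hit.
    printf("charge 2560 pages: %s\n", try_charge(&child, 2560) ? "ok" : "fail");
    printf("charge 1 page:     %s\n", try_charge(&child, 1)    ? "ok" : "fail");

    return 0;
}

The rollback on failure is what keeps the hierarchy consistent: a child never holds pages that its ancestors haven't accounted for.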

Using bcc to trace page_counter_try_charge , we can see how the act of page_fault ing leads to mem_cgroup_try_charge calling page_counter_try_charge :

25641  25641  alloc.out  page_counter_try_charge
        page_counter_try_charge+0x1 [kernel]
        mem_cgroup_try_charge+0x93 [kernel]
        handle_pte_fault+0x3e3 [kernel]
        __handle_mm_fault+0x478 [kernel]
        handle_mm_fault+0xb1 [kernel]
        __do_page_fault+0x250 [kernel]
        do_page_fault+0x2e [kernel]
        page_fault+0x45 [kernel]

Tracing a cgroup running out of memory

If we’re even more curious and decide to trace the page_counter_try_charge arguments, we can see the charges failing when we’re inside a memory-limited cgroup (like a container) and try to grab more memory than we’re allowed to.

Using bpftrace, we’re able to tailor a small program that inspects the page_counter used in page_counter_try_charge and see how the count approaches the limit over time (until the point where we reach exhaustion and receive an OOM).

#include <linux/page_counter.h>

BEGIN
{
    printf("Tracing page_counter_try_charge... Hit Ctrl-C to end.\n");
    printf("%-8s %-6s %-16s %-10s %-10s %-10s\n",
        "TIME", "PID", "COMM", "REQUESTED", "CURRENT", "LIMIT");
    @epoch = nsecs;
}

kprobe:page_counter_try_charge
{
    $pcounter  = (page_counter*)arg0;
    $limit     = $pcounter->limit;
    $current   = $pcounter->count.counter;
    $requested = arg1;

    printf("%-8d %-6d %-16s %-10ld %-10ld %-10ld\n",
        (nsecs - @epoch) / 1000000,
        pid,
        comm,
        $requested,
        $current,
        $limit);
}

Running the tracer with a shell session put into the cgroup that limits our memory, we can see it running out of pages:

sudo bpftrace ./try-charge-counter.d
Attaching 2 probes...
Tracing page_counter_try_charge... Hit Ctrl-C to end.
TIME     PID    REQUESTED  CURRENT    LIMIT
...
3301     25980  32         1288       2560
3302     25980  32         1320       2560
...
3307     25980  1          2553       2560
3307     25980  32         2554       2560
3307     25980  1          2554       2560   <-- still possible to increase
3308     25980  32         2555       2560       the number of pages
3308     25980  1          2555       2560
3308     25980  32         2556       2560
3308     25980  1          2556       2560
3308     25980  32         2557       2560
3308     25980  1          2557       2560
3308     25980  32         2558       2560
...
3308     25980  1          2558       2560
3308     25980  32         2559       2560
3308     25980  1          2559       2560
3308     25980  32         2560       2560   <-- LIMIT REACHED
3308     25980  1          2560       2560   <-- whoopsy, can't allocate
3308     25980  1          2560       2560       anymore!

Closing thoughts

Although I’d understood that meminfo wasn’t namespaced, it wasn’t clear to me why.

Going through the exercise of tailoring a quick program to inspect the arguments passed to page_counter_try_charge was very interesting (and easier than I thought!).

Shout out to bpftrace once again for allowing us to go deep into the Kernel with ease!

If you have any further questions, or just want to connect, let me know! I’m cirowrc on Twitter.

Have a good one!

Resources

Understanding the Linux Kernel, 3rd Ed
The C Programming Language
The Linux Programming Interface
