Mobycast | Virtual Machines vs. Containers Revisited

Containers are just lightweight virtual machines, right? No, not really. There's much more to the story than that, so we decided to do a four-part series on virtual machines versus containers.

In parts 1 and 2, we discussed virtual machines in detail and how they work. Now, in parts 3 and 4, we turn our attention to containers. Turns out, containers are not very complicated. They are just normal Linux processes with some isolation superpowers.

In today's episode of Mobycast, Jon and Chris go into depth on containers, their history, and the underlying operating system technologies that make them possible. If you've ever wondered why you can't run Windows containers on a Linux host, this episode will clear up the mystery.

Show Notes

In this episode, we cover the following topics:

Operating-system-level virtualization = containers
- Allows the resources of a computer to be partitioned via the kernel
  - All containers share a single kernel with each other AND with the host system
- Depend on their host OS to do all the communication and interaction with the physical machine
  - Containers don't need a hypervisor; they run directly on the host machine's kernel
- Containers use the underlying operating system's resources and drivers
  - This is why you cannot run containers built for a different OS on the same host system
    - i.e. Windows containers can run on Windows only, and Linux containers can run on Linux only
  - What we think of as different OSes (RHEL, CentOS, SUSE, Debian, Ubuntu) are not really different...
    - They are all the same core OS (Linux); they just differ in apps/files
- Based on the virtualization, isolation, and resource management mechanisms provided by the Linux kernel
  - namespaces
  - cgroups
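
As a rough illustration of the points above, a containerized process can be inspected like any other Linux process. The sketch below (Python, standard library only) reads the kernel release plus the namespace and cgroup membership of a process from /proc. The helper names kernel_release, namespaces_of, and cgroups_of are made up for this example; it assumes a Linux host with /proc mounted and returns empty results elsewhere.

```python
import os
import platform


def kernel_release():
    """Return the running kernel's release string.

    Inside a container this matches the host, because both share one kernel.
    """
    return platform.release()


def namespaces_of(pid="self"):
    """Map each entry in /proc/<pid>/ns to its identifier, e.g. 'pid:[4026531836]'."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    try:
        for name in sorted(os.listdir(ns_dir)):
            result[name] = os.readlink(os.path.join(ns_dir, name))
    except OSError:
        # Not Linux, /proc not mounted, or no permission: report nothing.
        return {}
    return result


def cgroups_of(pid="self"):
    """Return the non-empty lines of /proc/<pid>/cgroup (the process's cgroup membership)."""
    try:
        with open(f"/proc/{pid}/cgroup") as f:
            return [line.strip() for line in f if line.strip()]
    except OSError:
        return []


if __name__ == "__main__":
    print("kernel:", kernel_release())
    for name, ident in namespaces_of().items():
        print("namespace", name, "->", ident)
    for line in cgroups_of():
        print("cgroup:", line)
```

Running this once on the host and once inside a container on that host should print the same kernel release but different namespace identifiers, which is the "same kernel, isolated view" idea in practice.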
Container history
- FreeBSD Jails (2000)
  - BSD userland software that runs on top of the chroot(2) system call
    - chroot is used to change the root directory of a set of processes
  - Processes created in the chrooted environment cannot access files or resources outside of it
  - Jails virtualize access to the file system, the set of users, and the networking subsystem
  - A jail is characterized by four elements:
    - Directory subtree: the starting point from which a jail is entered
      - Once inside the jail, a process is not permitted to escape outside of this subtree
    - Hostname
    - IP address
    - Command: the path name of an executable to run inside the jail
  - Configured via jail.conf file
- LXC containers (2008)
  - Userspace interface for the Linux kernel features to contain processes, including:
    - Kernel namespaces (ipc, uts, mount, pid, network and user)
    - AppArmor and SELinux profiles
    - Seccomp policies
    - Chroots (using pivot_root)
    - Kernel capabilities
    - CGroups (control groups)
- Docker containers (2013)
  - Early versions of Docker used LXC as the container runtime
  - LXC was made optional in v0.9 (March 2014)
    - Replaced by libcontainer
    - libcontainer became the core of runC
  - LXC was dropped in v1.10 (February 2016)
Container technology
- Containers are just processes. So what makes them special?
- Namespaces
  - Restrict what you can SEE
  - Virtualize system resources, like the file system or networking
    - Makes it appear to processes within the namespace that they have their own isolated instance of the resource
    - Changes to the resource are visible only to processes that are members of the namespace
  - Processes inherit their namespaces from their parent
  - Linux provides the following namespaces:
    - IPC (interprocess communications)
      - CLONE_NEWIPC: Isolates System V IPC, POSIX message queues
    - Network
      - CLONE_NEWNET: Isolates network devices, stacks, ports, etc
    - Mount
      - CLONE_NEWNS: Isolates mount points
    - PID
      - CLONE_NEWPID: Isolates process IDs
    - User
      - CLONE_NEWUSER: Isolates user and group IDs
    - UTS (Unix Timesharing System)
      - CLONE_NEWUTS: Isolates hostname and NIS domain name
    - Cgroup
      - CLONE_NEWCGROUP: Isolates cgroup root directory
  - Syscall interface
    - A system call is the fundamental interface between an app and the Linux kernel
      - e.g. clone(2), unshare(2), and setns(2) are the kernel calls that create/enter namespaces for processes
- Control groups (cgroups)
  - Restrict what you can DO
  - Limit an application (container) to a specific set of resources like CPU and memory
  - Allow containers to share available hardware resources and optionally enforce limits and constraints
  - Creating, modifying, using cgroups is done through the cgroup virtual filesystem
  - Processes inherit their cgroups from their parent
    - Can be reassigned to different cgroups
  - Resources that can be controlled:
    - Memory
    - CPU / CPU cores
    - Devices
    - I/O
    - Processes
  - Using cgroups
    - To see mounted cgroups:
      - mount | grep cgroup
    - To create a new cgroup:
      - mkdir /sys/fs/cgroup/cpu/chris
    - To set "cpu.shares" to 512:
      - echo 512 > /sys/fs/cgroup/cpu/chris/cpu.shares
    - Now add a process to this cgroup:
      - echo <pid> > /sys/fs/cgroup/cpu/chris/cgroup.procs (replace <pid> with the ID of the process to add)
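
The shell commands above amount to creating a directory and writing to files, so the same steps can be sketched in Python. This is a minimal sketch assuming the cgroup v1 layout used in these notes (/sys/fs/cgroup/cpu and cpu.shares); the function names and the root parameter are invented for the example, and writing under /sys/fs/cgroup requires root. On hosts using cgroup v2, the CPU weight file is cpu.weight rather than cpu.shares.

```python
import os


def create_cgroup(name, root="/sys/fs/cgroup/cpu"):
    """Create a cgroup by making a directory in the cgroup filesystem."""
    path = os.path.join(root, name)
    os.makedirs(path, exist_ok=True)
    return path


def set_cpu_shares(cgroup_path, shares):
    """Set the relative CPU weight by writing to the cgroup v1 'cpu.shares' file."""
    with open(os.path.join(cgroup_path, "cpu.shares"), "w") as f:
        f.write(str(shares))


def add_process(cgroup_path, pid):
    """Move a process into the cgroup by writing its PID to 'cgroup.procs'."""
    with open(os.path.join(cgroup_path, "cgroup.procs"), "w") as f:
        f.write(str(pid))
```

For example, calling path = create_cgroup("chris"), then set_cpu_shares(path, 512), then add_process(path, os.getpid()) mirrors the three commands in the notes. The root parameter exists only so the functions can be pointed at a scratch directory for experimentation.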
Pseudocode: Creating a container
- Steps:
  - Create root filesystem for container
    - Spin up busybox in a Docker container, and then export its filesystem
  - Run "launcher" process that sets up "child" namespace
  - Launcher process forks new child process (now under new namespaces)
    - Child process then forks new process for container
      - chroot (to our root filesystem)
      - mount any other FS
      - set cgroups (e.g. apply CPU constraints)
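
The steps above can be sketched in Python by calling into libc through ctypes. This is a simplified illustration, not the code discussed in the episode: it uses a single fork after unshare(2) instead of two, joins the cgroup before the chroot (while /sys/fs/cgroup is still reachable), and all function names are invented for the example. It assumes a Linux host, root privileges, and a root filesystem directory that contains /proc and the command to run.

```python
import ctypes
import os
import sys

# Namespace flags, as defined in <linux/sched.h>.
CLONE_NEWNS = 0x00020000   # mount points
CLONE_NEWUTS = 0x04000000  # hostname
CLONE_NEWIPC = 0x08000000  # System V IPC, POSIX message queues
CLONE_NEWPID = 0x20000000  # process IDs
CLONE_NEWNET = 0x40000000  # network devices, stacks, ports

# Mount flags, as defined in <sys/mount.h>.
MS_REC = 0x4000
MS_PRIVATE = 0x40000


def namespace_flags():
    """Combine the namespaces the launcher asks the kernel to create."""
    return (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC
            | CLONE_NEWPID | CLONE_NEWNET)


def _libc():
    """Load libc lazily so that importing this file has no side effects."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.mount.argtypes = (ctypes.c_char_p, ctypes.c_char_p, ctypes.c_char_p,
                           ctypes.c_ulong, ctypes.c_void_p)
    return libc


def _check(ret):
    """Turn a non-zero return from libc into an OSError."""
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def run_container(rootfs, command, cgroup_dir=None):
    """Launcher: unshare namespaces, fork, then chroot and exec in the child.

    Returns the exit status of the command.
    """
    libc = _libc()
    _check(libc.unshare(namespace_flags()))
    pid = os.fork()  # the child is PID 1 inside the new PID namespace
    if pid == 0:
        try:
            if cgroup_dir:
                # Writing 0 to cgroup.procs moves the writing process itself.
                with open(os.path.join(cgroup_dir, "cgroup.procs"), "w") as f:
                    f.write("0")
            # Keep mounts made from here on from propagating to the host.
            _check(libc.mount(None, b"/", None, MS_REC | MS_PRIVATE, None))
            os.chroot(rootfs)
            os.chdir("/")
            # Mount a fresh /proc so tools like ps see only this namespace.
            _check(libc.mount(b"proc", b"/proc", b"proc", 0, None))
            os.execv(command[0], command)
        except Exception as exc:
            print("container setup failed:", exc, file=sys.stderr)
        os._exit(127)  # only reached if setup or exec failed
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)
```

Assuming a busybox root filesystem has been exported to ./rootfs (for instance with docker export), running run_container("./rootfs", ["/bin/sh"]) as root would start a shell that sees its own hostname, process tree, and mounts. Passing cgroup_dir="/sys/fs/cgroup/cpu/chris" would also place the shell in the cgroup created earlier in these notes.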

End Song
Bettie Black & Sophia - Something Beautiful

For a full transcription of this episode, please visit the episode webpage.

We'd love to hear from you! You can reach us at:

Web: https://mobycast.fm
Voicemail: 844-818-0993
Email: ask@mobycast.fm
Twitter: https://twitter.com/hashtag/mobycast

What is Mobycast?

A Podcast About Cloud Native Software Development, AWS, and Distributed Systems