Container_Docker
MicroSvcs container
Container concepts
Def: A container is a special kind of process: namespaces isolate what it can see, and Cgroups limit the amount of resources it can use.
Namespace
Def: Used to give a process its own separate view of system resources.
Categories of namespaces
Namespace: Separated resource
Cgroup: Cgroup root directory
IPC: System V IPC, POSIX message queues
Network: Network devices, ports
Mount: Mount points
PID: Process IDs
Time: Clocks
User: User IDs and group IDs
UTS: Machine name, host name
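On Linux, the namespaces a process belongs to are visible as symlinks under /proc/<pid>/ns. The following is a minimal sketch, assuming a Linux host, that lists them for the current process. The function name list_namespaces is hypothetical; on other systems it simply returns an empty dict.

```python
import os


def list_namespaces(pid="self"):
    """Return {namespace name: identifier} for a process, or {} if unavailable."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    try:
        entries = sorted(os.listdir(ns_dir))
    except OSError:
        return result  # not Linux, or no permission to inspect this process
    for name in entries:
        try:
            # Each entry is a symlink whose target looks like "net:[4026531992]".
            result[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            continue
    return result


for name, ident in list_namespaces().items():
    print(f"{name:20} {ident}")
```

Two processes are in the same namespace of a given type when these identifiers are equal, so running the sketch on the host and inside a container shows which views are separated.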
Compare with hypervisor separation
Cons of hypervisor
A hypervisor must run an independent guest OS for each virtual machine, which costs roughly 100~200MB of memory by itself.
User processes run inside the guest OS, and their operations on hardware resources must be intercepted by the hypervisor, resulting in a performance cost.
In contrast, since a container is just another process on the host, there isn't much performance cost.
Cons of container
Containers share the same kernel as the host.
A container that needs features of a newer kernel cannot run on a host with an older Linux kernel.
A Linux container cannot run directly on a Windows host, because there is no Linux kernel to share.
Many resources and objects cannot be separated using namespaces, such as the wall-clock time (the time namespace only offsets the monotonic and boot-time clocks).
If a process in a container calls settimeofday and changes the time, the host's time also changes.
Containers expose a larger attack surface than a hypervisor. Technologies such as seccomp can be used to restrict system calls, but they have a performance cost.
Commands
Internals
Cgroup
Def: Used to create resource constraints.
Categories
Blkio Cgroup
Question: How to guarantee the disk read/write performance when multiple containers read/write?
Disk performance criteria:
IOPS: Input/Output operations per second.
Throughput: Bandwidth in MB/s.
Relationship: Throughput = IOPS * blocksize
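A worked example of the relationship above. The numbers are illustrative only, and the helper name throughput_mib_per_s is hypothetical; the point is that the same IOPS figure gives very different bandwidth depending on block size.

```python
def throughput_mib_per_s(iops, block_size_bytes):
    """Throughput = IOPS * blocksize, expressed in MiB/s."""
    return iops * block_size_bytes / (1024 * 1024)


# 1000 operations per second with 4 KiB blocks
print(throughput_mib_per_s(1000, 4096))         # 3.90625
# 1000 operations per second with 1 MiB blocks
print(throughput_mib_per_s(1000, 1024 * 1024))  # 1000.0
```

This is why a limit usually has to be set on both criteria: a container doing small random I/O runs into the IOPS limit first, while one doing large sequential I/O runs into the throughput limit first.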
Def of Blkio Cgroup: A subsystem under Cgroup that limits and accounts for block device I/O.
Two Linux I/O modes:
Direct I/O
Buffered I/O
Cgroup v1 and v2
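Under Cgroup v1, each subsystem has its own independent hierarchy, so a process can belong to a different control group in each subsystem.
Under Cgroup v2, there is a single unified hierarchy: a process belongs to one control group, and that control group can enable multiple controllers (e.g. Blkio Cgroup + Memory Cgroup).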
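The difference shows up in /proc/<pid>/cgroup, where each line has the form hierarchy-ID:controller-list:path. Below is a small sketch that parses this format. The sample contents and the /docker/abc path are made up for illustration, and the function names are hypothetical.

```python
def parse_proc_cgroup(text):
    """Parse /proc/<pid>/cgroup content into (hierarchy_id, controllers, path) tuples."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        hierarchy_id, controllers, path = line.split(":", 2)
        names = [c for c in controllers.split(",") if c]
        entries.append((int(hierarchy_id), names, path))
    return entries


def is_v2_only(entries):
    # A pure cgroup v2 system shows one entry: hierarchy 0 with an empty controller list.
    return len(entries) == 1 and entries[0][0] == 0 and entries[0][1] == []


# Hypothetical v1 output: one line (and one hierarchy) per subsystem.
V1_SAMPLE = "5:blkio:/docker/abc\n4:memory:/docker/abc\n3:cpu,cpuacct:/docker/abc\n"
# Hypothetical v2 output: a single unified hierarchy.
V2_SAMPLE = "0::/docker/abc\n"

print(parse_proc_cgroup(V1_SAMPLE))
print(is_v2_only(parse_proc_cgroup(V2_SAMPLE)))
```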
Cons of Cgroup
The /proc directory exposes kernel status such as CPU and memory usage. But when the top command is run inside a container, it displays the host's information rather than the container's.
/proc does not have any knowledge of Cgroup limits.
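Because /proc does not reflect Cgroup limits, a program that wants to know its real memory budget has to read the Cgroup files itself. A minimal sketch, assuming Cgroup v2 mounted at /sys/fs/cgroup, where the memory.max file holds either a byte count or the literal "max" (no limit). The function names are hypothetical, and Cgroup v1 uses a different file layout that is not handled here.

```python
from pathlib import Path


def parse_memory_max(text):
    """Return the limit in bytes, or None when the file says "max" (unlimited)."""
    value = text.strip()
    return None if value == "max" else int(value)


def container_memory_limit(path="/sys/fs/cgroup/memory.max"):
    """Read the cgroup v2 memory limit; None if unlimited or not available."""
    try:
        return parse_memory_max(Path(path).read_text())
    except (OSError, ValueError):
        return None  # file missing (e.g. cgroup v1 or non-Linux) or unexpected content


print(container_memory_limit())
```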
Rootfs
Def
A directory on the system that looks like the standard root ( / ) of the operating system.
It contains the file system and configuration files.
It does not include the operating system kernel.
Typical files under rootfs
rootfs structure: All seven layers will be mounted under /var/lib/docker/aufs/mnt.
Lowest layers: read-only + whiteout.
Mid layers: used to store per-container config files such as /etc/hosts and /etc/resolv.conf. Keeping them in a separate layer means they are left out when docker commit is executed.
Higher layers: read-write. Used to store the incremental changes made when modifying the rootfs.
Mount points
Def: The Unix file system is organized into a tree structure. Storage devices are attached to specific locations in that tree. These locations are called mount points.
A mount point contains three parts:
The location in the tree
The access properties to the data at that point (for example, writability)
The source of the data mounted at that point (for example, a specific hard disk, USB device, or memory-backed virtual disk)
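These three parts can be seen in each line of /proc/mounts on Linux, which lists source, location, file system type, and options in that order. The sketch below parses one such line; the sample line is illustrative, and the MountPoint type and helper names are hypothetical.

```python
from typing import NamedTuple


class MountPoint(NamedTuple):
    source: str      # where the data comes from (device, tmpfs, ...)
    location: str    # where it is attached in the tree
    fs_type: str
    options: tuple   # access properties such as "rw" or "ro"


def parse_mount_line(line):
    """Parse one line in /proc/mounts format (the trailing dump/pass fields are ignored)."""
    source, location, fs_type, options = line.split()[:4]
    return MountPoint(source, location, fs_type, tuple(options.split(",")))


def is_writable(mount):
    return "rw" in mount.options


sample = "tmpfs /run tmpfs rw,nosuid,nodev,mode=755 0 0"
mount = parse_mount_line(sample)
print(mount.location, mount.source, is_writable(mount))
```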
UnionFS
Motivation
Without a container-specific file system, ordinary file systems such as XFS or ext4 would have to be used. With these, a full copy of the root file system would need to be downloaded for each container, resulting in much redundancy.
Implementation
UnionFS has many implementations, including AUFS and OverlayFS, both of which Docker has supported. OverlayFS has been part of the mainline Linux kernel since 3.18 and is now the default union file system for Docker containers.
OverlayFS is a modern union filesystem that is similar to AUFS, but faster and with a simpler implementation.
Lower layer is readonly.
Upper layer is writable and modifiable.
Limitations
For writes, it uses a Copy-On-Write mechanism: a file must be copied up to the writable layer before its first modification, which is inefficient. For reads, it needs to search the layers from the top down.
The RW layer has the same lifetime as the container. When the container is removed, the RW layer disappears. If you need to keep the contents of the RW layer, you can commit it to an image.
There is no mechanism for sharing data between containers.
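The read-only lower layers, the writable upper layer, and the Copy-On-Write behaviour can be modelled with plain dictionaries. This is a toy sketch, not how OverlayFS is implemented in the kernel: each layer is a dict of path to content, and the OverlayView class and WHITEOUT marker are hypothetical names.

```python
WHITEOUT = object()  # marker meaning "this path was deleted at this layer"


class OverlayView:
    def __init__(self, lower_layers):
        # lower_layers: read-only dicts, ordered from topmost to lowest
        self.lower_layers = lower_layers
        self.upper = {}  # the only layer that is ever modified

    def read(self, path):
        # Reads search from the top down and stop at the first hit.
        for layer in [self.upper, *self.lower_layers]:
            if path in layer:
                if layer[path] is WHITEOUT:
                    break
                return layer[path]
        raise FileNotFoundError(path)

    def append(self, path, data):
        # Copy-on-write: the whole file is copied up before it is changed.
        self.upper[path] = self.read(path) + data

    def delete(self, path):
        self.read(path)               # raises if the path is not visible
        self.upper[path] = WHITEOUT   # hide it without touching lower layers


base = {"/etc/os-release": "debian", "/bin/sh": "shell"}
app = {"/app/main.py": "print(1)"}
view = OverlayView([app, base])
view.append("/etc/os-release", " 12")
view.delete("/bin/sh")
print(view.read("/etc/os-release"), base["/etc/os-release"])
```

The lower layers are never modified, which is what lets many containers share one image; the price is the copy on first write and the top-down search on read.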
Storage quota
Question: How to set quota for a directory?
Solution:
Tag a project ID on the upperdir
Set XFS quota on the project.
Docker storage
Types
Bind mounts
Pros: Most straightforward and flexible.
Cons: Must explicitly specify a file path on the host.
Volumes
Pros: Work across disks and file systems; Docker manages volumes, so there is no need to worry about conflicts.
Cons: Data that already exists on the host cannot easily be shared with containers.
tmpfs mounts
Pros: High performance; secure.
Cons: Cannot be shared among multiple containers.
Bind mounts (host path)
Use case
Bind mounts are useful when the host provides a file or directory that is needed by a program running in a container, or when that containerized program produces a file or log that is processed by users or programs running outside containers
History: Bind mounts have existed since the Linux 2.4 kernel (2001).
Pros
Much more performant than a union file system.
Across file systems
Across disks
Cons
It ties otherwise portable container descriptions to the filesystem of a specific host.
It creates an opportunity for conflict with other containers
Command
Docker volumes
Use case: Using volumes is a method of decoupling storage from specialized locations on the filesystem that you might specify with bind mounts.
Pros
The user does not need to remember a hardcoded host path; only the correct volume name is needed.
Command
docker volume create/rm/ls/inspect/prune [-d local] volName
-v volumeName:containerPath
-v containerPath
--mount type=volume,src={volumeName},dst={containerPath}
How to keep data out of docker commit
Problem: Will changes inside the /test directory be committed by docker commit?
Solution:
Data inside a volume will not be committed by the "docker commit" command.
In-memory storage
Use case
Most service software and web applications use private key files, database passwords, API key files, or other sensitive configuration files, and need upload buffering space.
In these cases, it is important that you never include those types of files in an image or write them to disk.
Internal mechanism
Linux tmpfs:
tmpfs volume
Linux network
Linux Net namespace
Each net namespace has an independent network stack, including:
Network interface controller
IP/MAC
ARP
iptables/ipvs
Socket
Network parameters with namespace attributes
Connect two net namespaces
Linux veth pair
Connect multiple net namespaces
Problem: Connecting every pair of namespaces directly with veth pairs needs n(n-1)/2 pairs for n namespaces, which grows quadratically.
Linux bridge
Routing table
Docker network - Flannel
Use case: Cross node communication
UDP based flannel
Idea
A layer 3 overlay network that encapsulates the original IP packets inside UDP.
Process
Steps with flowchart
Container A's address is 172.17.8.2/24.
Container A wants to visit container B 172.17.9.2.
Container A has a default routing rule: default via 172.17.8.1 dev eth0
The packet is sent to the docker0 bridge (172.17.8.1) according to the routing rule.
Physical machine A has a routing rule: 172.17.0.0/16 via 172.17.0.0 dev flannel0
A packet sent to container B 172.17.9.2 will be transferred to the device flannel0 (a virtual TUN device created by the flanneld process).
flanneld on machine A will encapsulate the packet inside UDP, using the IP addresses of machines A and B as the outer source and destination.
flanneld on machine A will send it to flanneld on machine B.
flanneld on machine B will extract the original packet.
Physical machine B has a routing rule: 172.17.9.0/24 dev docker0 proto kernel scope link src 172.17.9.1
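The routing decisions in these steps are longest-prefix matches, which can be sketched with the standard ipaddress module. The route tables below are simplified assumptions rather than real output: each host is assumed to reach the whole flannel range 172.17.0.0/16 through the TUN device flannel0 and its own containers through docker0. The lookup function is hypothetical.

```python
import ipaddress


def lookup(routes, dst):
    """Longest-prefix match: return the device of the most specific matching route."""
    addr = ipaddress.ip_address(dst)
    matches = []
    for prefix, device in routes:
        network = ipaddress.ip_network(prefix)
        if addr in network:
            matches.append((network.prefixlen, device))
    if not matches:
        raise LookupError(f"no route to {dst}")
    return max(matches)[1]


# Simplified, assumed route tables for the walkthrough.
CONTAINER_A = [("0.0.0.0/0", "eth0")]                 # default via 172.17.8.1
HOST_A = [("172.17.8.0/24", "docker0"), ("172.17.0.0/16", "flannel0")]
HOST_B = [("172.17.9.0/24", "docker0"), ("172.17.0.0/16", "flannel0")]

dst = "172.17.9.2"  # container B
print(lookup(CONTAINER_A, dst))  # leaves the container through eth0
print(lookup(HOST_A, dst))       # handed to flanneld on machine A
print(lookup(HOST_B, dst))       # delivered locally on machine B
```

On machine A only the broader flannel route matches, so the packet goes to flanneld; on machine B the more specific docker0 route wins, so the decapsulated packet is delivered to the local bridge.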
Limitations
Requires three copies between kernel and user space, because packets need to pass through the TUN device (flannel0):
User's IP packet passes through docker0 bridge.
IP packet passes through TUN device (flannel0) and to flanneld process inside user space.
IP packet gets encapsulated inside UDP protocol and enters kernel space.
VXLAN based flannel
Idea
VXLAN builds an overlay network on top of the hosts, so that the containers and virtual machines attached to it can talk to each other freely. Instead of a TUN device with encapsulation done by the flanneld process, it uses a VTEP device, which operates on layer 2 (Ethernet frames).
The benefit is that VXLAN avoids the switches between user and kernel space seen in UDP based flannel, because VXLAN encapsulation is done inside the Linux kernel.
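To make the encapsulation concrete, here is a sketch of the 8-byte VXLAN header defined in RFC 7348, which the VTEP places in front of the inner Ethernet frame (the outer UDP, IP, and Ethernet headers are then added by the kernel and are not modelled here). The VNI value 1 and the placeholder frame bytes are illustrative assumptions, and the function names are hypothetical.

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # the "I" flag: the VNI field is valid


def pack_vxlan_header(vni):
    """Build the 8-byte header: 8 bits flags, 24 reserved, 24 bits VNI, 8 reserved."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def unpack_vni(packet):
    """Read the VNI back from the first 8 bytes of a VXLAN payload."""
    word1, word2 = struct.unpack("!II", packet[:8])
    if not (word1 >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI flag not set")
    return word2 >> 8


def encapsulate(inner_frame, vni):
    # The inner Ethernet frame is carried unchanged behind the header.
    return pack_vxlan_header(vni) + inner_frame


inner = b"placeholder-ethernet-frame"
packet = encapsulate(inner, vni=1)
print(packet[:8].hex(), unpack_vni(packet), packet[8:] == inner)
```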
Process
host-gw based flannel
Idea
Set the next hop of each flannel subnet (e.g. 10.244.1.0/24) to the IP address of the host machine that owns it. The host machine serves as the gateway, which is how this approach gets its name.
When an IP packet is wrapped in an Ethernet frame and sent out, the next hop address in the routing table is used to set the destination MAC address.
Flannel subnet and host machine information are all stored in etcd.
flanneld only needs to watch the corresponding etcd directory and update the routing table.
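The routes that flanneld maintains in this mode can be derived mechanically from the subnet-to-host mapping. A minimal sketch, assuming the mapping has already been read from etcd into a dict; the host addresses, the eth0 device name, and the host_gw_routes helper are all hypothetical.

```python
def host_gw_routes(leases, local_host_ip, device="eth0"):
    """Return one route per remote flannel subnet, with the owning host as next hop."""
    routes = []
    for subnet, host_ip in sorted(leases.items()):
        if host_ip == local_host_ip:
            continue  # the local subnet is reached through the local bridge instead
        routes.append(f"{subnet} via {host_ip} dev {device}")
    return routes


# Assumed contents of the etcd directory: flannel subnet -> host machine IP.
LEASES = {
    "10.244.0.0/24": "10.168.0.2",
    "10.244.1.0/24": "10.168.0.3",
    "10.244.2.0/24": "10.168.0.4",
}

for route in host_gw_routes(LEASES, local_host_ip="10.168.0.2"):
    print(route)
```

When a host joins or leaves, the mapping changes and the route list is recomputed; no encapsulation is involved, which is where the performance advantage comes from.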
Process
Limitation
It requires that host machines are connected to each other at layer 2, because the next hop must be directly reachable by MAC address.
When host machines are located in different VLANs, they are not connected at layer 2.
Calico
Idea
Instead of using etcd to store routing information, it uses the BGP protocol to distribute routes between hosts. By not going through an overlay network, it avoids the performance cost of the additional encapsulation layer.
BGP (Border Gateway Protocol): Exchanges routing information between autonomous systems. It is far more scalable than having each machine maintain its own routing information independently.
Components
CNI plugin: The place where Calico connects with Kubernetes.
Felix: Runs as a DaemonSet; responsible for writing routing rules on host machines and maintaining Calico's network devices.
Bird: The BGP client, responsible for distributing routing information.
Modes
Node-to-Node Mesh
The BGP client on each host machine needs to establish a connection with all other BGP clients.
Cons: As the number of nodes increases, the number of connections grows quadratically (n(n-1)/2 for n nodes). So typically this mode is used in clusters with fewer than 100 nodes.
Route Reflector
Designate specific nodes as route reflectors, which establish BGP connections with the other nodes and learn the global routing rules. All other nodes only need to talk with these few nodes.
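The difference between the two modes is easy to quantify. The sketch below counts BGP sessions under simplifying assumptions: in the route reflector case every ordinary node peers with every reflector, and the reflectors form a full mesh among themselves. The function names and node counts are illustrative only.

```python
def full_mesh_sessions(nodes):
    """Every node peers with every other node."""
    return nodes * (nodes - 1) // 2


def route_reflector_sessions(nodes, reflectors):
    """Ordinary nodes peer only with the reflectors; reflectors mesh with each other."""
    clients = nodes - reflectors
    return clients * reflectors + full_mesh_sessions(reflectors)


for n in (100, 1000):
    print(n, full_mesh_sessions(n), route_reflector_sessions(n, reflectors=3))
```

With 3 reflectors, 100 nodes need 294 sessions instead of 4950, and 1000 nodes need 2994 instead of 499500, so the session count grows linearly with cluster size rather than quadratically.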
IPIP
Set host machine as BGP peers
Process
Docker network
Host
Bridge
Docker file
Write a dockerfile
Dockerfile instructions are processed in order, and each instruction that changes the file system (such as RUN, COPY, and ADD) generates a new image layer.
Docker commands
References
Real world
Netflix container journey: https://netflixtechblog.com/the-evolution-of-container-usage-at-netflix-3abfc096781b