Hugepages and KSM on Linux

While looking into Linux optimization some time ago, I came across two cool features: huge pages and KSM.

Huge pages

The Linux virtual memory (VM) subsystem provides memory to the system in blocks called pages. By default, each page is 4 KiB in size, so in theory a system with 2 GiB of RAM holds 524,288 pages of 4 KiB each. Along with the payload, each page carries some control bits. These bits are scanned by kscand so that the VM subsystem can manage the page life cycle. Each page also has an entry in the page table. Thus, the more pages you have, the more resources are needed to manage all of them.
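As a quick check of the arithmetic above, the sketch below computes how many pages are needed to cover a given amount of RAM. The helper name pages_for is made up for this illustration. getconf PAGESIZE is a standard way to ask the running system for its base page size.

```shell
#!/bin/sh
# pages_for is a hypothetical helper: RAM size in bytes / page size in bytes.
pages_for() {
    echo $(( $1 / $2 ))
}

ram=$(( 2 * 1024 * 1024 * 1024 ))         # 2 GiB

pages_for "$ram" 4096                     # 4 KiB pages -> 524288
pages_for "$ram" $(( 2 * 1024 * 1024 ))   # 2 MiB pages -> 1024

# Base page size of the running system, in bytes.
getconf PAGESIZE
```

With 2 MiB pages, the same 2 GiB is covered by 1,024 entries instead of 524,288, which is where the saving in bookkeeping comes from.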

To reduce that overhead, Linux can use larger pages: 4 MiB pages on x86 systems and 2 MiB pages on x86_64. You can enable this by setting the following in the .config file of your kernel source:

CONFIG_HUGETLBFS=y
CONFIG_HUGETLB_PAGE=y

Once you have a kernel compiled with these two options enabled, you can allocate a number of huge pages using the sysctl tool:

sysctl -w vm.nr_hugepages=10

This tells Linux to reserve space for 10 huge pages. You can check the result:

[root@localhost ~] grep ^Huge /proc/meminfo
HugePages_Total: 10
HugePages_Free: 10
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB

On an x86_64 system, that means 2 MiB * 10 = 20 MiB. Each huge page must be physically contiguous, so it is recommended to allocate huge pages at boot time, before memory becomes fragmented. Huge pages cannot be moved to swap space either.

Huge pages are commonly used on virtualization hosts and database servers.
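The 2 MiB * 10 = 20 MiB calculation can also be read straight from /proc/meminfo. The following is a minimal sketch. The helper name hugepages_reserved_kib is an assumption made for this example, and it only relies on the HugePages_Total and Hugepagesize fields shown in the output above.

```shell
#!/bin/sh
# hugepages_reserved_kib is a hypothetical helper. It multiplies
# HugePages_Total by Hugepagesize (reported in kB) from a meminfo-style file.
hugepages_reserved_kib() {
    awk '/^HugePages_Total:/ { total = $2 }
         /^Hugepagesize:/    { size = $2 }
         END                 { print total * size }' "${1:-/proc/meminfo}"
}

# On a Linux host, print the memory currently set aside for huge pages.
if [ -r /proc/meminfo ]; then
    echo "$(hugepages_reserved_kib) kB reserved for huge pages"
fi
```

For the sample output above (10 pages of 2048 kB) the helper reports 20480 kB, that is, 20 MiB.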

KSM

Kernel Samepage Merging (KSM) is a mechanism used mainly by hypervisors to save memory. When it is enabled, a kernel thread scans memory looking for pages with the same content but different owners. When it finds them, Linux merges them and maps both owners to the same location. The new shared page is also marked copy-on-write: as soon as a process needs to modify that page, Linux splits it into two pages again. To use KSM, the kernel has to be compiled with CONFIG_KSM=y. Once that is done, you can start the KSM scanner with:

echo 1 > /sys/kernel/mm/ksm/run

Now your kernel will scan 100 pages every 20 milliseconds by default. You can change this by writing to the files pages_to_scan and sleep_millisecs in the /sys/kernel/mm/ksm directory. To monitor KSM:

eduardo@symphony:~$ ( cd /sys/kernel/mm/ksm/ ; while true ; do clear ; for i in * ; do echo -n "${i}: " ; cat ${i} ; done ; sleep 1 ; done )
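To put numbers on those tunables, the sketch below estimates the scan rate and the memory KSM is currently saving. Both helper names are made up for this example. The saving is only an approximation based on the pages_sharing counter in the same sysfs directory, and the scan rate is an upper bound because the time spent scanning is ignored.

```shell
#!/bin/sh
# Both helpers are hypothetical names used only for this sketch.

# Upper bound on pages scanned per second:
# pages_to_scan * 1000 / sleep_millisecs.
ksm_scan_rate() {
    echo $(( $1 * 1000 / $2 ))
}

# Approximate saving in KiB: pages_sharing * page size / 1024.
# $1 is the KSM sysfs directory, $2 the page size in bytes.
ksm_saved_kib() {
    dir=${1:-/sys/kernel/mm/ksm}
    page=${2:-$(getconf PAGESIZE)}
    sharing=$(cat "$dir/pages_sharing")
    echo $(( sharing * page / 1024 ))
}

ksm_scan_rate 100 20    # defaults from the text -> 5000 pages per second
```

On a host with KSM running, calling ksm_saved_kib without arguments reads the live counter from /sys/kernel/mm/ksm.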

For more details, see hugetlbpage.txt and ksm.txt in the Documentation folder of your Linux source.

LVS and keepalived – An example

The purpose of this post is just to show an example of a Linux box operating as a load balancer with LVS and keepalived together.

Briefing:

LVS, or Linux Virtual Server, is a feature of the Linux kernel for load balancing services on a Linux box. Check the official site for more.

keepalived is routing software that implements VRRP in order to manage a floating gateway address and failover. Here you can read more.

The components:

  • Two Linux boxes with LVS and keepalived: one master and one slave (backup), which takes over if the master fails.
  • Two web servers. I will assume you already have the web servers configured and working.

[Figure: LVS overview]

Configuration

LVS is a kernel feature, but the ipvsadm package is needed to manage it. On CentOS, you can install it with:

yum install -y ipvsadm keepalived

keepalived is also shipped with most distros.

Once you have installed both ipvsadm and keepalived, let's configure them. The /etc/keepalived/keepalived.conf file follows, with comments:

global_defs {
  router_id LVS_DEVEL
}

router_id is just a string that identifies this machine.

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        100.10.20.50
    }
}

The block above defines the virtual router instance. On the first machine its state is MASTER. On the second machine, use BACKUP instead (keepalived has no SLAVE state). The virtual_router_id must be the same on both machines, since it identifies the VRRP instance they share. The priority is what should differ: give the master the highest value and each backup a lower one. When there are two or more backups, the priority defines which of them becomes MASTER first.
The authentication section is needed so that the keepalived servers trust each other.
The virtual_ipaddress is the IP that the clients will connect to.
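Since the second machine needs almost the same file, its configuration can be derived from the master's. The following is only a sketch: make_backup_conf is a hypothetical helper, the priority of 90 is an arbitrary value lower than the master's 100, and the sed patterns assume the single-space layout used in the example above. keepalived's state keyword accepts MASTER or BACKUP, and virtual_router_id is left untouched because VRRP peers share it.

```shell
#!/bin/sh
# make_backup_conf is a hypothetical helper: it reads a MASTER keepalived.conf
# and prints a BACKUP variant with a lower priority (90 is an arbitrary choice).
make_backup_conf() {
    sed -e 's/state MASTER/state BACKUP/' \
        -e 's/priority [0-9][0-9]*/priority 90/' "$1"
}

# Example (paths are placeholders):
#   make_backup_conf /etc/keepalived/keepalived.conf > keepalived.backup.conf
```

Review the generated file before copying it to the second machine, since everything else (interface, authentication, virtual_ipaddress) is passed through unchanged.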

virtual_server 100.10.20.50 80 {
    delay_loop 15
    lb_algo rr
    lb_kind DR
    persistence_timeout 50
    protocol TCP

    real_server 100.10.20.55 80 {
        HTTP_GET {
            url {
                path /
                status_code 200
            }
        }
    }
    real_server 100.10.20.56 80 {
        HTTP_GET {
            url {
                path /
                status_code 200
            }
        }
    }
}

delay_loop defines the polling interval for the health checks, in seconds. lb_algo rr selects round-robin scheduling. lb_kind DR selects direct routing, so no tunneling or NAT is needed. persistence_timeout is the persistence time in seconds for LVS. The real_server sections define the web servers. Both use the HTTP_GET checker: keepalived issues an HTTP GET for the ‘/’ path and expects a 200 status code. If the check fails, keepalived stops forwarding to that server until it returns a 200 status code again.
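For reference, the virtual_server block above corresponds roughly to a handful of ipvsadm commands. The sketch below only prints them, so it is safe to run anywhere. lvs_commands is a hypothetical helper, and the scheduler and persistence values are hard-coded to match this example.

```shell
#!/bin/sh
# lvs_commands is a hypothetical helper that only PRINTS ipvsadm commands;
# nothing is applied to the kernel. Arguments: VIP:port, then real servers.
lvs_commands() {
    vip=$1
    shift
    # -A add virtual service, -t TCP service, -s rr round-robin,
    # -p 50 persistence timeout in seconds.
    echo "ipvsadm -A -t $vip -s rr -p 50"
    for rs in "$@"; do
        # -a add real server, -r its address, -g direct routing (gatewaying).
        echo "ipvsadm -a -t $vip -r $rs -g"
    done
}

lvs_commands 100.10.20.50:80 100.10.20.55:80 100.10.20.56:80
```

Once keepalived is running, ipvsadm -L -n lists the table it actually built, which is a convenient way to compare against these commands.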

As soon as keepalived is started, it configures the VIP on the interface defined in vrrp_instance.

Now, on the web servers, a dummy interface must be configured so that they can accept packets destined for the VIP 100.10.20.50. First load the dummy module:

modprobe dummy

Create the dummy interface and configure it:

ip link add dummy0 type dummy
ip link set dummy0 up
ip addr add 100.10.20.50/32 dev dummy0

Source based routing

Most network routing is based on the destination. But sometimes you may need to forward packets to different gateways depending on the source.

In Linux you can do this using the iproute2 package. It uses the netlink socket interface to handle addressing, routing, queuing and scheduling in the Linux network subsystem. An example follows.

Define a label for the table to be used:

echo "10 foo" >> /etc/iproute2/rt_tables

Insert a route into the foo table:

ip route add default via 10.10.10.1 dev eth1 table foo

Insert a rule with a low priority number (lower numbers are evaluated first) so that traffic from the host 192.168.16.7 consults the new table foo:

ip rule add prio 10 from 192.168.16.7 lookup foo

You can check the rules with:

ip rule show
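One caveat with the echo used for the table label: running it twice appends the same entry twice. A minimal sketch of an idempotent variant follows. add_rt_table is a hypothetical helper, and the optional third argument exists only so it can be tried on a scratch file instead of /etc/iproute2/rt_tables.

```shell
#!/bin/sh
# add_rt_table is a hypothetical helper. It appends "<id> <name>" to an
# rt_tables-style file only if no non-comment line already uses that name.
# $3 defaults to /etc/iproute2/rt_tables (writing there needs root).
add_rt_table() {
    file=${3:-/etc/iproute2/rt_tables}
    if ! awk -v name="$2" \
        '!/^#/ && $2 == name { found = 1 } END { exit !found }' "$file"
    then
        echo "$1 $2" >> "$file"
    fi
}

# Example on a scratch copy:
#   add_rt_table 10 foo /tmp/rt_tables
```

After adding the route, ip route show table foo displays what the new table contains.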

See the man pages (man ip) for more information.

Shell tricks

There are many keys and commands that can make our lives easier. Here are some:

Read a text file inside a tar.xz file:

tar -JxOf samba-4.0.9.tar.xz samba-4.0.9/source4/scripting/bin/samba_backup | less
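The same trick can be wrapped in a small function. This is a sketch: tar_cat is a hypothetical name, and it assumes a tar that detects the compression format on its own when reading an archive given with -f, as GNU tar and bsdtar do.

```shell
#!/bin/sh
# tar_cat is a hypothetical helper: print one archive member to stdout.
# -x extracts, -O sends the data to stdout instead of the disk, and -f names
# the archive. Compression detection is left to tar (an assumption, see above).
tar_cat() {
    tar -xOf "$1" "$2"
}

# List the members first to find the exact path, then page through one:
#   tar -tf samba-4.0.9.tar.xz | less
#   tar_cat samba-4.0.9.tar.xz samba-4.0.9/source4/scripting/bin/samba_backup | less
```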

Command correction:

nkdir -v /tmp/foo
bash: nkdir: command not found
^nkdir^mkdir
mkdir -v /tmp/foo

See the differences between a file on two remote machines:

 diff <(ssh server1 'cat file') <(ssh server2 'cat file') 

Or installed packages:

 diff <(ssh server1 'rpm -qa | sort') <(ssh server2 'rpm -qa | sort') 

You have an alias with the same name as a command, but you want to run the command, not the alias:

 alias vi=vim
\vi 

You can see these and many others here.

Partition shrink

Sometimes we need to resize our storage area. Normally we expand volumes and never shrink them. Although it is not common, shrinking is possible too. Surfing the web, I found this great article.

My tests worked flawlessly! I extended the article above by also resizing the virtual disk image file with qemu-img.

qemu-img convert -f qcow2 -O raw resize.img resize_raw.img
qemu-img resize resize_raw.img 5360321024
qemu-img convert -f raw -O qcow2 resize_raw.img resize.img

5360321024 is exactly the size in bytes of the sum of all partitions.
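One way to arrive at such a byte count is from the end sector of the last partition, as printed by a tool like fdisk -l. The sketch below assumes 512-byte sectors. image_bytes is a hypothetical helper, and the end sector 10469376 is not taken from the original test: it was chosen for this illustration because it reproduces the size used above.

```shell
#!/bin/sh
# image_bytes is a hypothetical helper: bytes needed to hold everything up to
# and including the given end sector. Sector size defaults to 512 bytes.
image_bytes() {
    echo $(( ($1 + 1) * ${2:-512} ))
}

image_bytes 10469376    # -> 5360321024
```

Sectors are numbered from 0, hence the + 1. Shrinking the image below this value would cut into the last partition.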

LinuX Containers in Slackware64

Linux Containers, or LXC, is a type of virtualization at the process level. It relies on the cgroup feature of Linux and creates namespaces to give its processes an isolated environment. More details can be found here.

This is a lightweight model of virtualization, because it isn’t a full virtual machine. The same kernel that runs on the host directly serves the isolated processes. Using containers has a security benefit, since an encapsulated process doesn’t see processes outside the container, and you typically run only the needed services in it.

As I said before, the magic happens through the cgroup kernel feature, so it must be enabled. The lxc userspace tools must be installed as well. Some templates come with the lxc tools, for Debian, Fedora, Ubuntu… but I never tested them. I have tested on my Slackware64 14.0, following this template. Worth reading.
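To see whether cgroup support is enabled on the running kernel, one option is to look at /proc/cgroups. The following is a minimal sketch. enabled_cgroups is a hypothetical helper, and it assumes the usual four-column layout of that file (subsystem name, hierarchy, number of cgroups, enabled flag).

```shell
#!/bin/sh
# enabled_cgroups is a hypothetical helper. It prints the controllers whose
# "enabled" column (the 4th) is 1 in a /proc/cgroups-style file.
enabled_cgroups() {
    awk '!/^#/ && $4 == 1 { print $1 }' "${1:-/proc/cgroups}"
}

# On a Linux host, list the controllers the running kernel has enabled.
if [ -r /proc/cgroups ]; then
    enabled_cgroups
fi
```

The lxc package also ships an lxc-checkconfig script that performs a more complete check of the kernel configuration.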