Clustering KVM with Ceph Storage
2019-03-05 · via Posts on Noah Bailey

I have, for a long time, been fascinated and terrified by “Virtual SAN” solutions.

The idea of combining storage and compute seems on the surface very attractive. It allows us to scale out our storage and compute together or separately in relatively small and affordable units, helping avoid the sticker shock of the upfront cost of storage systems. And as somebody especially prone to capex-phobia, that really is a great solution.

However, on a technical level there are some major shortcomings of this type of infrastructure. For one, storage failures are by far the most feared and devastating that can happen to any individual or organization, and housing that on the relatively volatile virtual host layer seems like a very bad idea. Furthermore, many of the commercial solutions have very strict requirements on the type of hardware that can be used, and very vague documentation on how to recover the system from any sort of degraded state.

In particular, Microsoft’s Storage Spaces Direct (S2D) solution has a lack of meaningful documentation, most of it coming off more as a sales pitch than as a technical document for engineers and architects. This type of marketecture seems to be quite common in this space, with VMware’s VSAN suffering the same lack of useful information to a lesser (but still irritating) degree. And of course, there’s a slew of other systems that are more or less effective and documented.

Build it yourself

“You want a good truck, you’re gonna have to build it yourself,” as my grandfather said referring to his Chevy/Ford/fiberglass/fabricated creation. Sometimes the right system is a mixture of different off-the-shelf standard parts with a few globs of glue between them. And that goes beyond old-school farmers building their equipment out of ‘junk.’ This type of engineering is tried and true. The reason those old tractors are so reliable is not because we forgot how to build good quality stuff – It’s because we forgot how to engineer simple stuff. The solution then is to strip away all the complexities and cruft and build a very simple cluster for hosting and managing virtual workloads.

After some research and testing, this is what I’ve come up with:

Hypervisor   Storage   Provisioning   Management
KVM          CephFS    cloud-config   virsh / virt-manager / kimchi web ui

Creating the cluster

This simple cluster needs only three nodes to start. It could be as small as two, but that may limit the ability to scale out later. More on that in a bit.

Here is the reference layout:

Roles   Node 1    Node 2    Node 3
OSD     osd1      osd2      osd3
MON     mon1      mon2      mon3
MDS     primary   standby   standby
KVM     Enabled   Enabled   Enabled

In terms of storage, each node should have a small SSD for the OS and system software, as well as an SSD to be dedicated to Ceph. Though Ceph can use a partition or LVM slice, it’s much better to give it raw access to a physical device. Not only will it improve performance by not having to layer inodes and filesystems (cough cough GlusterFS), it will make the system overall much more stable by avoiding additional layers of complexity. After all, that’s what this is all about.

Preparing the nodes

In my setup, I am using three Ubuntu server 18.04 servers, all running on VMware workstation. Use your favourite distro, but be aware that some of my documentation may not line up with your system.

Make sure that they’re all up to date, stable, and have as close to identical hardware as possible. Ceph by nature requires evenly matched servers to optimally place and replicate data. Also, ensure that NTP is synced on all nodes with as much precision as possible.

A reliable DNS resolver is also recommended, though modifying the host file is also possible. Either way, make sure it is working and that all the nodes can ping each other before proceeding.

The first stage of this process will be run on a management node, which can be a server, workstation, virtual machine or laptop. Ideally it should either be a permanent server installation (fourth node) or a virtual machine that can be backed up and archived once the process is complete.

First, install the ceph deployment tools on the manager node:

wget -q -O- 'https://download.ceph.com/keys/release.asc' | sudo apt-key add -
sudo apt-add-repository 'deb https://download.ceph.com/debian-luminous/ bionic main'
sudo apt update
sudo apt install ceph-deploy

On each server, create a cephsvc user account:

sudo useradd -d /home/cephsvc -s /bin/bash -m cephsvc
sudo passwd cephsvc

This user also needs passwordless sudo on each system:

echo "cephsvc ALL = (root) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/cephsvc
sudo chmod 0440 /etc/sudoers.d/cephsvc

Install the prerequisite python on each node (osd1, osd2, osd3):

sudo apt-add-repository universe
sudo apt update && sudo apt install python-minimal -y

Generate a passwordless SSH key (required for ceph-deploy) for the admin workstation:

ssh-keygen

And copy the public key to each server:

ssh-copy-id cephsvc@osd1
ssh-copy-id cephsvc@osd2
ssh-copy-id cephsvc@osd3

Then, configure the SSH client to use this remote user and key:

~/.ssh/config

Host osd1
    Hostname osd1
    User cephsvc
Host osd2
    Hostname osd2
    User cephsvc
Host osd3
    Hostname osd3
    User cephsvc
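With more than a handful of nodes, typing these stanzas by hand gets tedious. Here is a minimal sketch that prints them instead, assuming the cephsvc account created earlier; the gen_ssh_config function name is my own invention for illustration.

```shell
#!/bin/sh
# Print an ssh_config stanza for each node.
# Usage: gen_ssh_config <user> <host>...
gen_ssh_config() {
    gsc_user=$1
    shift
    for gsc_host in "$@"; do
        printf 'Host %s\n    Hostname %s\n    User %s\n' \
            "$gsc_host" "$gsc_host" "$gsc_user"
    done
}

# Example: preview the stanzas for the three nodes used in this post.
gen_ssh_config cephsvc osd1 osd2 osd3
```

Once the output looks right, it can be appended to ~/.ssh/config with a >> redirect.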

Bootstrap the cluster

Create the data directory:

mkdir ~/ceph
cd ~/ceph

Specify the initial monitor nodes for the install:

ceph-deploy new osd1 osd2 osd3

In ~/ceph/ceph.conf specify the network of the ceph cluster. Though some documentation indicates this is not mandatory, it appears to fail during monitor deployment if this isn’t specified explicitly.

public network = 10.204.10.0/24

Install the ceph packages on the nodes:

ceph-deploy install osd1 osd2 osd3

Deploy monitors and gather keys:

ceph-deploy mon create-initial

Install the ceph keys and cluster configuration to each node:

ceph-deploy admin osd1 osd2 osd3

Install the manager node:

ceph-deploy mgr create osd1

Provision storage

Create three OSDs. These will claim and overwrite any contents of the specified disk. Be careful!

ceph-deploy osd create --data /dev/sdb osd1
ceph-deploy osd create --data /dev/sdb osd2
ceph-deploy osd create --data /dev/sdb osd3

Check the health of the cluster:

ssh osd1 sudo ceph health
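Right after the OSDs are created the cluster may take a little while to settle, so a single health check can be misleading. The sketch below polls until the output starts with HEALTH_OK. The wait_for_health helper is hypothetical, and the attempt count and delay in the usage note are arbitrary values, not recommendations.

```shell
#!/bin/sh
# Poll a health command until it reports HEALTH_OK or the attempts run out.
# Usage: wait_for_health <attempts> <delay-seconds> <command> [args...]
wait_for_health() {
    wfh_tries=$1
    wfh_delay=$2
    shift 2
    while [ "$wfh_tries" -gt 0 ]; do
        # Capture the command's output; a failing command counts as "not healthy".
        wfh_status=$("$@" 2>/dev/null) || true
        case $wfh_status in
            HEALTH_OK*) echo "$wfh_status"; return 0 ;;
        esac
        wfh_tries=$((wfh_tries - 1))
        # Sleep only if another attempt remains.
        if [ "$wfh_tries" -gt 0 ]; then sleep "$wfh_delay"; fi
    done
    echo "cluster not healthy: ${wfh_status:-no output}" >&2
    return 1
}
```

For this cluster it could be called as: wait_for_health 30 10 ssh osd1 sudo ceph health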

Metadata Service

At least one metadata node is required to use CephFS, which this cluster will depend on.

To make sure this cluster is fully redundant, all three nodes will be MDS. Only one will be active at a time.

ceph-deploy mds create osd1 osd2 osd3

Manager Nodes

At least one manager is required. It is recommended to have several in a cluster for high availability. In this case, add two more managers alongside the first one on osd1:

ceph-deploy mgr create osd2 osd3

Storage Pools

A pool is the lowest level unit of data in Ceph. CephFS, RBD, and Swift are all ways to expose pools to different connectivity types.

Pool Type       Fault Tolerance   Storage Space
Replicated      High              Low
Erasure Coded   Low               High

When creating a pool, it’s important to pick an appropriate placement group count (the two numbers passed to ceph osd pool create below are pg_num and pgp_num). See the Ceph documentation on placement groups.
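As a rough guide, a common rule of thumb from the Ceph documentation is about 100 placement groups per OSD, divided by the replica count and rounded up to the next power of two. The sketch below does that arithmetic. The suggest_pg_total function is a hypothetical helper, and its result is a starting point rather than a tuned value.

```shell
#!/bin/sh
# Rule-of-thumb placement group total for a cluster:
#   (number of OSDs * target PGs per OSD) / replica count,
# rounded up to the next power of two.
# Usage: suggest_pg_total <osd-count> <replica-count> [pgs-per-osd]
suggest_pg_total() {
    spt_raw=$(( $1 * ${3:-100} / $2 ))
    spt_pow=1
    while [ "$spt_pow" -lt "$spt_raw" ]; do
        spt_pow=$((spt_pow * 2))
    done
    echo "$spt_pow"
}

# Three OSDs with three replicas, as in this cluster.
suggest_pg_total 3 3
```

Keep in mind that this is a total for the whole cluster, so it gets shared between all the pools that live on the same OSDs.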

Example: Create a Replicated Pool

sudo ceph osd pool create reppool 50 50 replicated

Example: Create an Erasure Pool

The basic syntax replaces ‘replicated’ with ’erasure’ to specify the pool type.

sudo ceph osd pool create ecpool 50 50 erasure 

Erasure coded pools can also be tuned to balance redundancy against storage space. This is configured with the K and M values:

  • K = How many ‘chunks’ the original data will be divided into for storage. Generally, this is tied to the number of OSDs in the cluster.
  • M = Additional replica ‘chunks’ (coding chunks, in Ceph’s terminology) created to provide redundancy. The data is able to survive the loss of up to M chunks.

For this very small cluster, we only need one replica chunk (M), and two primary chunks (K) to get the job done. This is done by creating a new profile (smallcluster), and then using that profile to provision a new storage pool.

sudo ceph osd erasure-code-profile set smallcluster \
    k=2 m=1 crush-failure-domain=host 

sudo ceph osd pool create ecpool2 50 50 erasure smallcluster
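To put numbers on the storage space trade-off, here is a small sketch comparing how much of the raw disk ends up usable under each scheme. The two function names are hypothetical, the percentages are rounded down, and the replica count of 3 is an assumption for the comparison.

```shell
#!/bin/sh
# Usable share of raw capacity, as a whole-number percentage (rounded down).
# Erasure coded: k data chunks out of k+m total chunks.
# Usage: ec_usable_percent <k> <m>
ec_usable_percent() {
    echo $(( 100 * $1 / ($1 + $2) ))
}

# Replicated: one usable copy out of <size> copies.
# Usage: rep_usable_percent <size>
rep_usable_percent() {
    echo $(( 100 / $1 ))
}

echo "erasure k=2 m=1: $(ec_usable_percent 2 1)% usable"
echo "replicated size=3: $(rep_usable_percent 3)% usable"
```

The k=2, m=1 profile keeps about two thirds of the raw space, while three-way replication keeps about one third, at the price of surviving only a single lost chunk.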

More information available from the official documentation.

Special thanks to Jake Grimmett for providing a correction for the original information here.

Create a CephFS pool

This will act as a cluster shared volume for the cluster running on the system.

  1. Create a pair of pools to store metadata and data for the cephfs cluster:

     ceph osd pool create cephfs_data 50 50 replicated
     ceph osd pool create cephfs_meta 50 50 replicated
    

Note, I will be using replicated pools because of the substantially lower chances of data loss. It also allows for more resilient pools as the number of OSDs grows.

Do not mix erasure and replicated pools when building CephFS subpools.

  2. Create a CephFS system from the two pools:

     ceph fs new cephfs cephfs_meta cephfs_data 
    
  • If using erasure coded pools:

      ceph osd pool set my_ec_pool allow_ec_overwrites true 
    

More information about CephFS available from the official documentation.

Mount the CephFS pool

Now that the pool is created, it can be mounted on each node.

  1. Install the ceph-fuse package:

     sudo apt install ceph-fuse
    
  2. Create a mount point with the same name as the cephfs pool (not required but recommended)

     sudo mkdir -p /mnt/cephfs
    
  3. Configure the /etc/fstab file for using FUSE:

     none    /mnt/cephfs  fuse.ceph ceph.id=admin,_netdev,defaults  0 0
    

The cephfs kernel driver can also be used, but it is generally recommended to use FUSE instead.

After each node is ready, issue the command sudo mount -a on each and check the output of the df command. If you did everything right, you’ll see that /mnt/cephfs points to your shiny new cluster! Try adding files on one node and check from another to see if they’re there… If they’re not, troubleshoot.
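For scripting that check rather than eyeballing df, something like the following could work. It is a sketch under two assumptions: that the mount shows up in /proc/mounts with a filesystem type containing "ceph" (such as fuse.ceph-fuse), and that the hypothetical cephfs_mounted helper is good enough for a quick sanity check.

```shell
#!/bin/sh
# Report whether a path is mounted with a ceph filesystem type.
# Usage: cephfs_mounted <path> [mounts-file]   (defaults to /proc/mounts)
cephfs_mounted() {
    # Field 2 is the mount point and field 3 the filesystem type.
    awk -v target="$1" \
        '$2 == target && $3 ~ /ceph/ { found = 1 } END { exit !found }' \
        "${2:-/proc/mounts}"
}
```

On a node this would be used as: cephfs_mounted /mnt/cephfs && echo "CephFS is mounted"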

Installing KVM

The next stage is installing the hypervisor on each node. KVM has been part of the mainline Linux kernel since 2.6.20, so practically every system already has it; it’s just a matter of making sure it’s enabled and configuring it.

First, check if the CPU instructions are available:

egrep -c '(vmx|svm)' /proc/cpuinfo

If they are, install cpu-checker and test each node for KVM compatibility. If kvm-ok does not output “KVM acceleration can be used”, you may have issues.

sudo apt install cpu-checker    
kvm-ok

If your systems are good, install the qemu and libvirtd libraries:

sudo apt install qemu-kvm libvirt-daemon-system libvirt-clients bridge-utils
sudo systemctl enable libvirtd
sudo systemctl start libvirtd

You will also have to add your user to the ’libvirt’ group on each system.

sudo adduser <your-user> libvirt

You will need to log out and back in for this to take effect.

Configure the bridge network

Each node will have to be reconfigured to use the bridge interface for networking. This allows the guest VMs to share the network connections with the host system, as well as adding support for multiple VLANs and even virtual switches. If your machines have multiple NICs, you can also bond them to add network redundancy.

For this system, I am using the new Netplan.io method. If your system uses another network system such as ifupdown, you’ll need to configure it differently.

/etc/netplan/20-kvm-config.yaml (Netplan only reads files ending in .yaml)

network:
  version: 2
  ethernets:
    ens33:
      dhcp4: no
      dhcp6: no

  bridges:
    br0:
      interfaces: [ens33]
      addresses:
        - 10.20.10.31/24
      gateway4: 10.20.10.2
      nameservers:
        search:
          - intranet.mycooldomain.com
        addresses:
          - 10.20.10.11
          - 10.20.10.12

Test and apply the configuration.

sudo netplan generate
sudo netplan apply

Check the running network configuration:

networkctl status -a

After each node is reconfigured, check that ceph is replicating and that all nodes are still reachable. While the cluster can sustain one transient node failure, multiple simultaneous failures could cause issues.

Running a cloud image

Cloud images are small simplified versions of full linux systems. I like to use them for lightweight VMs and for testing systems.

First, create a directory structure on the CephFS share:

sudo mkdir -p /mnt/cephfs/{templates,virtualmachines,config}

Download the cloud image:

wget https://cloud-images.ubuntu.com/bionic/current/bionic-server-cloudimg-amd64.img
qemu-img info bionic-server-cloudimg-amd64.img 

Clone the cloud image

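Convert the image to a copy-on-write (qcow2) image stored on the CephFS volume. The -O qcow2 flag sets the output format explicitly, since qemu-img convert writes a raw image by default:

sudo qemu-img convert -f qcow2 -O qcow2 bionic-server-cloudimg-amd64.img /mnt/cephfs/templates/bionic-server-cloudimg-amd64.img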

Because it is a copy-on-write image, it’s very fast to clone it and create a VM.

qemu-img create -f qcow2 -b /mnt/cephfs/templates/bionic-server-cloudimg-amd64.img /mnt/cephfs/virtualmachines/virt-01.img

Set a root password

sudo apt install libguestfs-tools
sudo virt-customize -a /mnt/cephfs/virtualmachines/virt-01.img --root-password password:hunter2

Please don’t actually set your root password to hunter2

Generate a cloud-config

Using the cloud-config toolkit, we can create a basic desired state for this VM. Obviously, we’re scratching the surface here. Cloud-config can do a lot more than just set a hostname and import an SSH identity!

/mnt/cephfs/config/virt-01_cloudconfig.yml

#cloud-config
password: not-your-password
chpasswd: { expire: False }
ssh_pwauth: True
hostname: virt-01
ssh_authorized_keys: 
  - ssh-rsa AAAAA_My_SSH_Public_key_here

Next, the cloud-config file is packaged into its own virtual disk image. This allows us to attach the cloud-config to a VM at boot time for it to self-configure during the provisioning stage.

sudo apt install cloud-image-utils
sudo cloud-localds /mnt/cephfs/config/virt-01_cloudconfig.img /mnt/cephfs/config/virt-01_cloudconfig.yml
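When provisioning more than one VM, the cloud-config can be generated rather than copied and edited by hand. This is a minimal sketch with a hypothetical gen_cloud_config helper. Unlike the example above, it assumes key-only logins, so it drops the password and turns off ssh_pwauth.

```shell
#!/bin/sh
# Print a minimal cloud-config for one VM.
# Usage: gen_cloud_config <hostname> <ssh-public-key>
gen_cloud_config() {
    cat <<EOF
#cloud-config
hostname: $1
ssh_pwauth: False
ssh_authorized_keys:
  - $2
EOF
}
```

The output can be redirected to a file such as /mnt/cephfs/config/virt-02_cloudconfig.yml (virt-02 being a made-up second VM) and then packaged with cloud-localds exactly as above.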

Creating and Running a Virtual Machine

Finally, we can run the VM!

virt-install --name virt-01 --memory 512 --vcpus 1 \
 --disk /mnt/cephfs/virtualmachines/virt-01.img,device=disk,bus=virtio \
 --disk /mnt/cephfs/config/virt-01_cloudconfig.img,device=cdrom \
 --os-type linux --os-variant ubuntu18.04 \
 --virt-type kvm --graphics none \
 --network network=default,model=virtio --import

And in about 30 seconds the virtual machine is up and running. You can escape the VM by typing ctrl+] at the login tty.

Virtual Machine Live Migration

One of the most important parts of virtualization is the ability to keep workloads up by live migrating them between hosts. Luckily, this is very easy on KVM systems. All it requires is that tcp/22 is open for SSH, and that keys and passwords are configured correctly.

Assuming the nodes are configured correctly, all that has to be done is run the migration command to move the VM to another node in the cluster:

virsh migrate --live virt-01 qemu+ssh://my-user-name@remote/system

Check the output of virsh list --all on both the source and destination virtual hosts. The VM should now be listed under the destination machine with the status of “running”.
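To empty a host before maintenance, the same command can be repeated for every running VM. The sketch below only prints the migration commands so they can be reviewed first. The plan_migrations helper is hypothetical, and it assumes VM names without spaces or shell metacharacters.

```shell
#!/bin/sh
# Read VM names from stdin and print one live-migration command per VM.
# Nothing is executed here; review the output before running it.
# Usage: plan_migrations <ssh-user> <destination-host>
plan_migrations() {
    while IFS= read -r pm_vm; do
        # Skip blank lines, such as the one virsh prints after its list.
        [ -n "$pm_vm" ] || continue
        printf 'virsh migrate --live %s qemu+ssh://%s@%s/system\n' \
            "$pm_vm" "$1" "$2"
    done
}
```

A typical run would be: virsh list --name --state-running | plan_migrations my-user-name osd2, and once the plan looks sane, the same pipeline can be piped into sh to carry it out.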

This is the magic of CephFS + KVM! Using off the shelf tech like SSH and Qemu, we are able to quickly and easily migrate production workloads in a much simpler way than systems like Hyper-V and ESXi.

Web UI

Another common feature of a virtualization system is the web ui. There are a few to choose from, but I think these are the top choices for a simple cluster:

That being said, Proxmox oversteps the role of web UI and attempts to be a full system admin suite.

Taking it even further

Of course, virtual machines are so 2010s. Containerization is all the rage these days.

And that’s great! Kubernetes loves using CephFS for storage, and installing lxc or docker on the cluster is a logical next step. After this system is built out fully, that’s exactly what I’ll do. And really, that is the beauty of sticking the pieces together. Since we’re not at the mercy of Dell/EMC or Microsoft’s Technical Vision(TM), there are really no restrictions on what technologies this cluster will be able to support.

As always, let me know what you think. I’m always curious what others have to say about this sort of technical project.