惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

F
Full Disclosure
WordPress大学
WordPress大学
小众软件
小众软件
Cloudbric
Cloudbric
AWS News Blog
AWS News Blog
腾讯CDC
量子位
人人都是产品经理
人人都是产品经理
大猫的无限游戏
大猫的无限游戏
freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More
V
Vulnerabilities – Threatpost
Scott Helme
Scott Helme
Hugging Face - Blog
Hugging Face - Blog
博客园_首页
C
CXSECURITY Database RSS Feed - CXSecurity.com
The Hacker News
The Hacker News
奇客Solidot–传递最新科技情报
奇客Solidot–传递最新科技情报
IT之家
IT之家
Jina AI
Jina AI
Attack and Defense Labs
Attack and Defense Labs
S
SegmentFault 最新的问题
Simon Willison's Weblog
Simon Willison's Weblog
The Cloudflare Blog
阮一峰的网络日志
阮一峰的网络日志
T
Tailwind CSS Blog
Last Week in AI
Last Week in AI
博客园 - 【当耐特】
Google Online Security Blog
Google Online Security Blog
美团技术团队
OSCHINA 社区最新新闻
OSCHINA 社区最新新闻
V
Visual Studio Blog
罗磊的独立博客
L
LINUX DO - 最新话题
博客园 - Franky
博客园 - 叶小钗
Apple Machine Learning Research
Apple Machine Learning Research
The Last Watchdog
The Last Watchdog
J
Java Code Geeks
AI
AI
C
Cisco Blogs
酷 壳 – CoolShell
酷 壳 – CoolShell
C
Cyber Attacks, Cyber Crime and Cyber Security
Cisco Talos Blog
Cisco Talos Blog
博客园 - 三生石上(FineUI控件)
雷峰网
雷峰网
Help Net Security
Help Net Security
钛媒体:引领未来商业与生活新知
钛媒体:引领未来商业与生活新知
云风的 BLOG
云风的 BLOG
I
Intezer
S
Securelist

evilsocket

Mongoose: Preauth RCE and mTLS Bypass on Millions of Devices TP-Link Tapo C200: Hardcoded Keys, Buffer Overflows and Privacy in the Era of AI Assisted Reverse Engineering How to Write an Agent Attacking UNIX Systems via CUPS, Part I Introducing Bettercap 2.4.0: CAN-Bus Hacking, WiFi Bruteforcing and Builtin Web UI Enumerate/Bruteforce/Attack All the Things! Presenting Legba Reverse Engineering the Apple MultiPeer Connectivity Framework Hide Your Servers in Plain Sight, Presenting ShieldWall Just Taking a Break :D Weaponizing and Gamifying AI for WiFi Hacking: Presenting Pwnagotchi 1.0.0 How to Create a Malware Detection System With Machine Learning Pwning WPA/WPA2 Networks With Bettercap and the PMKID Client-Less Attack Presenting Project Ergo: How to Build an Airplane Detector for Satellite Imagery With Deep Learning Project PITA: Build a Mini Mass Deauther Using Bettercap and a Raspberry Pi Zero W Go Is Amazing, So Here's What I Don't Like About It All Hail Bettercap 2.0, One Tool to Rule Them All. DIY Portable Secrets Manager With a Raspberry Pi Zero and ARC This Is Not a Post About BLE, Introducing BLEAH Hacking a Herb Vaporizer to Set Its Temperature Limit From 190C to 6553.5C Remotely
Process Behaviour Anomaly Detection Using eBPF and Unsupervised-Learning Autoencoders
Simone Margaritelli · 2022-08-15 · via evilsocket

Hello everybody, I hope you’ve been enjoying this summer after two years of Covid and lockdowns :D In this post I’m going to describe how to use eBPF syscall tracing in a creative way in order to detect process behaviour anomalies at runtime using an unsupervised learning model called autoencoder.

anomalies

While many projects approach this problem by building a list of allowed system calls and checking at runtime if the process is using anything outside of this list, we’ll use a methodology that will not only save us from explicitly compiling this list, but will also take into account how fast the process is using system calls that would normally be allowed but only within a certain range of usage per second. This techique can potentially detect process exploitation, denial-of-service and several other types of attacks.

You’ll find the complete source code on my Github as usual.

What is eBPF?

eBPF is a technology that allows to intercept several aspect of the Linux kernel runtime without using a kernel module. At its core eBPF is a virtual machine running inside the kernel that performs sanity checks on an eBPF program opcodes before loading it in order to ensure runtime safety.

From the eBPF.io page:

eBPF (which is no longer an acronym for anything) is a revolutionary technology with origins in the Linux kernel that can run sandboxed programs in a privileged context such as the operating system kernel. It is used to safely and efficiently extend the capabilities of the kernel without requiring to change kernel source code or load kernel modules.  

Historically, the operating system has always been an ideal place to implement observability, security, and networking functionality due to the kernel’s privileged ability to oversee and control the entire system. At the same time, an operating system kernel is hard to evolve due to its central role and high requirement towards stability and security. The rate of innovation at the operating system level has thus traditionally been lower compared to functionality implemented outside of the operating system.

ebpf

eBPF changes this formula fundamentally. By allowing to run sandboxed programs within the operating system, application developers can run eBPF programs to add additional capabilities to the operating system at runtime. The operating system then guarantees safety and execution efficiency as if natively compiled with the aid of a Just-In-Time (JIT) compiler and verification engine. This has led to a wave of eBPF-based projects covering a wide array of use cases, including next-generation networking, observability, and security functionality.

There are several options to compile into bytecode and then run eBPF programs, such as Cilium Golang eBPF package, Aya Rust crate and IOVisor Python BCC package and many more. BCC being the simplest is the one we’re going to use for this post. Keep in mind that the same exact things can be done with all these libraries and only runtime dependencies and performance would change.

System call Tracing with eBPF

The usual approach to trace system calls with eBPF consists in creating a tracepoint or a kprobe on each system call we want to intercept, somehow fetch the arguments of the call and then report each one individually to user space using either a perf buffer or a ring buffer. While this method is great to track each system call individually and check their arguments (for instance, checking which files are being accessed or which hosts the program is connecting to), it has a couple of issues.

First, reading the arguments for each syscall is quite tricky depending on the system architecture and kernel compilation flags. For instance in some cases it’s not possible to read the arguments while entering the syscall, but only once the syscall has been executed, by saving pointers from a kprobe and then reading them from a kretprobe. Another important issue is the eBPF buffers throughput: when the target process is executing a lot of system calls in a short period of time (think about an HTTP server under heavy stress, or a process performing a lot of I/O), events can be lost making this approach less than ideal.

Poor man’s Approach

kiss

Since we’re not interested in the system calls arguments, we’re going to use an alternative approach that doesn’t have the aforementioned issues. The main idea is very very simple: we’re going to have a single tracepoint on the sys_enter event, triggered every time any system call is executed. Instead of immediately reporting the call to userspace via a buffer, we’re only going to increment the relative integer slot in an array, creating an histogram.

This array is 512 integers long (512 set as a constant maximum number of system calls), so that after (for instance) system call read (number 0) is executed twice and mprotect (number 10) once, we’ll have a vector/histogram that’ll look like this:

2,0,0,0,0,0,0,0,0,0,1,0,0,0,..........

The relative eBPF is very simple and looks like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22

BPF_PERCPU_ARRAY(histogram, u32, MAX_SYSCALLS);


TRACEPOINT_PROBE(raw_syscalls, sys_enter)
{

u64 pid = bpf_get_current_pid_tgid() >> 32;
if(pid != TARGET_PID) {
return 0;
}


u32 key = (u32)args->id;
u32 value = 0, *pval = NULL;
pval = histogram.lookup_or_try_init(&key, &value);
if(pval) {
*pval += 1;
}

return 0;
}

So far no transfer of data to user space is performed, so no system call invocation is lost and everything is accounted for in this histogram.

We’ll then perform a simple polling of this vector from userspace every 100 milliseconds and, by comparing the vector to its previous state, we’ll calculate the rate of change for every system call:

1
2
3
4
5
6
7
8
9
10
11
12
13

while 1:

histogram = [histo_map[s] for s in range(0, MAX_SYSCALLS)]

if histogram != prev:

deltas = [ 1.0 - (prev[s] / histogram[s]) if histogram[s] != 0.0 else 0.0 for s in range(0, MAX_SYSCALLS)]
prev = histogram



time.sleep(args.time / 1000.0)

This will not only take into account which system calls are executed (and the ones that are not executed, thus having counter always to 0), but also how fast they are executed during normal activity in a given amount of time.

Once we have this data saved to a CSV file, we can then train a model that’ll be able to detect anomalies at runtime.

An autoencoder is an artificial neural network used in unsupervised learning tasks, able to create an internal representation of unlabeled data (therefore the “unsupervised”) and produce an output of the same size. This approach can be used for data compression (as the internal encoding layer is usually smaller than the input) and of course anomaly detection like in our case.

autoencoder

Source: https://lilianweng.github.io/posts/2018-08-12-vae/

The main idea is to train the model and using our CSV dataset both as the input to the network and as its desired output. This way the ANN will learn what is “normal” in the dataset by correctly reconstructing each vector. When the output vector is substantially different from the input vector, we will know this is an anomaly because the ANN was not trained to reconstruct this specific one, meaning it was outside of what we consider normal activity.

Our autoencoder has 512 inputs (defined as the MAX_SYSCALLS constant) and the same number of outputs, while the internal representation layer is half that size:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
n_inputs = MAX_SYSCALLS


inp = Input(shape=(n_inputs,))

encoder = Dense(n_inputs)(inp)
encoder = ReLU()(encoder)

middle = Dense(int(n_inputs / 2))(encoder)

decoder = Dense(n_inputs)(middle)
decoder = ReLU()(decoder)
decoder = Dense(n_inputs, activation='linear')(decoder)
m = Model(inp, decoder)


m.compile(optimizer='adam', loss='mse')

For training our CSV dataset is split in training data and testing/validation data. After training the latter is used to compute the maximum reconstruction error the model presents for “normal” data:

1
2
3
4
5
6
7
8
9
10
11

y_test = model.predict(test)
test_err = []

for ind in range(len(test)):

abs_err = np.abs(test[ind, :]-y_test[ind, :])

test_err.append(abs_err.sum())

threshold = max(test_err)

We now have an autoencoder and its reference error threshold that we can use to perform live anomaly detection.

Example

Let’s see the program in action. For this example I decided to monitor the Spotify process on Linux. Due to its high I/O intensity Spotify represents a nice candidate for a demo of this approach. I captured training data while streaming some music and clicking around playlists and settings. One thing I did not do during the learning stage is clicking on the Connect with Facebook button, this will be our test. Since this action triggers system calls that are not usually executed by Spotify, we can use it to check if our model is actually detecting anomalies at runtime.

Learning from a live process

Let’s say that Spotify has process id 1234, we’ll start by capturing some live data while using it:

1
sudo ./main.py --pid 1234 --data spotify.csv --learn

Keep this running for as much as you can, having the biggest amount of samples possible is key in order for our model to be accurate in detecting anomalies. Once you’re happy with the amount of samples, you can stop the learning step by pressing Ctrl+C.

Your spotify.csv dataset is now ready to be used for training.

Training the model

We’ll now train the model for 200 epochs, you will see the validation loss (the mean square error of the reconstructed vector) decreasing at each step, indicating that the model is indeed learning from the data:

1
./main.py --data spotify.csv --epochs 200 --model spotify.h5 --train

After the training is completed, the model will be saved to the spotify.h5 file and the reference error threshold will be printed on screen:

...
Epoch 195/200
60/60 [==============================] - 0s 2ms/step - loss: 1.3071e-05 - val_loss: 6.3671e-05
Epoch 196/200
60/60 [==============================] - 0s 2ms/step - loss: 1.8221e-05 - val_loss: 5.2383e-05
Epoch 197/200
60/60 [==============================] - 0s 2ms/step - loss: 9.2132e-06 - val_loss: 5.3354e-05
Epoch 198/200
60/60 [==============================] - 0s 2ms/step - loss: 9.2722e-06 - val_loss: 4.9380e-05
Epoch 199/200
60/60 [==============================] - 0s 2ms/step - loss: 8.0692e-06 - val_loss: 5.1954e-05
Epoch 200/200
60/60 [==============================] - 0s 2ms/step - loss: 8.3448e-06 - val_loss: 5.0102e-05
model saved to spotify.h5, getting error threshold for 106 samples ...

error threshold=9.969912

Detecting anomalies

Once the model has been trained it can be used on the live target process to detect anomalies, in this case we’re using a 10.0 error threshold:

1
sudo ./main.py --pid 1234 --model spotify.h5 --max-error 10.0 --run

When an anomaly is detected the cumulative error will be printed along wiht the top 3 anomalous system calls and their respective error.

In this example, I’m clicking on the Connect with Facebook button that will use system calls such as getpriority that were previsouly unseen in training data.

We can see from the output that the model is indeed detecting anomalies:

1
2
3
4
error = 30.605255 - max = 10.000000 - top 3:
b'getpriority' = 0.994272
b'writev' = 0.987554
b'creat' = 0.969955

success

Conclusions

This post shows how by using a relatively simple approach and giving up some of the system call speficics (the arguments) we can overcome performance issues and still be able to capture enough information to perform anomaly detection. As previously said this approach works for several scenarios, from simple anomalous behaviour due to bugs, to denial of service attacks, bruteforcing and exploitation of the target process.

The overall performance of the system could be improved by using native libraries such as Aya and its accuracy with some hyper parameters tuning of the model along with more granular per-feature error thresholds.

All these things are left as an exercise for the reader :D

"Left as an exercise for the reader" closing meme