libcopp v2的第一波优化完成

I'm OWenT

国产大模型（GLM 5.1、Kimi K2.6）真实场景效果和 Coding Plan 额度测试新版本libatapp的连接管理——从etcd服务发现到拓扑驱动的自动重连新版本libatbus的设计变更——从树形路由到拓扑驱动 Protobuf又一坑 - C++标准和ABI兼容性 AI真好用-给Blog主题统一加mermaid,chart.js,excalidraw,draw.io的多种引入方式支持给内网部署Squid-通用HTTP下载缓存 UE使用CodeChecker和clang-tidy生成静态分析报告找出UE的循环依赖 C++小协程栈和临时变量及作用域的栈溢出问题分析游戏服务的可观测性能力建设（C++生态）指标上报的多线程优化和多拉取源点优化协程(libcopp)的Channel功能和CPU命中率优化通用RPC代码生成器实现strong_rc_ptr(比shared_ptr更快的引用计数智能指针) 手夯一个STL allocator和对象内存分析组件 std::condition_variable 的信号丢失问题踩坑一处（GCC）STL std::async 实现BUG导致的crash问题 GCC 14的一个warning to error BUG 给xresloader（Excel导表工具）增强UE读表支持（包含蓝图,Blueprint） Opentelemetry社区在gRPC的几个链接问题(静态库和动态库混用,musl工具链,符号裁剪) Excel转表工具(xresloader)的新验证器（验证外部Excel和文本数据，唯一性和自定义规则） protobuf v22和gRPC v1.55版本升级的依赖变化和upb适配关于protobuf近期版本（v20/v3.20+）和 gRPC v1.54版本在某些编译环境下的一些链接和编译问题 xresloader-Excel导表工具链的近期变更汇总打通游戏服务端框架的C++20协程改造的最后一环 Opentelemetry-cpp的Logs模块标准更新(涉及近期版本:1.8-1.9的BREAK CHANGES) 给cmake-toolset和工具链(curl等)加HTTP/2和HTTP/3支持又开新坑之 coredns 插件: nftables和filter 关于opentelemetry-cpp社区对于C++ Head Only组件单例和符号可见性的讨论小记填个转表工具 xresloader 去年的坑（数组尾部裁剪）集成 upb 和 lua binding 的踩坑小记 libcopp对C++20协程的接入和接口设计再度优化GCC、LLVM、Clang、libc++、libc++abi等套件的构建脚本游戏服务的分布式事务优化（二）- 事务管理游戏服务的分布式事务优化（一）- Write Ahead Log(WAL) 模块记录一些bazel适配用编译选项测试现代化硬件C++浮点数性能和一致性适配Boringssl和OpenSSL 3.0 近期cmake-toolset的一些适配问题 C++20 Text Formatting/fmtlib 适配问题小记再次重构LLVM+Clang+libcxx+libc++abi+其他相关工具的构建流程重构基于CMake的构建工具链新版GCC和LLVM+Clang终于Release啦折腾一下nftables下的双拨 [C++20] Module partitions和符号交叉引用（声明和实现分离） [Rust] 实现一个线程安全且迭代器可以保存的链表基于protobuf的代码生成几个使用protobuf中C++接口的Arena的坑 Amazon Aurora DB存储引擎论文阅读小记近期对libatapp的一些优化调整(增加服务发现和连接管理,支持yaml等) xresloader转表工具链增加了一些新功能(map,oneof支持，输出矩阵，基于模板引擎的加载代码生成等) 在游戏服务器中使用分布式事务 libcopp接入C++20 Coroutine和一些过渡期的设计 libatbus 的大幅优化 nftables初体验容器配置开发环境小计 PALM Tree - 适合多核并发架构的B+树 - 论文阅读小记跨平台协程库 - libcopp 简介 C++20 Coroutine 性能测试 (附带和libcopp/libco/libgo/goroutine/linux ucontext对比) 尝鲜Github Action 一些xresloader（转表工具）的改进 protobuf、flatbuffer、msgpack 针对小数据包的简单对比协程框架(libcopp) 小幅优化 Excel转表工具(xresloader) 增加protobuf插件功能和集成 UnrealEngine 支持 Anna（支持任意扩展和超高性能的KV数据库系统）阅读笔记 C++20 Coroutine libcopp merge boost.context 1.69.0 Google去中心化分布式系统论文三件套(Percolator、Spanner、F1)读后感 Rust玩具-企业微信机器人通用服务使用ELK辅助监控开发测试环境服务质量和问题定位 2018年的新通用伪随机数算法(xoshiro / xoroshiro)的C++(head only)实现 Webpack+vue+boostrap+ejs构建Web版GM工具 Rust的第二次接触-写个小服务器程序理解和适配AEAD加密套件 atsf4g-co的进化：协程框架v2、对象路由系统和一些其他细节优化协程框架(libcopp)v2优化、自适应栈池和同类库的Benchmark对比可执行文件压缩初识Rust 使用restructedtext编写xresloader文档 atframework的etcd模块化重构 C++的backtrace ECDH椭圆双曲线（比DH快10倍的密钥交换）算法简介和封装 protobuf-net的动态Message实现 pbc的proto3接入 atgateway内置协议流程优化-加密、算法协商和ECDH 整理一波软件源镜像同步工具+DevOps工具 Blog切换到Hugo libcopp(v2) vs goroutine性能测试 libcopp的线程安全、栈池和merge boost.context 1.64.0 GCC 7和LLVM+Clang+libc++abi 4.0的构建脚本 libatbus的几个藏得很深的bug 用cmake交叉编译到iOS和Android 开源项目得一些小维护 atapp的c binding和c#适配对象路由系统设计 2016年总结近期的一个协程流程BUG 重写了llvm+clang+libc++和libc++abi的构建脚本 atsf4g完整游戏工程示例|I'm OWenT atframework基本框架已经完成|I'm OWenT

libcopp v2的第一波优化完成

2017-07-01 · via I'm OWenT

blog-website

之前测出来libcopp还有一些列优化点，但是要破坏之前的API，所以整理了一下优化的想法和方案。

预留空间和合并分配

之前有太多的堆内存分配了，导致很多碎片。那么第一个想法就是协程对象可以分配在栈上，runner也可以分配在栈上。然后还可以加一个自定义预留长度。每个对象对齐到sizeof(long)，总长度对齐到64 Bytes。

然后暴露出一个很小的足够编译优化时放进寄存器的对象，直接拷贝这个对象来传递协程（类似boost.context的continuation）。

原来栈分配和协程对象分配是分开的，现在必须合并了。并且创建时可以指定分配多少预留空间。栈空间回收的操作必须切只能在析构结束最后执行，也就是智能指针的析构需要自己定义。

仍然保留cotask和copp两个部分，但是cotask分配在copp的预留空间中，同样，action也分配在那里。结构如下

|......call stack(执行栈)........Xcotask(协程任务)...action(协程任务可执行对象)...copp(协程上下文对象)...private buffer(私有数据内存块)...随机偏移|
                                ^
                       执行栈起始地址从这里开始

所有数据都对齐到sizeof(std::max_align_t)或sizeof(long double)，即64位系统下对齐到8/16字节。

侵入式智能指针

原来很多地方用得std::shared_ptr，这些也会导致额外的内存分配。这个目的本来是为了线程安全和生命周期管理，可以用make_shared来合并对象和管理区。如果我把协程对象分配在栈上，那就不能用make_shared了，那么内存碎片也就又出现了。所以我想再引入侵入式智能指针来解决这个问题。同时还可以选择要不要考虑线程安全，如果可以不考虑线程安全就可以节省掉原子操作的L1 cache miss的开销。

关于缓存命中率和return stack buffer问题

我和call_in_stack的作者聊了聊，有不少收获。但是他使用的避免return stack buffer失效方法我感觉有点暴力了，直接重定义掉了ret的汇编指令。这种怕是跨架构并且和后面各种编译器行为和编译优化相关性都很强。虽然说效果很明显，但是我觉得还是要放弃这种实现。不过关于缓存命中率的问题倒是可以加入一个随机长度的栈起始offset来解决。现在很多CPU的L1 cache都是32KB，然后4-8路。所以保险点的话大约有64-128ways吧。反正如果是64的话，64和0是一样的，所以可以offset从(0-127) « 3，然后就可以命中在不同的cache line上了。

但是后来我测试的时候发现，其实在第一项优化之后，由于不同的cotask（包括其内部的协程上下文和task runner）的地址间隔比较远。导致L1 Cache Read Miss增高，反而实际切换开销增大了。但是实际使用过程中，协程内部的逻辑应该会更容易导致切换时的L1 Cache Read Miss，所以这个数值应该更具有参考意义。在这种情况下，那个带offset的减少cache miss的优化根本上没有效果，所以后来又去除了。

减少原子操作

之前写libcopp merge boost 1.64和相关优化的blog里有提到这一点。主要是原子操作会导致cache miss。原来由于是用智能指针来维护生命周期，很容易传递这个协程的时候发生原子操作，另外就是每次切换上下文前，由于要检查状态，都避免不了一次load、一次CAS和一次restore。然后cotask要做一次这个操作，copp也要做一次。然后在当时的测试里，使用的是栈池，而栈池为了保证线程安全，又有一次splin lock的加锁和解锁。这导致cotask的开销比copp多了40+ns。而精简的情况下，协程操作才70ns左右。这就增加了非常多了。

所以减少原子操作，一方面是要简化cotask和copp里的两次，可以优化成一次。另外可以开启一个非线程安全选项，这时候吧原子整数换成普通整数。因为我们游戏进程里都是单线程的服务，这种开销也不需要。

右值引用

本来大部分逻辑是可以用简单的方法，无视掉低量的复制消耗的。但是随着原子操作的消耗增加，我们就不得不注意类似智能指针复制导致的这种原子开销。所以就得更加智能地判定编译器并且使用右值，因为使用右值转移智能指针的话，不涉及计数器的增减，所以也就没有这个开销。

其他方面使用右值倒不是特别重要，因为传递的东西几乎都是只有几个指针的长度，很容易在编译优化的时候放进寄存器。这样就很快了。

新的单元测试

结构的变化必然涉及新的单元测试用例，我稍微列举了下，记在这里以便后续一个一个地实现。

private buffer边界和内容测试
支持区分task_action_impl和非task_action_impl的仿函数，并且测试on_finished接口的是否被正确调用

API调整

移除了不含参数的callback支持，现在所有的托管回掉的形式都是 int (void*)。（以前接入新模型后向前兼容地保留了int ()的接口，这次涉及接口调整，所以一起移除了这个历史包袱）
不再允许把协程对象声明在栈上，现在必须分配在栈上，并且使用侵入式智能指针
协程任务系统使用侵入式智能指针，不再使用shared_ptr
不再支持共享task action，实际应用中没什么作用
允许预定义分配一段私有数据空间，保存用户数据

初步成果

目前v2版本的基本改造和流程已经实现完毕，CI上的结果如下:

###################### context coroutine (stack using stack pool) ###################
########## Cmd: ./sample_benchmark_coroutine_stack_pool 1000 1000 2048
### Round: 1 ###
allocate 1000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
create 1000 coroutine, cost time: 0 s, clock time: 5 ms, avg: 5000 ns
switch 1000 coroutine contest 1000000 times, cost time: 0 s, clock time: 83 ms, avg: 83 ns
remove 1000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 2 ###
allocate 1000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
create 1000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1000 coroutine contest 1000000 times, cost time: 0 s, clock time: 82 ms, avg: 82 ns
remove 1000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 3 ###
allocate 1000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
create 1000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1000 coroutine contest 1000000 times, cost time: 0 s, clock time: 83 ms, avg: 83 ns
remove 1000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 4 ###
allocate 1000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
create 1000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1000 coroutine contest 1000000 times, cost time: 0 s, clock time: 82 ms, avg: 82 ns
remove 1000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 5 ###
allocate 1000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
create 1000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1000 coroutine contest 1000000 times, cost time: 0 s, clock time: 83 ms, avg: 83 ns
remove 1000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns


###################### context coroutine (stack using stack pool) ###################
########## Cmd: ./sample_benchmark_coroutine_stack_pool 1 3000000 16
### Round: 1 ###
allocate 1 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
create 1 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1 coroutine contest 3000000 times, cost time: 0 s, clock time: 162 ms, avg: 54 ns
remove 1 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 2 ###
allocate 1 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
create 1 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1 coroutine contest 3000000 times, cost time: 1 s, clock time: 162 ms, avg: 54 ns
remove 1 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 3 ###
allocate 1 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
create 1 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1 coroutine contest 3000000 times, cost time: 0 s, clock time: 162 ms, avg: 54 ns
remove 1 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 4 ###
allocate 1 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
create 1 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1 coroutine contest 3000000 times, cost time: 0 s, clock time: 163 ms, avg: 54 ns
remove 1 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 5 ###
allocate 1 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
create 1 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1 coroutine contest 3000000 times, cost time: 0 s, clock time: 163 ms, avg: 54 ns
remove 1 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns


###################### context coroutine (stack using stack pool) ###################
########## Cmd: ./sample_benchmark_coroutine_stack_pool 30000 100 64
### Round: 1 ###
allocate 30000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
create 30000 coroutine, cost time: 0 s, clock time: 145 ms, avg: 4833 ns
switch 30000 coroutine contest 3000000 times, cost time: 1 s, clock time: 771 ms, avg: 257 ns
remove 30000 coroutine, cost time: 0 s, clock time: 6 ms, avg: 200 ns
### Round: 2 ###
allocate 30000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
create 30000 coroutine, cost time: 0 s, clock time: 10 ms, avg: 333 ns
switch 30000 coroutine contest 3000000 times, cost time: 1 s, clock time: 812 ms, avg: 270 ns
remove 30000 coroutine, cost time: 0 s, clock time: 4 ms, avg: 133 ns
### Round: 3 ###
allocate 30000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
create 30000 coroutine, cost time: 0 s, clock time: 9 ms, avg: 300 ns
switch 30000 coroutine contest 3000000 times, cost time: 1 s, clock time: 793 ms, avg: 264 ns
remove 30000 coroutine, cost time: 0 s, clock time: 4 ms, avg: 133 ns
### Round: 4 ###
allocate 30000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
create 30000 coroutine, cost time: 0 s, clock time: 9 ms, avg: 300 ns
switch 30000 coroutine contest 3000000 times, cost time: 0 s, clock time: 787 ms, avg: 262 ns
remove 30000 coroutine, cost time: 0 s, clock time: 5 ms, avg: 166 ns
### Round: 5 ###
allocate 30000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
create 30000 coroutine, cost time: 0 s, clock time: 9 ms, avg: 300 ns
switch 30000 coroutine contest 3000000 times, cost time: 1 s, clock time: 804 ms, avg: 268 ns
remove 30000 coroutine, cost time: 0 s, clock time: 4 ms, avg: 133 ns


###################### task (stack using stack pool) ###################
########## Cmd: ./sample_benchmark_task_stack_pool 1000 1000 2048
### Round: 1 ###
create 1000 task, cost time: 0 s, clock time: 5 ms, avg: 5000 ns
switch 1000 tasks 1000000 times, cost time: 0 s, clock time: 160 ms, avg: 160 ns
remove 1000 tasks, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 2 ###
create 1000 task, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1000 tasks 1000000 times, cost time: 0 s, clock time: 160 ms, avg: 160 ns
remove 1000 tasks, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 3 ###
create 1000 task, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1000 tasks 1000000 times, cost time: 0 s, clock time: 160 ms, avg: 160 ns
remove 1000 tasks, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 4 ###
create 1000 task, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1000 tasks 1000000 times, cost time: 0 s, clock time: 159 ms, avg: 159 ns
remove 1000 tasks, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 5 ###
create 1000 task, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1000 tasks 1000000 times, cost time: 0 s, clock time: 160 ms, avg: 160 ns
remove 1000 tasks, cost time: 0 s, clock time: 0 ms, avg: 0 ns


###################### task (stack using stack pool) ###################
########## Cmd: ./sample_benchmark_task_stack_pool 1 3000000 16
### Round: 1 ###
create 1 task, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1 tasks 3000000 times, cost time: 1 s, clock time: 351 ms, avg: 117 ns
remove 1 tasks, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 2 ###
create 1 task, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1 tasks 3000000 times, cost time: 0 s, clock time: 347 ms, avg: 115 ns
remove 1 tasks, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 3 ###
create 1 task, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1 tasks 3000000 times, cost time: 0 s, clock time: 346 ms, avg: 115 ns
remove 1 tasks, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 4 ###
create 1 task, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1 tasks 3000000 times, cost time: 1 s, clock time: 346 ms, avg: 115 ns
remove 1 tasks, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 5 ###
create 1 task, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1 tasks 3000000 times, cost time: 0 s, clock time: 346 ms, avg: 115 ns
remove 1 tasks, cost time: 0 s, clock time: 0 ms, avg: 0 ns


###################### task (stack using stack pool) ###################
########## Cmd: ./sample_benchmark_task_stack_pool 30000 100 64
### Round: 1 ###
create 30000 task, cost time: 1 s, clock time: 150 ms, avg: 5000 ns
switch 30000 tasks 3000000 times, cost time: 1 s, clock time: 1407 ms, avg: 469 ns
remove 30000 tasks, cost time: 0 s, clock time: 13 ms, avg: 433 ns
### Round: 2 ###
create 30000 task, cost time: 0 s, clock time: 14 ms, avg: 466 ns
switch 30000 tasks 3000000 times, cost time: 1 s, clock time: 1404 ms, avg: 468 ns
remove 30000 tasks, cost time: 0 s, clock time: 13 ms, avg: 433 ns
### Round: 3 ###
create 30000 task, cost time: 0 s, clock time: 14 ms, avg: 466 ns
switch 30000 tasks 3000000 times, cost time: 2 s, clock time: 1420 ms, avg: 473 ns
remove 30000 tasks, cost time: 0 s, clock time: 14 ms, avg: 466 ns
### Round: 4 ###
create 30000 task, cost time: 0 s, clock time: 16 ms, avg: 533 ns
switch 30000 tasks 3000000 times, cost time: 1 s, clock time: 1419 ms, avg: 473 ns
remove 30000 tasks, cost time: 0 s, clock time: 13 ms, avg: 433 ns
### Round: 5 ###
create 30000 task, cost time: 0 s, clock time: 14 ms, avg: 466 ns
switch 30000 tasks 3000000 times, cost time: 2 s, clock time: 1389 ms, avg: 463 ns
remove 30000 tasks, cost time: 0 s, clock time: 13 ms, avg: 433 ns

这是CI里Linux上的结果，osx和windows上的结果就不贴了，都没linux上的数据好，不过相差不大。

这一批优化过后，其实切换性能并没有提高，反而下降了，我查了一下是原先对read的L1 cache命中率也比较高，但是现在下降了。不过我认为这种miss的情况可能更能说明问题，因为现在是几乎0逻辑，一旦有逻辑之后也容易增加cache miss。究其原因，应该是新的组织方式把一个协程（特别是task）的维护数据放在了栈上，那么切换的时候切换地址就会比较遥远。而之前的方式，只要多个协程对象靠得近，而协程对象又足够小（task对象里的数据比较大，但是coroutine里的比较小），因为只有压测代码，所以new的时候malloc一般会把类似大小的对象放一起，这也增加了命中率。但是实际应用的话，必然中间会插入很多其他对象的，所以这个情况也会不贴近实际场景。但是另一方面，v2的优化也使得在使用栈池的情况下，分配开销被大幅降低了，我们的项目里由于开始需要大量的定时器，定时器回调的时候也需要在协程内进行，这也给大量定时器的触发的时候提供了更好的支持和保障。

不过我认为现在的v2任然有可以再优化的地方，本来想在优化完一波再放出这篇总结，但是一方面还没找到什么特别好的不破坏设计的点。另一方面最近看了下SSR-libev的源码，觉得他那种方式的适配的扩展性更好，所以后面可能回去优化服务器框架asf4g-co的网关层部分，可能要有比较大的改动，改变密钥协商流程，让atgateway支持协商加密方式、更多的加密算法和支持椭圆双曲线协商。所以libcopp的优化暂时搁置吧，不过以后会主要以v2版本的api为准，基本不会再变更接口了吧。

此内容由惯性聚合(RSS阅读器)自动聚合整理，仅供阅读参考。原文来自 — 版权归原作者所有。

推荐订阅源

I'm OWenT

blog-website

预留空间和合并分配

侵入式智能指针

关于缓存命中率和return stack buffer问题

减少原子操作

右值引用

新的单元测试

API调整

初步成果