惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

Google Online Security Blog
Google Online Security Blog
C
CXSECURITY Database RSS Feed - CXSecurity.com
C
CERT Recently Published Vulnerability Notes
C
Cybersecurity and Infrastructure Security Agency CISA
Cisco Talos Blog
Cisco Talos Blog
Hacker News - Newest:
Hacker News - Newest: "LLM"
Scott Helme
Scott Helme
Project Zero
Project Zero
E
Exploit-DB.com RSS Feed
S
Secure Thoughts
K
Kaspersky official blog
L
Lohrmann on Cybersecurity
NISL@THU
NISL@THU
WordPress大学
WordPress大学
N
News and Events Feed by Topic
钛媒体:引领未来商业与生活新知
钛媒体:引领未来商业与生活新知
L
LINUX DO - 热门话题
小众软件
小众软件
P
Privacy & Cybersecurity Law Blog
博客园 - 聂微东
Google DeepMind News
Google DeepMind News
H
Hackread – Cybersecurity News, Data Breaches, AI and More
A
About on SuperTechFans
Hacker News: Ask HN
Hacker News: Ask HN
AWS News Blog
AWS News Blog
Cyber Security Advisories - MS-ISAC
Cyber Security Advisories - MS-ISAC
H
Hacker News: Front Page
F
Full Disclosure
Latest news
Latest news
Schneier on Security
Schneier on Security
The Hacker News
The Hacker News
T
Troy Hunt's Blog
www.infosecurity-magazine.com
www.infosecurity-magazine.com
Jina AI
Jina AI
Martin Fowler
Martin Fowler
P
Proofpoint News Feed
TaoSecurity Blog
TaoSecurity Blog
G
GRAHAM CLULEY
Forbes - Security
Forbes - Security
V
V2EX - 技术
酷 壳 – CoolShell
酷 壳 – CoolShell
V
Vulnerabilities – Threatpost
C
Cyber Attacks, Cyber Crime and Cyber Security
MongoDB | Blog
MongoDB | Blog
博客园 - 三生石上(FineUI控件)
S
SegmentFault 最新的问题
Hugging Face - Blog
Hugging Face - Blog
P
Privacy International News Feed
C
Check Point Blog
N
News and Events Feed by Topic

博客园 - 加菲猫

网宿科技股份有限公司投资者关系活动记录表(2014.3.30) 网宿科技投资者关系活动记录2016年10月31日 [转载]20131206 网宿科技电话交流会纪要 strlcpy和strlcat 114 的 dns 的解析测试 分析ext2文件系统磁盘分区结构 [ZT] Linuxfs Readinglist wma/mp3等格式转换为apple有声电子书格式(m4b) 以及itunes导入码率设置 Progressive-download 对于文件格式的要求 CDN Origin Pull 牛项目 Harvest CDN设计:[笔记]Analysis of Enterprise Media Server Workloads CDN设计 - 层级化的cache_A Apple http live streaming 不支持windows? 关于pdf转doc (word) 的工具 - Solid Converter PDF Berkeley DB Hash、Btree、Queue、Recno 选择 一些校园招聘的题目和分析 6"电纸书/电子书 - PaperCrop pdf重排使用心得 A CAP Solution (Proving Brewer Wrong)
Scaling Redis
加菲猫 · 2011-06-12 · via 博客园 - 加菲猫

上佳好文,可惜被墙,怎能不转?

Scaling Redis

When a database is limited to running on a single computer, only certain load can be served with acceptable latency. Adding hardware to this single computer will help only so much. Doubling or tripling the load may require significantly more than twice or thrice cost of hardware to scale up. Such approach is expensive to scale and it eventually hits its limit. Ideally, we would start with a single inexpensive computer and as load increases we would keep on adding same inexpensive computers resulting in a near-linear function between load and cost. Such horizontal scaling out is common place in today’s web applications because it provides a more predictable cost model. 

Clusters of inexpensive commodity hardware lead to a disruption in the database ecosystem. What further amplifies this disruption in falling prices of RAM and solid state storage. A number NoSQL of databases are truly leveraging this disruption. At Meshin, one of our key requirements is the lightning-fast delivery of query results. Think of Google Instant search. Showing results with “as you type” latency enables a whole new class of use cases. This lead us to considering various in-memory database engines. Redis, with its simple and elegant data model and very transparent performance characteristics came out on top. For in-memory databases, single-threaded design becomes an obvious choice, essentially removing significant overhead common to traditional database architectures. Another assumption in Meshin design is that if the index is partially or completely lost it can be recovered by re-indexing original data. Sure, this introduces short-term inconsistencies, but on the other hand it allows to relax durability and further simplify the design. Such a trade-off is a perfect fit to the Redis persistence strategy.

Single-threaded design of Redis brings up another interesting issue. Scaling of a single instance of Redis is not limited just by the computer it runs on, which would amount to RAM size, but is further limited by the single core or hardware thread it will utilize. So if your load exceeds the single hardware thread or size of available RAM, you need a second instance of Redis. This requires some approach to clustering. Effective clustering is all about figuring out the right partitioning scheme. Ideally, you need to split load into identically-sized partitions. If load becomes biased towards one or more partitions you will have a bottleneck. Such a bottleneck will limit your system’s scalability or in other words will make your load-to-cost function non-linear. It is important that partitions are as independent as possible. If an operation spans a group of partitions its scalability will be limited by the number of such groups in the cluster. In extreme cases of spanning all partitions in a cluster, the scalability will be as good as running on a single partition. 

Another aspect is differentiating the load by reads and writes. If the Redis hardware thread capacity is exceeded by reads, it is easy to scale out by putting additional read-only replicas of the same data. It is much harder to scale out writes by replication where trading off consistency is often required. Redis again takes simple a approach with its master-slave replication. A writable master replica asynchronously updates one or more read-only slave replicas. Right after replying to a write operation the master notifies all replicas. No acknowledgment is required by the master. This means that there is short period of time when slave replicas may return old data. This provides with write-write consistency guarantees that are as strong as without replication. However, the write-read consistency guarantees are now less strong or “eventual”. It is important to note that replication not only helps scalability of reads, but also improves reliability when replicas are placed on different machines. For the Meshin application eventual write-read consistency is acceptable tradeoff for higher reliability.

With the Meshin application, we have a large number of user indexes, each of roughly equal size. Meshin keeps a sliding window index of email, Twitter, Facebook etc. messages to maintain predictable maximum size of index. This provides a great opportunity for partitioning. If a partition handles roughly an equal number of users, we have balanced storage and load requirements and at the same time made the most frequent operations directed at only one partition. One approach is choosing between N partitions is hashing a unique user identifier, for example an email address, such that the hash space is in the [0..N) range. If the quality of the hash function is sufficient, we will get a well-balanced distribution.

hash-fixed

Now we can place our Redis instances on M = N / K computer nodes in the cluster. K is number of Redis instances that we can comfortably run on single node. This number is limited by the number of cores or hardware threads. With a large number of hardware threads (8 and up) it makes sense to map each Redis instance to one thread. RAM is then equally divided between these instances. Now, knowing each user’s maximum required RAM, we can calculate how many users our cluster can accommodate. One thing to keep in mind is that if you use Redis snapshotting for persistence — which employs the OS copy-on-write mechanism — it may require as much as twice amount the RAM dedicated to a partition to take a snapshot.

It is important to note that in the case of the Meshin application with regard to specific RAM and thread count per computer the load is “memory-bound”. Meaning that a partition having a full memory pool sill won’t saturate the hardware thread with Redis. It is important to note that for some applications the load may become “thread-bound”. In this case the hardware thread is being saturated before the memory pool is maxed out. Interestingly, with the “memory-bound” case we have underutilized CPUs and with the “thread-bound” case we may have underutilized RAM.

Great so far, but what if we keep on adding users and they start exceeding the capacity of a partition? In the “memory-bound” case they exceed the memory pool allocated for a partition and in the “thread-bound” case they fully saturate the corresponding hardware thread. In our case, the number of partitions is fixed. Also, since partitions are equally populated with same size users they all about to overflow. As we add new nodes to the cluster, we may try to move some partitions from existing ones and rebalance such that our nodes have an equal number of partitions. There are two problems with this approach. In the “thread-bound” case this won’t help at all — we can’t really utilize new cores, since partitions are single threaded! In “memory-bound” case the machines will have fewer number of Redis instances than number of hardware threads. This leaves those threads idle and CPUs even more underutilized! One way to fix this problem is to overprovision partitions initially, so that K is a multiple of the number of hardware threads. The multiplier is essentially the number of times we want to grow our cluster by. Other approach is to allow changing number of partitions.

At Meshin we took approach where we are able to add partitions over time. This is accomplished by changing the way initial partitioning is done. Instead of limiting the hash function space to (0..N] we allow it to generate 64-bit unsigned integers. These hashes are mapped to a ring and wrapped from 2^64-1 to 0. We compared several hash functions homogeneity performance for a large set of email addresses as unique user identifiers. The best homogeneity was achieved by the SHA1 hash reduced to 64 bits by XOR-ing its parts.

hash-ring

Each partition is placed on the ring with equal distances in between. When a user identifier is hashed we get its position on the ring and thus can find the closest partition where user’s index resides by going clockwise.

hash-ring-distance

When new partitions are added they are placed in between existing partitions to maintain equal distances. We also have the flexibility to add partitions in multiple points on the ring if partitions are of variable size to enable more users’ indexes allocated in larger partitions. Now, to lookup an existing user’s index you may hit a newly added partition and miss the old partition where the index was originally allocated. One way to address this problem is to move the corresponding user indexes to new partitions or rebalance the whole hash ring. We decided to employ simple routing instead. If a partition doesn’t have the user index, it proceeds to the next partition in clockwise direction.

hash-ring-old-forward

Such routing is fairly scalable. It does not have a single point of contention since each partition participates in routing. We also use a caching layer on the application server to remember partition addresses for recent user identities. As with lookups, it is also possible to route new user index allocations. When a partition is full it will simply forward to next partition clockwise until a partition available for allocation is found. If new partitions’ points are added in between old ones this will keep routing paths short. 

hash-ring-full-forward

To add reliability and further scale reads we replicate each partition to a different node in the cluster. Replicas are placed so that each node has an equal number of masters and slaves. All writes are directed at the master replica, while reads are balanced between master and slave replicas. Balancing uses stickiness by user identity to keep CPU caches warm. For example, replicating N partitions twice, K partitions per node, 2 * K replicas per node and M nodes in cluster.

cluster

Partitioned replicas are taking snapshots asynchronously. We place the snapshots on redundant networked storage. All snapshot traffic goes through a dedicated network. Taking snapshots requires additional memory to be allocated by Redis. The amount of this memory depends on write activity on a partition and I/O latency of snapshot operation. To minimize snapshotting overhead on a specific node in the cluster we run only one snapshotting operation at the time. A dedicated agent calls the SAVE command for each partition on a node in a round-robin fashion. The SAVE command initiates snapshotting and returns when it is completed. The period of full cycle of snapshotting defines the maximum time slice of changes that could be lost if cluster a node fails.

Cluster configuration, including partition IP addresses and ports, positions on hash ring and health check status are stored in dedicated Redis instances. We call this dedicated instance the “directory”. The directory is replicated as well with the master doing the writes, which are quite rare. The directory reads are served by the master and one or more slave replicas in round-robin manner. When the application fails to access a certain partition replica, it marks it as offline in the directory. This means other instances will stop trying to access it since it’s been removed from the list of available replicas for a partition. If a failed replica was in slave role the cluster will remain fully functional, just with less redundancy for failed partition. We call this partial outage. In turn, if the replica was in master role, the partition will switch to read-only outage mode. Read-only outage mode still allows the partition to handle read queries. If all replicas in a partition have failed cluster is in full outage. Full outage is a rare event since it requires two nodes to fail simultaneously. A special agent monitors partitions for outages by polling the directory. If there is master-less partition it will elect one of the slave replicas to become master, reconfigure the rest of the replicas and adjust the directory to clear the read-only outage. While we attempt to recover from read-only outages automatically, partial outages and full outages must be addressed manually. At some point in scaling this will not work and we realize the need for a fully-automated recovery system down the road.

Currently the cluster runs on 14 nodes each with 96GB of RAM — this amounts a total of 1.3TB. The nodes have 16 hardware threads each. We deployed 160 partitions each replicated twice. Separate computers handle the directory and networked storage for snapshotting.

菊子曰 本文用菊子曰发布