Introduction to Distributed Databases: Taking the Domestic Database TDSQL as an Example

I. Introduction

Today, I'll share some enterprise-level internet technologies with you.

What I'm going to introduce is a distributed database. I'll try to explain its concept, products, and usage in simple terms, and I'll also provide links to learning materials for download at the end.

Distributed databases are arguably the most important type of database, as almost all the large-scale internet services you know run on them.

Normally, when we develop, we work with single-machine databases (also known as centralized databases), where the database runs on just one server.

(Image description: A single database server on the left supporting the entire application.)

A distributed database refers to a database system spread across multiple servers.

(Image description: A single database distributed across multiple servers, collectively supporting the application.)

At the macro level, important industries of the national economy such as finance, telecommunications, aviation, logistics, e-commerce, etc., cannot do without distributed databases.

Without them, it's hard to imagine what life would be like. Ticketing websites such as 12306, for example, wouldn't be able to provide services.

At the individual level, when you grow from a junior developer to an architect of large-scale projects, you will inevitably encounter distributed databases.

When designing architecture, unless you are only using one server, you can't avoid considering how to split and store data across multiple servers.

In short, after a product grows, distributed databases are unavoidable. For individuals, this also means progress in career and capabilities.

II. Advantages of Distributed Databases

Why are distributed databases so important? Because they have some advantages that single-machine databases cannot match.

(1) More Secure. Distributed databases contain multiple nodes. Whether the nodes are placed in the same data center or in different ones, the data is much safer than in a single-machine database.

(2) High Availability. If a single database node fails, the other nodes can still operate normally, avoiding a single point of failure.

(3) Better Performance. For big data and large computational tasks, distributed databases can process in parallel, significantly reducing processing time.

(4) Better Experience. When the database is distributed across multiple data centers, each user can be served by the nearest database node, which improves response speed.

III. Difficulties of Distributed Databases

Although there are these advantages, the use of distributed databases is not widespread, and small companies generally do not use them. Why is that?

The main reason is that distributed databases have two major issues that hinder their widespread adoption: high cost and complexity.

Distributed databases follow a "geo-redundant, multi-active" approach: they keep additional redundant copies to protect data, so the cost is naturally high.

Their complexity is mainly reflected in the following points.

(1) Consistency issues. How to ensure data consistency across different nodes? What if the data on the nodes is inconsistent?

(2) Communication issues. How to ensure reliable communication between nodes? What if there is communication delay or failure?

(3) Partitioning issues. If a large data table is split and stored on different nodes, the partitioning strategy and data migration between nodes can be very complex.

(4) Optimization issues. If data from multiple nodes needs to be combined, queries must be optimized to improve performance.
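To make the partitioning issue in (3) concrete, here is a minimal Python sketch. It assumes the simplest possible strategy, routing each row by `key % shard_count`; the function names are hypothetical, and real products use more elaborate schemes. It counts how many rows would have to move between nodes when one shard is added.

```python
def shard_for(key: int, shard_count: int) -> int:
    """Route a row to a shard with naive modulo partitioning (an assumed strategy)."""
    return key % shard_count


def rows_to_migrate(keys, old_count: int, new_count: int) -> int:
    """Count the rows whose shard changes when the number of shards changes."""
    return sum(1 for k in keys if shard_for(k, old_count) != shard_for(k, new_count))


keys = range(1000)                    # 1,000 hypothetical row IDs
moved = rows_to_migrate(keys, 4, 5)   # grow the cluster from 4 shards to 5
print(f"{moved} of {len(keys)} rows change shards")
```

With this naive routing, most rows end up on a different shard after the change, which is one reason partitioning strategy and data migration need careful design.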

IV. CAP Theorem

As everyone may know, there is a famous CAP theorem that talks about the inherent limitations of distributed systems (including distributed databases).

Distributed systems have three major goals: data consistency (Consistency), high availability (Availability), and partition tolerance (Partition tolerance), which means the system keeps working when the network between nodes is split.

The CAP theorem tells us that these three goals cannot all be satisfied simultaneously; at most two can be achieved at the same time. When a network partition occurs, you must either sacrifice high availability to keep (strong) consistency, or sacrifice (strong) consistency to keep high availability.

Therefore, no distributed database can be perfect; it can only be a trade-off and balance of the three goals.
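The trade-off can be pictured with a toy model. The sketch below is only an illustration under my own assumptions: two in-memory replicas, a `partitioned` flag standing in for a broken network link, and a hypothetical `mode` switch that chooses between consistency ("CP") and availability ("AP"). It is not how any particular database implements this.

```python
class Replica:
    """One node holding a single value (a stand-in for a full database)."""

    def __init__(self):
        self.value = None


class TwoNodeStore:
    """Two replicas that may be cut off from each other by a network partition."""

    def __init__(self, mode: str):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a, self.b = Replica(), Replica()
        self.partitioned = False

    def write(self, value) -> bool:
        """Write through node A; return True if the write was accepted."""
        if not self.partitioned:
            self.a.value = self.b.value = value  # normal case: both replicas updated
            return True
        if self.mode == "CP":
            return False                         # refuse the write: consistent but unavailable
        self.a.value = value                     # AP: accept locally, replicas now diverge
        return True

    def consistent(self) -> bool:
        return self.a.value == self.b.value


for mode in ("CP", "AP"):
    store = TwoNodeStore(mode)
    store.write("v1")
    store.partitioned = True                     # the link between the nodes fails
    accepted = store.write("v2")
    print(mode, "accepted:", accepted, "consistent:", store.consistent())
```

In "CP" mode the second write is rejected and the replicas stay identical; in "AP" mode the write is accepted and the replicas disagree until the partition heals.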

V. Products of Distributed Databases

The history of distributed databases is very long, with at least hundreds of products in the market, both open-source and closed-source.

Almost all distributed databases can be used either on a single machine (i.e., as a single-machine database) or in a multi-machine, distributed manner. Therefore, many of the single-machine databases we are familiar with are actually distributed databases.

Among open-source databases, the best known are Postgres and MySQL (relational databases), MongoDB (a non-relational database), and CockroachDB (a distributed SQL database).

In commercial databases, the most famous is Oracle. It is the de facto standard for distributed databases, and large enterprises generally choose to use it.

VI. Domestic Database TDSQL

Next, I will use the domestic database TDSQL as an example to introduce the functions and usage of distributed databases.

TDSQL is a Tencent product and a leading distributed database in China. Almost all of Tencent's key businesses, such as WeChat, QQ, Tencent Music, Tencent Games, and more, run on it, so it has been tested under high-intensity, massive real-world workloads.

Many large companies outside Tencent also use it, such as Xiaohongshu, Pinduoduo, Bilibili, Haier, Shenzhen Metro, and more.

It is built entirely to financial-grade standards, emphasizing security, high availability, and high concurrency, and currently has over 500,000 customers. In the domestic financial industry, it serves 7 of the top 10 banks and has helped over 30 financial institutions transform their core systems.

TDSQL is a completely domestic database that particularly emphasizes Oracle compatibility. Enterprises can migrate their existing Oracle databases smoothly, and its cost is much lower than Oracle's. For domestic enterprises concerned about localization and supply chain security, it is a very good alternative.

Its product capabilities and independent R&D have passed national certification ("Announcement of Security and Reliability Evaluation Results of China Information Security Evaluation Center (No. 1, 2023)"), which is also one of the important considerations for technology selection by state-owned enterprises.

Finally, TDSQL is a publicly available service of Tencent Cloud, which anyone can use. With just a few clicks on the web page, it can be enabled, making it very easy to get started.

VII. Functions of Distributed Databases

Let's take a look at the functions of distributed databases through TDSQL.

(1) Strong Synchronous Replication. Distributed databases often adopt a master-slave architecture, where a cluster has one master node (master) and several slave nodes (slave). The system supports strong synchronous replication between nodes to ensure data consistency.

Specifically, when writing data, the master node waits for a slave node to return an operation success message before returning the result to the user, ensuring that the data on the master node and the slave nodes is completely consistent.
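The difference between strong synchronous and asynchronous replication can be sketched in a few lines of Python. This is a toy model, not TDSQL's implementation: the `Node` and `Master` classes, the `strong_sync` flag, and `flush()` are all hypothetical names, and for simplicity the master here waits for every slave to acknowledge.

```python
class Node:
    """A database node that stores an ordered log of written records."""

    def __init__(self, name: str):
        self.name = name
        self.log = []

    def apply(self, record: str) -> bool:
        self.log.append(record)
        return True  # stands in for the "operation success" message


class Master(Node):
    def __init__(self, name: str, slaves: list, strong_sync: bool = True):
        super().__init__(name)
        self.slaves = slaves
        self.strong_sync = strong_sync
        self.pending = []  # records not yet shipped to slaves (async mode only)

    def write(self, record: str) -> str:
        self.apply(record)
        if self.strong_sync:
            # Wait for every slave to acknowledge before answering the client.
            for slave in self.slaves:
                if not slave.apply(record):
                    raise RuntimeError(f"{slave.name} did not acknowledge")
        else:
            self.pending.append(record)  # reply first, replicate later
        return "ok"

    def flush(self) -> None:
        """Ship pending records to the slaves (asynchronous replication catching up)."""
        for record in self.pending:
            for slave in self.slaves:
                slave.apply(record)
        self.pending.clear()


sync_slave = Node("slave-1")
sync_master = Master("master", [sync_slave], strong_sync=True)
sync_master.write("row-1")
print(sync_slave.log == sync_master.log)    # True: the slave already has the row

async_slave = Node("slave-2")
async_master = Master("master", [async_slave], strong_sync=False)
async_master.write("row-1")
print(async_slave.log == async_master.log)  # False: the slave lags until flush() runs
```

The strong synchronous write costs an extra round trip per write, but a master failure right after the reply cannot lose acknowledged data.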

(2) Transaction Consistency. The system provides each transaction with a globally unique numeric sequence, and each node can query the execution status of transactions, ensuring transaction consistency in a distributed environment.

(3) Automatic Sharding. Large data tables in distributed databases often need to be split and stored on different nodes. TDSQL supports automatic horizontal sharding (table splitting), evenly writing data to different nodes and automatically aggregating and returning results during queries.

For users, table splitting is transparent and can be completely ignored. The business side sees a logically complete table without needing to perceive the details of the backend sharding.
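Here is a rough picture of what "transparent" means. In this hypothetical Python sketch, `ShardedTable` plays the role of the logical table: it routes each insert to a shard by `id % shard_count` (an assumed rule, not necessarily the one TDSQL uses) and gathers results from all shards on a query.

```python
class ShardedTable:
    """A logical table whose rows are spread over several in-memory 'nodes'."""

    def __init__(self, shard_count: int):
        self.shards = [[] for _ in range(shard_count)]

    def insert(self, row: dict) -> None:
        # Route the row by its shard key (here the "id" column).
        index = row["id"] % len(self.shards)
        self.shards[index].append(row)

    def select(self, predicate=lambda row: True) -> list:
        # Ask every shard, then merge the partial results into one answer.
        matches = [row for shard in self.shards for row in shard if predicate(row)]
        return sorted(matches, key=lambda row: row["id"])


orders = ShardedTable(shard_count=3)
for i in range(1, 11):
    orders.insert({"id": i, "amount": i * 10})

print([len(shard) for shard in orders.shards])   # rows per shard: [3, 4, 3]
print([row["id"] for row in orders.select(lambda r: r["amount"] >= 80)])  # [8, 9, 10]
```

The caller only ever touches `insert` and `select`; which shard holds which row stays an internal detail.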

(4) Highly Scalable. When database performance or capacity is insufficient, TDSQL can scale without downtime. Just click in the console and the upgrade completes automatically. Data migration, data balancing, and route switching are all automatic.

(5) Highly Flexible. Users can change table structures online; when encountering certain types of failures, the system can automatically recover; all nodes, whether they are master nodes or slave nodes, can perform read and write operations.

(6) Product Management Capabilities. TDSQL is developer-friendly, providing a wealth of monitoring tools, real-time monitoring and alerts, and daily delivery of detailed health check reports.

Tencent Cloud has a dedicated cloud service, DBbrain, which uses machine learning, big data, and an expert-experience engine to provide performance, security, and management functions for users' databases.

For example, it conducts comprehensive diagnostics and optimizations for SQL, identifies performance bottlenecks, and makes SQL, transactions, and business workflows fully observable. It visually presents exceptions such as deadlocks for easy understanding.

It has largely replaced manual DBA work, transforming traditional manual operations into intelligent services.

TDSQL also has an AI Q&A system (see the image below). It is built on a knowledge base and a trained small model, and it responds to user queries quickly and accurately. It is the equivalent of an intelligent customer service agent that provides professional, personalized answers.

VIII. TDSQL Usage

Below, I will demonstrate the usage of TDSQL. It's very simple: once you enable it on the web page, you can use the distributed database.

Step 1, on the TDSQL official website, enter the product console.

Step 2, on the console page, select the region where the database server is located (it should be the same region as your cloud server), select the database engine, and then click the "New" button.

Currently, TDSQL has three engines: MySQL, the self-developed TDStore, and PostgreSQL. Regardless of which engine, they all offer the same disaster recovery capabilities and high availability, and are compatible with Oracle.

Step 3, a configuration page will pop up, allowing you to select the database configuration. Different configurations have different prices.

Among them, there is an option asking if you want to enable "Strong Synchronization".

Strong Synchronization ensures data consistency between the primary and secondary nodes. If your application does not require strong consistency and prioritizes fast response times, you can choose "Asynchronous".

Step 4, after completing the configuration, you will proceed to the payment step. Once the payment is made, the database will be activated, and your distributed database will be online.

To use it, you first need to connect to the database, either over the internal network or over the external network. For details, you can refer to the documentation. Note that if an external network connection is enabled, the database is exposed to the public internet and anyone can send requests to it, so the security risks must be carefully considered.

After connecting to the database, you can execute SQL statements. At this stage, it's no different from using a regular database. The SQL of a distributed database is basically the same as that of a single-machine database.

IX. Best Practices for TDSQL

There are some best practices for distributed databases, and three are listed below (using the MySQL engine as an example).

(1) How to import data into a distributed database

This is divided into two cases. The first case is importing an existing single-machine instance into a newly created distributed instance. The steps are as follows (for detailed commands, see the documentation).

Step 1, export the table structure and data of the single-machine database, obtaining two SQL files.

Step 2, open the table structure file and set the primary key and shard key for each table.

Step 3, upload the two modified SQL files to the cloud server and import them into the distributed database.

The second case is importing an existing distributed instance into another distributed instance. The steps are the same as above, except that the second step does not require specifying the primary key and shard key, because they already exist (for detailed commands, see the documentation).

(2) How to shard

Sharding is one of the core issues of distributed databases: How many data partitions to set up? How to distribute data across multiple partitions?

The number of shards depends on the estimated maximum concurrency of the entire database and the number of requests each shard can handle. It can be calculated using the following formula.

Read-write concurrent performance = ∑(Shard performance * Number of shards)

The performance of a single shard is primarily related to the CPU and memory specification of the instance. The higher the specification of a single shard and the more shards there are, the stronger the processing capability of the database system.

Besides performance, sharding also needs to consider capacity issues. Generally speaking, a single shard should store at least 50 million rows of data.
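As a worked example of this sizing logic, here is a small Python sketch. It assumes all shards have the same specification, so the formula reduces to "total performance = per-shard performance × number of shards". Every number and function name below is hypothetical and only there to show the arithmetic.

```python
import math


def shards_for_throughput(peak_qps: int, qps_per_shard: int) -> int:
    """Shards needed so that per-shard performance x shard count covers the peak."""
    return math.ceil(peak_qps / qps_per_shard)


def shards_for_capacity(total_rows: int, rows_per_shard: int) -> int:
    """Shards needed so that the expected row count fits."""
    return math.ceil(total_rows / rows_per_shard)


def shard_count(peak_qps: int, qps_per_shard: int,
                total_rows: int, rows_per_shard: int) -> int:
    """Take the larger of the two requirements."""
    return max(shards_for_throughput(peak_qps, qps_per_shard),
               shards_for_capacity(total_rows, rows_per_shard))


# Hypothetical workload: 40,000 requests/s at peak, 6,000 requests/s per shard,
# 600 million rows in total, 100 million rows per shard.
print(shard_count(40_000, 6_000, 600_000_000, 100_000_000))  # 7
```

Here throughput asks for 7 shards and capacity for 6, so the throughput requirement decides.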

(3) How to configure hardware

For distributed database hardware, the following are three recommended configurations.

A. Functional testing.

This scenario has no performance requirements; it is only used to verify the system. It is recommended to configure 2 nodes, with 2GB of memory + 25GB of hard disk per node.

B. Early stage of business development.

In this case, the data size is small and grows rapidly. It is recommended to configure 2 nodes, each with 16GB memory + 200GB hard drive.

C. Stable business development period.

In this case, the configuration should be based on the actual business situation. It can be configured with 4 nodes, with hardware for each node being: (current business peak * growth rate) / 4.
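The formula for case C is simple enough to check by hand, but here it is as a short Python sketch. I am assuming that "growth rate" is a multiplier (1.5 means planning for 50% growth) and that the same formula is applied to each resource separately; the function name and the numbers are hypothetical.

```python
def per_node_requirement(current_peak: float, growth_rate: float, nodes: int = 4) -> float:
    """(current business peak * growth rate) / number of nodes."""
    return current_peak * growth_rate / nodes


# Hypothetical example: 800 GB of data at the current peak, planning for 1.5x growth.
print(per_node_requirement(800, 1.5))  # 300.0 GB of disk per node
```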

X. Summary

Overall, modern distributed database products hide a lot of their complexity, providing users with an easy-to-use interface.

Generally speaking, it is not recommended to build a distributed database yourself. Even if you have dedicated database engineers and operations engineers, the cost will be very high. Using products from cloud service providers is a more economical and hassle-free choice.

Taking TDSQL as an example, it has two versions: the cluster version and the basic version. The former is multi-node and intended for enterprise use in production environments; the latter is single-node with lower costs and is specifically designed for personal use, but the functionality is the same, making it very suitable for individual developers to learn or try distributed databases.

(End)

Bonus Content

In this AI era, how can cloud services be used to assist in enterprise data management?

Below are three real-life cases from major domestic companies.

Case One: WeChat Read's "AI Ask Book". This feature lets AI answer readers' questions about the vast content of its books.

Case Two: Straits Bank Core System Upgrade. How does a provincial bank use TDSQL to upgrade its core system to a distributed database?

Case Three: Architecture Optimization of the Aurora Big Data Platform. Aurora (URORA) is a leading domestic developer service provider with nearly 100 PB of data, over a thousand nodes, and 400 million files. How should the architecture be optimized?

They come from internal materials written by Tencent Cloud, "AGI Era's Preferred Full-Stack Data Management Solution", which include tool guides, user case sharing, and much more.

You can now download it for free by scanning the QR code below with WeChat. If you are interested in enterprise-level development in real domestic environments, it's worth a look.
