
























脑裂定义: 在分布式系统中,因网络分区导致集群被分割成多个子集,每个子集都认为自己是唯一正确的集群,从而出现多个"大脑"同时决策的问题。
CAP定理体现:
网络分区(P)发生时:
- 选择一致性(C):停止服务,等待网络恢复
- 选择可用性(A):继续服务,可能产生数据不一致
正常状态:
Primary NameNode (Active) ←→ Standby NameNode (Standby)
网络分区后:
分区A: Primary NameNode (认为自己是Active)
分区B: Standby NameNode (升级为Active)
结果: 两个Active NameNode同时提供服务
正常状态:
ResourceManager1 (Active) ←→ ResourceManager2 (Standby)
脑裂状态:
分区A: RM1继续调度资源,接受新任务
分区B: RM2升级为Active,也开始调度资源
结果: 集群资源被重复分配,任务冲突
常见原因:
节点A → 心跳超时 → 节点B (认为A已死亡)
节点B → 心跳超时 → 节点A (认为B已死亡)
实际情况: 两个节点都正常,只是网络不通
超时检测的局限性:
NameNode脑裂 → DataNode注册混乱 → 副本状态不一致
ResourceManager脑裂 → NodeManager重复注册 → 资源分配冲突
Hadoop 3.2.1的QJM改进:
新增特性:
- Observer NameNode支持:只读副本分担读请求
- Router-based Federation:解决单NameNode瓶颈
- Consistent Reads:确保Observer读取一致性
Observer模式架构:
Active NameNode ←→ JournalNode1
Standby NameNode ←→ JournalNode2
Observer NameNode ←→ JournalNode3 (只读同步)
防脑裂增强:
xml
<!-- hdfs-site.xml for Hadoop 3.2.1 -->
<configuration>
<!-- Observer NameNode配置 -->
<property>
<name>dfs.namenode.state.context.enabled</name>
<value>true</value>
</property>
<!-- 增强的fencing超时 -->
<property>
<name>dfs.ha.fencing.ssh.connect-timeout</name>
<value>30000</value>
</property>
<!-- 新增的健康检查 -->
<property>
<name>dfs.namenode.lifeline.rpc-address.EXAMPLENAMESERVICE.nn1</name>
<value>nn1-host:8020</value>
</property>
</configuration>
Hadoop 3.2.1的ZKFC增强:
java
// 新增优雅降级机制
public class DFSZKFailoverController extends ZKFailoverController {
// 增强的健康检查
@Override
protected boolean isHealthy() {
// 检查NameNode RPC响应
// 检查JournalNode连接状态
// 检查内存使用情况
return super.isHealthy() && checkJournalNodeConnectivity();
}
// 优雅的状态转换
@Override
protected void gracefulFailover() {
// 等待正在进行的操作完成
// 清理临时状态
// 通知客户端连接切换
}
}
Shell命令增强:
bash
# Hadoop 3.2.1支持的新fencing方法
# 1. 网络隔离fencing
dfs.ha.fencing.methods=shell(/usr/local/bin/network-fence.sh)
# network-fence.sh示例
#!/bin/bash
TARGET_HOST=$1
# 使用iptables进行网络隔离
iptables -A INPUT -s $TARGET_HOST -j DROP
iptables -A OUTPUT -d $TARGET_HOST -j DROP
# 2. 进程优雅关闭
dfs.ha.fencing.methods=shell(/usr/local/bin/graceful-fence.sh)
# graceful-fence.sh示例
#!/bin/bash
TARGET_HOST=$1
# 发送SIGTERM信号优雅关闭
ssh $TARGET_HOST "kill -TERM \$(cat /var/run/hadoop/namenode.pid)"
sleep 10
# 如果仍未关闭,强制kill
ssh $TARGET_HOST "kill -9 \$(cat /var/run/hadoop/namenode.pid)"
yaml
# 高可用配置建议
JournalNode部署:
- 奇数个节点: 3或5个
- 跨机架部署: 避免单点故障
- 独立磁盘: SSD提升性能
ZooKeeper集群:
- 奇数个节点: 3或5个
- 独立部署: 不与Hadoop节点混部
- 网络隔离: 独立网络段
xml
<!-- hdfs-site.xml -->
<configuration>
<!-- QJM配置 -->
<property>
<name>dfs.nameservices</name>
<value>mycluster</value>
</property>
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2</value>
</property>
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value>
</property>
<!-- 自动故障转移 -->
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
<!-- Fencing配置 -->
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/home/hadoop/.ssh/id_rsa</value>
</property>
<!-- 超时设置 -->
<property>
<name>ha.zookeeper.session-timeout.ms</name>
<value>5000</value>
</property>
</configuration>
bash
# Hadoop 3.2.1新增监控指标
namenode_ha_state # NameNode HA状态
namenode_observer_sync_lag # Observer同步延迟(新增)
journalnode_sync_lag # JournalNode同步延迟
zkfc_election_count # ZKFC选举次数
fencing_success_rate # Fencing成功率
namenode_lifeline_status # NameNode生命线状态(新增)
# JMX监控端点(Hadoop 3.2.1)
curl "http://namenode:9870/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem" | jq '.beans[0].HAState'
# Observer状态监控
curl "http://observer:9870/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem" | jq '.beans[0].ObserverReadLag'
# 增强的健康检查
curl "http://namenode:9870/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus" | jq '.beans[0].State'
# 告警规则升级
alerting_rules:
- alert: HDFSSplitBrain
expr: count(namenode_ha_state == "active") > 1
for: 30s
labels:
severity: critical
annotations:
summary: "HDFS脑裂检测:多个Active NameNode"
- alert: ObserverSyncLag
expr: namenode_observer_sync_lag > 10000
for: 2m
labels:
severity: warning
annotations:
summary: "Observer同步延迟过高"
- alert: JournalNodeQuorum
expr: count(journalnode_up) < 2
for: 1m
labels:
severity: critical
annotations:
summary: "JournalNode法定人数不足"
案例:机房网络中断导致的脑裂
时间轴:
T0: NameNode1(Active)在机房A,NameNode2(Standby)在机房B
T1: 机房间网络中断
T2: NameNode2检测不到NameNode1心跳,尝试升级为Active
T3: 由于JournalNode分布在两个机房,NameNode2无法获得多数确认
T4: 系统进入只读状态,等待网络恢复
正确结果: QJM机制阻止了脑裂,保证数据一致性
bash
# 1. 检查NameNode状态(包含Observer)
hdfs haadmin -getAllServiceState
# 输出示例:
# nn1:8020 active
# nn2:8020 standby
# nn3:8020 observer
# 2. 新增的健康检查命令
hdfs haadmin -checkHealth nn1
hdfs haadmin -checkHealth nn2
# 3. 检查Observer同步状态
hdfs dfsadmin -getDatanodeInfo nn3:8020
# 4. 详细的集群状态检查
hdfs dfsadmin -report -live -dead -entering -decommissioning
# 5. JournalNode状态检查(增强版)
for jn in jn1 jn2 jn3; do
echo "=== $jn ==="
curl -s "http://$jn:8480/jmx?qry=Hadoop:service=JournalNode,name=Journal-mycluster" | \
jq '.beans[0] | {CurrentLagTxns, LastWrittenTxId}'
done
# 6. ZKFC状态检查
systemctl status hadoop-zkfc@nn1
systemctl status hadoop-zkfc@nn2
# 7. ZooKeeper锁信息检查
zkCli.sh -server zk1:2181 <<< "ls /hadoop-ha/mycluster"
zkCli.sh -server zk1:2181 <<< "get /hadoop-ha/mycluster/ActiveStandbyElectorLock"
# 8. 网络连通性测试
for node in nn1 nn2 jn1 jn2 jn3; do
echo "Testing connectivity to $node"
timeout 5 telnet $node 8020
timeout 5 telnet $node 8485 # JournalNode端口
done
bash
# 1. 确认当前集群状态
hdfs haadmin -getAllServiceState
# 2. 检查Observer状态和同步情况
curl "http://observer:9870/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem"
# 3. 如果出现双Active,优雅降级
# 先尝试优雅切换
hdfs haadmin -transitionToStandby nn2
# 如果失败,使用强制模式
hdfs haadmin -transitionToStandby nn2 --forcemanual
# 4. 清理ZooKeeper状态(如需要)
zkCli.sh -server zk1:2181
deleteall /hadoop-ha/mycluster/ActiveStandbyElectorLock
# 5. 重启ZKFC服务
systemctl stop hadoop-zkfc@nn1
systemctl stop hadoop-zkfc@nn2
systemctl start hadoop-zkfc@nn1
systemctl start hadoop-zkfc@nn2
# 6. 验证JournalNode一致性
hdfs namenode -initializeSharedEdits -nonInteractive -force
# 7. 重启Observer节点(如果有同步问题)
systemctl stop hadoop-namenode@observer
systemctl start hadoop-namenode@observer
# 8. 数据一致性检查
hdfs fsck / -files -blocks -locations
# 9. 验证自动故障转移功能
hdfs haadmin -failover nn1 nn2
sleep 30
hdfs haadmin -failover nn2 nn1
# 10. 最终状态确认
hdfs haadmin -getAllServiceState
hdfs dfsadmin -report
# 11. 检查Observer读取功能
hdfs dfs -ls hdfs://mycluster/ # 应该能从Observer读取
# 紧急恢复脚本(Hadoop 3.2.1)
#!/bin/bash
# emergency-recovery.sh
echo "=== Hadoop 3.2.1 Emergency Recovery ==="
# 停止所有NameNode
for nn in nn1 nn2 nn3; do
ssh $nn "systemctl stop hadoop-namenode"
ssh $nn "systemctl stop hadoop-zkfc"
done
# 清理ZooKeeper状态
zkCli.sh -server zk1:2181 <<< "deleteall /hadoop-ha"
# 选择一个NameNode作为主节点重启
ssh nn1 "systemctl start hadoop-namenode"
sleep 10
# 格式化ZooKeeper HA状态
hdfs zkfc -formatZK -force
# 启动ZKFC
ssh nn1 "systemctl start hadoop-zkfc"
# 启动Standby NameNode
ssh nn2 "systemctl start hadoop-namenode"
ssh nn2 "systemctl start hadoop-zkfc"
# 启动Observer
ssh nn3 "systemctl start hadoop-namenode"
echo "Recovery completed. Checking status..."
hdfs haadmin -getAllServiceState
通过以上综合防护措施,可以有效避免Hadoop集群脑裂问题,确保生产环境的高可用性和数据一致性。
此内容由惯性聚合(RSS阅读器)自动聚合整理,仅供阅读参考。 原文来自 — 版权归原作者所有。