
I'm trying to build a 3-node cluster (2 NameNodes (nn1, nn2) and 1 DataNode (dn1)). Using the NameNode web UI, I can see that nn1 is active and nn2 is standby. However, when I kill the active nn1, the standby nn2 does not become active. Please help me figure out what I am doing wrong or what needs to be modified.

nn1 /etc/hosts

127.0.0.1 localhost
192.168.10.153 nn1
192.168.10.154 dn1
192.168.10.155 nn2

nn2 /etc/hosts

127.0.0.1       localhost nn2
127.0.1.1       ubuntu

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

core-site.xml (nn1,nn2)

<configuration>
 <property>
  <name>fs.defaultFS</name>
  <value>hdfs://192.168.10.153:8020</value>
 </property>
 <property>
  <name>dfs.journalnode.edits.dir</name>
  <value>/usr/local/hadoop/hdfs/data/jn</value>
 </property>
 <property>
  <name>ha.zookeeper.quorum</name>
  <value>192.168.10.153:2181,192.168.10.155:2181,192.168.10.154:2181</value>
 </property>
</configuration>

hdfs-site.xml (nn1, nn2, dn1)

<configuration>
 <property>
  <name>dfs.replication</name>
  <value>1</value>
 </property>
 <property>
  <name>dfs.permissions</name>
  <value>false</value>
 </property>
 <property>
  <name>dfs.nameservices</name>
  <value>ha-cluster</value>
 </property>
 <property>
  <name>dfs.ha.namenodes.ha-cluster</name>
  <value>nn1,nn2</value>
 </property>
 <property>
  <name>dfs.namenode.rpc-address.ha-cluster.nn1</name>
  <value>192.168.10.153:9000</value>
 </property>
 <property>
  <name>dfs.namenode.rpc-address.ha-cluster.nn2</name>
  <value>192.168.10.155:9000</value>
 </property>
 <property>
  <name>dfs.namenode.http-address.ha-cluster.nn1</name>
  <value>192.168.10.153:50070</value>
 </property>
 <property>
  <name>dfs.namenode.http-address.ha-cluster.nn2</name>
  <value>192.168.10.155:50070</value>
 </property>
 <property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://192.168.10.153:8485;192.168.10.155:8485;192.168.10.154:8485/ha-cluster</value>
 </property>
 <property>
  <name>dfs.client.failover.proxy.provider.ha-cluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
 </property>
 <property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
 </property>
 <property>
  <name>ha.zookeeper.quorum</name>
  <value>192.168.10.153:2181,192.168.10.155:2181,192.168.10.154:2181</value>
 </property>
 <property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence</value>
 </property>
 <property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/ci/.ssh/id_rsa</value>
 </property>
</configuration>

Logs (ZKFC on nn1/nn2, NameNode on nn1/nn2) from stopping nn1, the active node: https://pastebin.com/bWvfnanQ

– Raj

2 Answers


You're specifying <IP>:<port> for fs.defaultFS in core-site.xml, but in an HA cluster clients must address the logical nameservice. So when your active NameNode shuts down, they don't know where to redirect.

Choose a logical name for the nameservice, for example "myCluster".

Then change hdfs-site.xml as well: dfs.namenode.http-address.[nameservice ID].[name node ID] is the fully qualified HTTP address for each NameNode to listen on, and the same nameservice ID must appear in the other per-NameNode keys.

In your case, you have to set:

core-site.xml

<property>
<name>fs.defaultFS</name>
<value>hdfs://myCluster</value>
</property>

hdfs-site.xml

 <property>
 <name>dfs.namenode.rpc-address.myCluster.nn1</name>
 <value>192.168.10.153:9000</value>
 </property>
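
The suffix on every per-NameNode key (dfs.namenode.rpc-address, dfs.namenode.http-address) and on the failover proxy provider has to carry the same nameservice ID. Assuming you adopt myCluster everywhere, that means, for example:

 <property>
 <name>dfs.client.failover.proxy.provider.myCluster</name>
 <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
 </property>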

Read the manual carefully: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
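
After re-keying the configs and restarting, you can sanity-check automatic failover from the shell with the stock Hadoop CLI (the -formatZK step is one-time and creates the parent znode that ZKFC needs in ZooKeeper):

    hdfs zkfc -formatZK                  # one-time: initialize the HA state znode in ZooKeeper
    hdfs haadmin -getServiceState nn1    # should report "active"
    hdfs haadmin -getServiceState nn2    # should report "standby"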

Hope this will help you.

  • Thanks. Now I get this error in the nn2 ZKFC: FATAL org.apache.hadoop.ha.ZKFailoverController: Unable to start failover controller. Parent znode does not exist. Run with -formatZK flag to initialize ZooKeeper. 2017-04-19 04:53:06,170 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected. 2017-04-19 04:53:06,175 INFO org.apache.zookeeper.ZooKeeper: Session: 0x15b859d1b370005 closed 2017-04-19 04:53:06,176 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down. I am unable to start ZKFC on nn2. Please help. – Raj Apr 19 '17 at 11:53
  • I have modified the cluster configuration based on this link: https://www.edureka.co/blog/how-to-set-up-hadoop-cluster-with-hdfs-high-availability/ I have formatted ZK and restarted the cluster. It is still not working for me; I am trying to resolve the issue asap. – Raj Apr 19 '17 at 12:16
  • ci@nn1:/usr/local/hadoop/sbin$ hdfs haadmin -getServiceState nn1 returns active, and hdfs haadmin -getServiceState nn2 also returns active, so both NameNodes are active. If I stop nn1 and start it again, ZKFC is not running; it is the same for nn2. Is that right, or am I missing something? – Raj Apr 19 '17 at 12:59
  • hdfs haadmin -getServiceState reports active for both nn1 and nn2. I have started the NN and JN on both the active and standby NameNodes, and ZKFC is running on both nodes. When I paused the Internet connectivity on nn1, nn2 did become active. However, when I resumed the connectivity and started nn1, both NameNodes became active, with ZKFC running on both nodes. Please guide. – Raj Apr 20 '17 at 05:53

You have to look at fencing for automatic failover; see the sketch after the links below.

https://stackoverflow.com/a/27272565/3496666

https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
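
With automatic failover, if sshfence cannot reach the old active's host (for example, the whole machine is down), the fence never succeeds and the standby stays standby. The QJM HA guide suggests listing a fallback fencing method; a minimal sketch against your existing hdfs-site.xml (fencing methods are newline-separated and tried in order):

 <property>
 <name>dfs.ha.fencing.methods</name>
 <value>sshfence
shell(/bin/true)</value>
 </property>

shell(/bin/true) always reports success, so failover can proceed even when the old active is unreachable; use it only if you understand the split-brain risk it accepts.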

– Kumar
  • Can you suggest a solution for this problem? I cannot get myself out of this issue, and I have already gone through those articles. – Raj Apr 19 '17 at 09:20
  • Look at this answer; it may solve your problem. I think it should be a fencing problem. http://stackoverflow.com/a/36002948/3496666 – Kumar Apr 20 '17 at 04:27
  • Read this to learn how automatic failover works: http://stackoverflow.com/a/33313804/3496666 – Kumar Apr 20 '17 at 04:28
  • Yes, I have made the changes based on your second comment; it is not working. I have now given full privileges to the user account on nn1, and it is still not working. – Raj Apr 20 '17 at 04:45
  • Same update as on the other answer: both NameNodes report active. After pausing the Internet connectivity on nn1, nn2 became active, but after resuming it and starting nn1, both became active again, with ZKFC running on both nodes. Please guide. – Raj Apr 20 '17 at 05:53
  • ZKFC should run on both NameNodes, but only one NN should be active and the other standby; there is no possibility of both NNs being active in an HA cluster. Can you open the NN web UI and check it once? – Kumar Apr 20 '17 at 06:01
  • Yes, I have checked. Still, both NNs are active. – Raj Apr 20 '17 at 06:39
  • As @Kaveen said, what have you set in the fs.defaultFS property? – Kumar Apr 20 '17 at 06:40
  • As per your configuration, set fs.defaultFS value to hdfs://ha-cluster – Kumar Apr 20 '17 at 06:44
  • I have changed the configuration as Kaveen instructed. Is there anything else you want me to try, based on my configuration? – Raj Apr 20 '17 at 07:00
  • Stop all services, delete the metadata and ZooKeeper data, format the NameNode, and start all services again (a sketch of this reset follows after these comments). – Kumar Apr 20 '17 at 07:08
  • Found the problem? – Kumar Apr 24 '17 at 05:50
  • No, I dropped it and started all over again; I'm trying on CentOS 7 now. Thanks for the help. – Raj Apr 24 '17 at 10:12
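
For reference, a minimal sketch of that full reset, assuming the stock sbin scripts are on the PATH and using the JournalNode directory from the question's configs (the NameNode metadata directory is not shown in the question, so that path below is a placeholder):

    stop-dfs.sh                                # stop NN, DN, JN on all nodes
    rm -rf /usr/local/hadoop/hdfs/data/jn/*    # clear JournalNode edits (from dfs.journalnode.edits.dir)
    rm -rf /path/to/namenode/dir/*             # placeholder: your dfs.namenode.name.dir, on both NNs
    hdfs namenode -format                      # on nn1: re-format the NameNode
    hdfs zkfc -formatZK                        # on nn1, once: re-create the HA znode in ZooKeeper
    hdfs namenode -bootstrapStandby            # on nn2: copy over the freshly formatted metadata
    start-dfs.sh                               # start everything again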