We have an HDP cluster, version 2.6.5.
The cluster includes two NameNodes (one active and one standby)
and 65 DataNode machines.
The standby NameNode does not start, and its log shows the following:
2021-01-01 15:19:43,269 ERROR namenode.NameNode (NameNode.java:main(1783)) - Failed to start namenode.
java.io.IOException: There appears to be a gap in the edit log. We expected txid 90247527115, but got txid 90247903412.
at org.apache.hadoop.hdfs.server.namenode.MetaRecoveryContext.editLogLoaderPrompt(MetaRecoveryContext.java:94)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:215)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:143)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:838)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:693)
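To get a sense of how large the gap is, the two transaction IDs from the exception can simply be subtracted. This is only a small arithmetic sketch; the variable names are ours and not part of any Hadoop tool.

```shell
# Transaction IDs copied from the IOException message above.
expected_txid=90247527115
got_txid=90247903412

# Number of edit-log transactions the standby could not find
# (txids expected_txid .. got_txid - 1).
missing=$((got_txid - expected_txid))
echo "missing transactions: ${missing}"
```

So the standby is missing a few hundred thousand transactions between its last loaded state and the first edit segment it can read.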
Ambari also shows the standby NameNode as down.
For now the active NameNode is up and only the standby NameNode is down. We believe the root cause is damaged or corrupted NameNode metadata on the standby.
We see two possible solutions, A or B:
A)
Run recovery mode on the standby NameNode:
su hdfs
hdfs namenode -recover
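Since recovery mode can discard edits it cannot read, it may be worth archiving the standby's metadata directory before trying option A. The following is a minimal sketch, assuming `dfs.namenode.name.dir` holds a single plain local path (no `file://` prefix, no comma-separated list); the helper name `backup_nn_dir` and the `/tmp` destination are hypothetical.

```shell
# Hypothetical helper: archive a NameNode metadata directory.
# $1 = metadata directory, $2 = path of the .tar.gz archive to create.
backup_nn_dir() {
  src="$1"
  dest="$2"
  # Archive relative to the parent so the tarball holds "<dirname>/current/...".
  tar -czf "$dest" -C "$(dirname "$src")" "$(basename "$src")"
}

# Usage on the standby, as the hdfs user (assumes a single configured directory):
#   name_dir=$(hdfs getconf -confKey dfs.namenode.name.dir)
#   backup_nn_dir "$name_dir" /tmp/nn-meta-backup.tar.gz
```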
B)
Put the active NameNode in safe mode:
su hdfs
hdfs dfsadmin -safemode enter
Run a saveNamespace operation on the active NameNode:
su hdfs
hdfs dfsadmin -saveNamespace
Leave safe mode:
su hdfs
hdfs dfsadmin -safemode leave
Log in to the standby NameNode.
Run the command below on the standby NameNode to fetch the latest fsimage saved in the steps above:
su hdfs
hdfs namenode -bootstrapStandby -force
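The steps of option B can be sketched as two small shell functions, one per host. This is only an illustration under our own assumptions: the `run` wrapper, the `DRY_RUN` variable and the function names are hypothetical, and the script is meant to be run as the hdfs user. With `DRY_RUN=1` the commands are only printed, which lets us review the sequence before touching the cluster.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# On the active NameNode: checkpoint the namespace into a fresh fsimage.
# The chain stops at the first failure; if saveNamespace fails, the
# cluster stays in safe mode until someone leaves it manually.
checkpoint_active() {
  run hdfs dfsadmin -safemode enter &&
  run hdfs dfsadmin -saveNamespace &&
  run hdfs dfsadmin -safemode leave
}

# On the standby NameNode: replace the local metadata with the fsimage
# from the active NameNode.
resync_standby() {
  run hdfs namenode -bootstrapStandby -force
}

# Usage:
#   on the active host:   checkpoint_active
#   on the standby host:  resync_standby
```

After `resync_standby` completes, the standby NameNode still has to be started, for example from Ambari.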
Which of the two solutions is preferred for our problem?