
I'm running hadoop in a single-machine, local-only setup, and I'm looking for a nice, painless way to debug mappers and reducers in eclipse. Eclipse has no problem running mapreduce tasks. However, when I go to debug, it gives me this error:

12/03/28 14:03:23 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).

Okay, so I do some research. Apparently, I should use eclipse's remote debugging facility, and add this to my hadoop-env.sh:

-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5000

I do that and I can step through my code in eclipse. The only problem is that, because of the "suspend=y", I can't use the "hadoop" command from the command line to do things like look at the job queue; it hangs, I imagine because it's waiting for a debugger to attach. Also, I can't run "hbase shell" when I'm in this mode, probably for the same reason.
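For reference, here is roughly how the flag sits in my hadoop-env.sh (appended to HADOOP_OPTS, which every JVM launched by bin/hadoop picks up; hence, I assume, the hangs):

# hadoop-env.sh: every hadoop command's JVM inherits this
export HADOOP_OPTS="$HADOOP_OPTS -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5000"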

So basically, if I want to flip back and forth between "debug mode" and "normal mode", I need to update hadoop-env.sh and restart my machine. Major pain. So I have a few questions:

  1. Is there an easier way to debug mapreduce jobs in eclipse?

  2. How come eclipse can run my mapreduce jobs just fine, but for debugging I need to use remote debugging?

  3. Is there a way to tell hadoop to use remote debugging for mapreduce jobs, but to operate in normal mode for all other tasks? (such as "hadoop queue" or "hbase shell").

  4. Is there an easier way to switch hadoop-env.sh configurations without rebooting my machine? hadoop-env.sh is not executable by default.

  5. This is a more general question : what exactly is happening when I run hadoop in local-only mode? Are there any processes on my machine that are "always on" and executing hadoop jobs? Or does hadoop only do things when I run the "hadoop" command from the command line? What is eclipse doing when I run a mapreduce job from eclipse? I had to reference hadoop-core in my pom.xml in order to make my project work. Is eclipse submitting jobs to my installed hadoop instance, or is it somehow running it all from the hadoop-core-1.0.0.jar in my maven cache?

Here is my Main class :

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Main {
    public static void main(String[] args) throws Exception {
        Job job = new Job();
        job.setJarByClass(Main.class);
        job.setJobName("FirstStage");

        FileInputFormat.addInputPath(job, new Path("/home/sangfroid/project/in"));
        FileOutputFormat.setOutputPath(job, new Path("/home/sangfroid/project/out"));

        job.setMapperClass(FirstStageMapper.class);
        job.setReducerClass(FirstStageReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
– sangfroid
  • As an aside, if you're just trying to debug your mapper / reducer logic, you should look into using MRUnit (http://www.cloudera.com/blog/2009/07/debugging-mapreduce-programs-with-mrunit/) – Chris White Mar 28 '12 at 23:40
  • As @Chris White suggests, starting with MRUnit to test Map/Reduce logic is a good idea: http://incubator.apache.org/projects/mrunit.html – Binary Nerd Mar 29 '12 at 00:42

6 Answers


Make changes in the bin/hadoop script (the one that reads in hadoop-env.sh): check which command has been fired, and add the remote debug configuration only when the command is jar.

if [ "$COMMAND" = "jar" ] ; then
  exec "$JAVA" -Xdebug -Xrunjdwp:transport=dt_socket,server=y,address=8999 $JAVA_HEAP_MAX $HADOOP_OPTS $CLASS "$@"
else
  exec "$JAVA" $JAVA_HEAP_MAX $HADOOP_OPTS $CLASS "$@"
fi
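A session might then look roughly like this (jar and class names are placeholders; note that suspend defaults to y, so the jar command waits until a debugger attaches):

# only 'hadoop jar' gets the JDWP agent; attach Eclipse's
# "Remote Java Application" debug configuration to localhost:8999
hadoop jar target/myjob.jar Main

# other subcommands take the else branch and start normally
hadoop queue -list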
– Jagdeep Singh
  • I didn't try exactly this but I replaced $JAVA with jdb (I was trying to debug using jdb). jdb never recognized the breakpoint I tried to place where I wanted the program to stop. I'm assuming the problem was that I wasn't running in local mode. I haven't tried it yet but I'm assuming Kapil D's suggestion is what I need to follow. – ali-hussain Apr 15 '13 at 21:15
  • You could also add the debugging options to your shell's $HADOOP_OPTS var, and not have to modify the hadoop script: export HADOOP_OPTS="$HADOOP_OPTS -Xdebug -Xrunjdwp:transport=dt_socket,server=y,address=8999" – Steve Goodman May 02 '13 at 18:33

The only way you can debug hadoop in eclipse is by running hadoop in local mode. The reason is that each map/reduce task runs in its own JVM, and when you don't run hadoop in local mode, eclipse won't be able to debug them.

When you set hadoop to local mode, instead of using the hdfs API (which is the default), the hadoop file system changes to file:///. Thus, running hadoop fs -ls is no longer an hdfs command but effectively hadoop fs -ls file:///, a path to your local directory. Neither the JobTracker nor the NameNode runs.
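To make this concrete, here is a minimal sketch of what local (standalone) mode amounts to, with the two relevant Hadoop 1.x properties written out explicitly (they are the stock defaults, so an unconfigured Job already behaves this way):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class LocalModeSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "file:///");  // local filesystem, no NameNode involved
        conf.set("mapred.job.tracker", "local");  // in-process job runner, no JobTracker
        Job job = new Job(conf, "local-sketch");
        System.out.println("runner: " + job.getConfiguration().get("mapred.job.tracker"));
    }
}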


– Kapil D
  • Hi @Kapil, is what you described possible in Hadoop 2.4 (with Yarn, etc.)? I'm trying to run a local job in eclipse with the new version and facing `Cannot initialize Cluster. Please check your configuration...` – Pedro Dusso Apr 18 '14 at 12:21
  • @PedroDusso have you gotten local debug to work with Hadoop 2.4+? – erichfw Jan 12 '15 at 23:29
  • @erichfw I never tried... I was using 2.2 in the time I asked this question. – Pedro Dusso Jan 13 '15 at 18:01

Besides the recommended MRUnit, I like to debug with eclipse as well. I have a main program that instantiates a Configuration and executes the MapReduce job directly, and I just debug it with a standard eclipse debug configuration. Since I include the hadoop jars in my mvn spec, I have all of hadoop on my classpath and have no need to run against my installed hadoop. I always test with small data sets in local directories to keep things easy. The default configuration behaves as standalone hadoop (the local file system is used).
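A minimal sketch of that setup, reusing the mapper/reducer names from the question (the test paths are hypothetical local directories):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DebugDriver {
    public static void main(String[] args) throws Exception {
        // a default Configuration means standalone mode: local files and an
        // in-process runner, so eclipse's ordinary Java debugging just works
        Job job = new Job(new Configuration(), "debug-run");
        job.setMapperClass(FirstStageMapper.class);
        job.setReducerClass(FirstStageReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path("testdata/in"));    // small local sample
        FileOutputFormat.setOutputPath(job, new Path("testdata/out"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note there is no setJarByClass() here: when classes are loaded from a directory rather than a jar (as under eclipse), there is no jar to point at, which as far as I can tell is where the "No job jar file set" warning comes from; in local mode the classes are found on the classpath anyway, so the warning is harmless.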

– Jaime Garza
  • Thanks for your answer. I, too, have hadoop-core set up as a dependency in my POM. Since that is the case, why am I getting the "No job jar file set" error? Is it because I'm calling job.setJarByClass()? Could you please post some example code? – sangfroid Mar 29 '12 at 17:49

I also like to debug via unit tests with MRUnit. I use it in combination with ApprovalTests, which creates an easy visualization of the map/reduce process and makes it easy to pass in failing scenarios. It also runs seamlessly from eclipse.

For example:

HadoopApprovals.verifyMapReduce(new WordCountMapper(), 
                         new WordCountReducer(), 0, "cat cat dog");

Will produce the output:

[cat cat dog] 
-> maps via WordCountMapper to ->
(cat, 1) 
(cat, 1) 
(dog, 1)

-> reduces via WordCountReducer to ->
(cat, 2) 
(dog, 1)

There's a video on the process here: http://t.co/leExFVrf
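For comparison, the plain MRUnit version of the same scenario looks roughly like this (assuming the usual word-count signature of LongWritable/Text in and Text/IntWritable out):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapReduceDriver;

// runs mapper, shuffle/sort, and reducer in-process;
// runTest() fails if the actual output differs from the expected pairs
MapReduceDriver<LongWritable, Text, Text, IntWritable, Text, IntWritable> driver =
        MapReduceDriver.newMapReduceDriver(new WordCountMapper(), new WordCountReducer());
driver.withInput(new LongWritable(0), new Text("cat cat dog"))
      .withOutput(new Text("cat"), new IntWritable(2))
      .withOutput(new Text("dog"), new IntWritable(1))
      .runTest();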

– llewellyn falco

Adding args to hadoop's internal java command can be done via the HADOOP_OPTS environment variable:

export HADOOP_OPTS="-Xdebug -Xrunjdwp:transport=dt_socket,server=y,address=5005,suspend=y"
– Honza

You can pass the debugging parameters via -Dmapreduce.map.java.opts. For example, you can run the HBase Import tool with its mappers in debug mode:

yarn jar your/path/to/hbase-mapreduce-2.2.5.jar import
     -Dmapreduce.map.speculative=false 
     -Dmapreduce.reduce.speculative=false 
     -Dmapreduce.map.java.opts="-Xdebug -Xrunjdwp:transport=dt_socket,server=y,address=5005,suspend=y" 
     my_table /path/in/hdfs

Note that this must be placed on a single line, without line breaks. Other map-reduce applications can be started the same way; the trick is to pass the debug directives via -Dmapreduce.map.java.opts.
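Reducers can be debugged the same way through the analogous property, for example:

-Dmapreduce.reduce.java.opts="-Xdebug -Xrunjdwp:transport=dt_socket,server=y,address=5006,suspend=y"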

In Eclipse or IntelliJ you have to create a remote debugging connection with:

Host=127.0.0.1 (or even a remote IP address in case Hadoop runs elsewhere)
Port=5005

I managed to debug the Import this way. In addition, you can limit the number of mappers to 1 as described here, but this was not necessary for me.

Once the map-reduce application has started, switch to your IDE and try to launch your debug configuration; it will fail at first. Repeat until the debugger hooks into the application. Don't forget to set a breakpoint beforehand.

In case you want to debug not only your application but also the surrounding HBase/Hadoop framework, you can download the sources here and here (choose your version via the "switch branch/tags" menu button).

– Udo