22

I'm working on a job that processes a nested directory structure containing files on multiple levels:

one/
├── three/
│   └── four/
│       ├── baz.txt
│       ├── bleh.txt
│       └── foo.txt
└── two/
    ├── bar.txt
    └── gaa.txt

When I add one/ as an input path, no files are processed, since none are immediately available at the root level.

I read about `job.addInputPathRecursively(..)`, but this seems to have been deprecated in more recent releases (I'm using Hadoop 1.0.2). I've written some code to walk the folders and add each dir with `job.addInputPath(dir)`, which worked until the job crashed while trying to process a directory as an input file, e.g. calling `fs.open(split.getPath())` when `split.getPath()` is a directory (this happens inside `LineRecordReader.java`).
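A fixed version of the walk that adds only plain files, never directories, looks something like this (a sketch; the helper name `addNestedInputPaths` is mine):

import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Walk root recursively and register only plain files as inputs,
// so no directory ever reaches LineRecordReader.
public static void addNestedInputPaths(Job job, Path root) throws IOException {
    FileSystem fs = root.getFileSystem(job.getConfiguration());
    for (FileStatus status : fs.listStatus(root)) {
        if (status.isDir()) {
            addNestedInputPaths(job, status.getPath());          // descend into subdirectories
        } else {
            FileInputFormat.addInputPath(job, status.getPath()); // add files only
        }
    }
}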

I'm trying to convince myself there has to be a simpler way to provide a job with a nested directory structure. Any ideas?

EDIT - apparently there's an open bug on this.

sa125
  • Is it so difficult to use `FileSystem#listStatus()` and add them recursively? – Thomas Jungblut Apr 18 '12 at 13:53
  • I am solving it in a similar way: I wrote recursive code that traverses subdirectories and adds all files to the input paths – David Gruzman Apr 18 '12 at 19:02
  • @ThomasJungblut that's basically my current approach. I just find it odd that this functionality isn't built in. Another issue I'm having is that Hadoop crashes when it accesses a subfolder without any files in it, just other folders (like `one` and `one/three` in my example). So basically I need to implement logic that adds folders recursively unless they **only** contain other folders instead of files (and still walks their contents to add nested files). Seems like a lot of trouble just to set up a job. – sa125 Apr 19 '12 at 04:06
  • You could write a PathFilter that only accepts files, and then use the FileInputFormat.setInputPathFilter method - http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/FileInputFormat.html – Chris White May 10 '12 at 03:26
  • possible duplicate of [FileStatus use to recurse directory](http://stackoverflow.com/questions/17618535/filestatus-use-to-recurse-directory) – Suvarna Pattayil Nov 10 '13 at 05:25

5 Answers

14

I didn't find any documentation on this, but `*/*` works. So it's `-input 'path/*/*'`.
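The same glob works from a Java driver too, since FileInputFormat expands glob patterns when it lists inputs (a sketch, assuming the new-API FileInputFormat and the directory layout from the question):

FileInputFormat.addInputPaths(job, "one/*/*"); // matches the files under one/two plus the dir one/three/four

Note that for the layout in the question this happens to cover all three levels: the pattern matches the files under `one/two` directly, and the matched directory `one/three/four` is still, as far as I can tell, listed one level deep.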

Cheng
7

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Tell the new-API FileInputFormat to descend into subdirectories
// when it lists the input paths:
FileInputFormat.setInputDirRecursive(job, true);
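As far as I can tell this setter only exists in newer (2.x-era) releases; it is a convenience wrapper around the `mapreduce.input.fileinputformat.input.dir.recursive` property mentioned in another answer below. A minimal driver sketch (the job name is invented):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "nested-input-job");  // hypothetical job name
FileInputFormat.setInputDirRecursive(job, true);      // descend into subdirectories
FileInputFormat.addInputPath(job, new Path("one"));   // the root dir from the question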

No thanks, just call me LeiFeng!

backingwu
4

I find recursively going through data can be dangerous since there may be lingering log files from a distcp or something similar. Let me propose an alternative:

Do the recursive walk on the command line, then pass the paths into your MapReduce program as a single space-delimited parameter. Grab the list from argv:

$ hadoop jar blah.jar "`hadoop fs -lsr recursivepath | awk '{print $8}' | grep '/data.*\.txt' | tr '\n' ' '`"

Sorry for the long bash, but it gets the job done. You could wrap the thing in a bash script to break things out into variables.

I personally like the pass-in-filepath approach to writing my MapReduce jobs, so the code itself doesn't contain hardcoded paths and it's relatively easy to set it up to run against a more complex list of files.
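On the Java side the driver just splits that argument back apart (a sketch; I'm assuming the list arrives as args[0]):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// args[0] carries the space-separated file list built by the shell
// pipeline above; register each entry as its own input path.
for (String path : args[0].trim().split("\\s+")) {
    FileInputFormat.addInputPath(job, new Path(path));
}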

Donald Miner
  • Thanks for this. Do you know if there is any reason to do it this way vs. FileInputFormat.addInputPaths("comma separated file list from the above bash")? – dranxo Aug 01 '12 at 22:32
  • Interesting, any reason why? I'm quite new to Hadoop but ran into this -lsr problem already. – dranxo Aug 02 '12 at 01:28
2

Don't know if this is still relevant, but at least in Hadoop 2.4.0 you can set the property `mapreduce.input.fileinputformat.input.dir.recursive` to `true` and it will solve your problem.
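For example, from the driver (the property key is verbatim from above):

job.getConfiguration().setBoolean(
        "mapreduce.input.fileinputformat.input.dir.recursive", true);

or, if your driver runs through ToolRunner, on the command line with `-D mapreduce.input.fileinputformat.input.dir.recursive=true`.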

Eitan Illuz
-1

Just use FileInputFormat.addInputPath() with a file pattern. I am writing my first Hadoop program for graph analysis, where the input comes from different directories in .gz format ... it worked for me!
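For instance (a sketch; the path and pattern are invented to match the description of .gz files in different directories):

FileInputFormat.addInputPath(job, new Path("input/*/*.gz")); // glob is expanded when inputs are listed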

Vishal Kumar