22

I'm working on a job that processes a nested directory structure containing files on multiple levels:

one/
├── three/
│   └── four/
│       ├── baz.txt
│       ├── bleh.txt
│       └── foo.txt
└── two/
    ├── bar.txt
    └── gaa.txt

When I add one/ as an input path, no files are processed, since none are immediately available at the root level.

I read about `job.addInputPathRecursively(..)`, but this seems to have been deprecated in more recent releases (I'm using Hadoop 1.0.2). I've written some code to walk the folders and add each dir with `job.addInputPath(dir)`, which worked until the job crashed while trying to process a directory as an input file, e.g. calling `fs.open(split.getPath())` when `split.getPath()` is a directory (this happens inside `LineRecordReader.java`).
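A fixed version of the walk that adds only plain files, never directories, looks something like this (a sketch; the helper name `addNestedInputPaths` is mine):

import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Walk root recursively and register only plain files as inputs,
// so no directory ever reaches LineRecordReader.
public static void addNestedInputPaths(Job job, Path root) throws IOException {
    FileSystem fs = root.getFileSystem(job.getConfiguration());
    for (FileStatus status : fs.listStatus(root)) {
        if (status.isDir()) {
            addNestedInputPaths(job, status.getPath());          // descend into subdirectories
        } else {
            FileInputFormat.addInputPath(job, status.getPath()); // add files only
        }
    }
}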

I'm trying to convince myself there has to be a simpler way to provide a job with a nested directory structure. Any ideas?

EDIT - apparently there's an open bug on this.

sa125
  • Is it so difficult to use `FileSystem#listStatus()` and add them recursively? – Thomas Jungblut Apr 18 '12 at 13:53
  • I am solving it in a similar way: I wrote recursive code that traverses subdirectories and adds all files to the input paths – David Gruzman Apr 18 '12 at 19:02
  • @ThomasJungblut that's basically my current approach. I just find it odd that this functionality isn't built in. Another issue I'm having is that Hadoop crashes when it accesses a subfolder without any files in it, just other folders (like `one` and `one/three` in my example). So basically I need to implement logic that adds folders recursively unless they **only** contain other folders instead of files (and still walks their contents to add nested files). Seems like a lot of trouble just to set up a job. – sa125 Apr 19 '12 at 04:06
  • You could write a PathFilter that only accepts files, and then use the FileInputFormat.setInputPathFilter method - http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/FileInputFormat.html – Chris White May 10 '12 at 03:26
  • possible duplicate of [FileStatus use to recurse directory](http://stackoverflow.com/questions/17618535/filestatus-use-to-recurse-directory) – Suvarna Pattayil Nov 10 '13 at 05:25

5 Answers

14

I didn't find any documentation on this, but `*/*` works. So it's `-input 'path/*/*'`.
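The same glob works from a Java driver too, since FileInputFormat expands glob patterns when it lists inputs (a sketch, assuming the new-API FileInputFormat and the directory layout from the question):

FileInputFormat.addInputPaths(job, "one/*/*"); // matches the files under one/two plus the dir one/three/four

Note that for the layout in the question this happens to cover all three levels: the pattern matches the files under `one/two` directly, and the matched directory `one/three/four` is still, as far as I can tell, listed one level deep.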

Cheng
7

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Tell the new-API FileInputFormat to descend into subdirectories
// when it lists the input paths:
FileInputFormat.setInputDirRecursive(job, true);
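As far as I can tell this setter only exists in newer (2.x-era) releases; it is a convenience wrapper around the `mapreduce.input.fileinputformat.input.dir.recursive` property mentioned in another answer below. A minimal driver sketch (the job name is invented):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "nested-input-job");  // hypothetical job name
FileInputFormat.setInputDirRecursive(job, true);      // descend into subdirectories
FileInputFormat.addInputPath(job, new Path("one"));   // the root dir from the question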

No thanks, just call me LeiFeng!

backingwu
4

I find recursively going through data can be dangerous since there may be lingering log files from a distcp or something similar. Let me propose an alternative:

Do the recursive walk on the command line, then pass the paths into your MapReduce program as a single space-delimited parameter. Grab the list from argv:

$ hadoop jar blah.jar "`hadoop fs -lsr recursivepath | awk '{print $8}' | grep '/data.*\.txt' | tr '\n' ' '`"

Sorry for the long bash, but it gets the job done. You could wrap the thing in a bash script to break things out into variables.

I personally like the pass-in-filepath approach to writing my MapReduce jobs, so the code itself doesn't contain hardcoded paths and it's relatively easy to set it up to run against a more complex list of files.
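On the Java side the driver just splits that argument back apart (a sketch; I'm assuming the list arrives as args[0]):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// args[0] carries the space-separated file list built by the shell
// pipeline above; register each entry as its own input path.
for (String path : args[0].trim().split("\\s+")) {
    FileInputFormat.addInputPath(job, new Path(path));
}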

Donald Miner
  • Thanks for this. Do you know if there is any reason to do it this way vs. FileInputFormat.addInputPaths("comma separated file list from the above bash")? – dranxo Aug 01 '12 at 22:32
  • Interesting, any reason why? I'm quite new to Hadoop but ran into this -lsr problem already. – dranxo Aug 02 '12 at 01:28
2

Don't know if this is still relevant, but at least in Hadoop 2.4.0 you can set the property `mapreduce.input.fileinputformat.input.dir.recursive` to `true` and it will solve your problem.
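For example, from the driver (the property key is verbatim from above):

job.getConfiguration().setBoolean(
        "mapreduce.input.fileinputformat.input.dir.recursive", true);

or, if your driver runs through ToolRunner, on the command line with `-D mapreduce.input.fileinputformat.input.dir.recursive=true`.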

Eitan Illuz
-1

Just use FileInputFormat.addInputPath() with a file pattern. I am writing my first Hadoop program for graph analysis, where the input comes from different directories in .gz format ... it worked for me!
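For instance (a sketch; the path and pattern are invented to match the description of .gz files in different directories):

FileInputFormat.addInputPath(job, new Path("input/*/*.gz")); // glob is expanded when inputs are listed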

Vishal Kumar