I'm currently trying to set up a Binary Pig cluster (see https://github.com/endgameinc/binarypig for more information) to analyze malware binaries with Hadoop and Pig. I installed Hadoop and Pig using Cloudera CDH.
My Pig script is as follows:
SET debug 'on';
register '/home/myuser/binarypig-1.0-SNAPSHOT-jar-with-dependencies.jar';
SET mapred.cache.files /tmp/scripts#scripts;
SET mapred.create.symlink yes;
%default INPUT 'hdfs://namenode1:8020/bla/test/malware.archive.seq'
%default TIMEOUT_MS '180000'
%default USE_DEVSHM 'true'
data = load '$INPUT' using com.endgame.binarypig.loaders.ExecutingTextLoader('scripts/strings.sh', '$TIMEOUT_MS', '$USE_DEVSHM');
DUMP data;
The bash script strings.sh just runs the Unix "strings" command to collect all the strings of each file within the malware.archive.seq container. I'm running the Pig script on my namenode with:
pig -f strings.pig
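For context, a wrapper script of this kind can be sketched roughly as follows. This is a hypothetical illustration, not the actual scripts/strings.sh: it assumes the loader calls the script with the path of one extracted binary as its first argument and reads the result from stdout. The function name extract_strings is made up for this example.

```shell
#!/bin/bash
# Hypothetical sketch of a strings wrapper (not the real scripts/strings.sh).
# Assumption: the loader invokes the script with the path of one extracted
# binary as the first argument and captures whatever is written to stdout.

extract_strings() {
    local binary="$1"
    # Fail loudly so that a bad path shows up in the task's stderr log.
    if [ ! -r "$binary" ]; then
        echo "cannot read input file: $binary" >&2
        return 1
    fi
    # Print every printable character sequence found in the file.
    strings "$binary"
}

# Only run when a file argument was actually passed.
if [ "$#" -ge 1 ]; then
    extract_strings "$1"
fi
```

Running such a script by hand against a local sample file is a quick way to check that the script itself works before it is shipped to the task nodes.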
For some reason the job always fails with the following error messages:
Failed!
Failed Jobs:
JobId Alias Feature Message Outputs
job_1440074864855_0058 data MAP_ONLY Message: Job failed! hdfs://namenode1:8020/tmp/temp-362821719/tmp-171792164,
Input(s):
Failed to read data from "hdfs://namenode1:8020/bla/test/malware.zip.seq"
Output(s):
Failed to produce result in "hdfs://namenode1:8020/tmp/temp-362821719/tmp-171792164"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_1440074864855_0058
2015-08-25 17:07:21,616 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2015-08-25 17:07:21,616 [main] DEBUG org.apache.pig.impl.io.InterStorage - Pig Internal storage in use
2015-08-25 17:07:21,622 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias data
The file hdfs://namenode1:8020/bla/test/malware.zip.seq does exist, and its permissions are set to 777 to rule out permission errors.
My guess is that the problem has something to do with the load command in the Pig script, so here are the debug messages for the load command:
2015-08-25 17:07:06,639 [main] DEBUG org.apache.pig.parser.QueryParserDriver - Original macro AST:
(QUERY (STATEMENT data (load 'hdfs://namenode1:8020/bla/test/malware.zip.seq' (FUNC com . endgame . binarypig . loaders . ExecutingTextLoader 'scripts/strings.sh' '180000' 'true'))))
2015-08-25 17:07:06,640 [main] DEBUG org.apache.pig.parser.QueryParserDriver - macro AST after import:
(QUERY (STATEMENT data (load 'hdfs://namenode1:8020/bla/test/malware.zip.seq' (FUNC com . endgame . binarypig . loaders . ExecutingTextLoader 'scripts/strings.sh' '180000' 'true'))))
2015-08-25 17:07:06,640 [main] DEBUG org.apache.pig.parser.QueryParserDriver - Resulting macro AST:
(QUERY (STATEMENT data (load 'hdfs://namenode1:8020/bla/test/malware.zip.seq' (FUNC com . endgame . binarypig . loaders . ExecutingTextLoader 'scripts/strings.sh' '180000' 'true'))))
2015-08-25 17:07:06,961 [main] DEBUG org.apache.pig.parser.QueryParserDriver - Original macro AST:
(QUERY (STATEMENT data (load 'hdfs://namenode1:8020/bla/test/malware.zip.seq' (FUNC com . endgame . binarypig . loaders . ExecutingTextLoader 'scripts/strings.sh' '180000' 'true'))))
2015-08-25 17:07:06,961 [main] DEBUG org.apache.pig.parser.QueryParserDriver - macro AST after import:
(QUERY (STATEMENT data (load 'hdfs://namenode1:8020/bla/test/malware.zip.seq' (FUNC com . endgame . binarypig . loaders . ExecutingTextLoader 'scripts/strings.sh' '180000' 'true'))))
2015-08-25 17:07:06,961 [main] DEBUG org.apache.pig.parser.QueryParserDriver - Resulting macro AST:
(QUERY (STATEMENT data (load 'hdfs://namenode1:8020/bla/test/malware.zip.seq' (FUNC com . endgame . binarypig . loaders . ExecutingTextLoader 'scripts/strings.sh' '180000' 'true'))))
Does anyone have an idea how to fix this or even how to debug this?
Edit (pig_log added):
Pig Stack Trace
---------------
ERROR 1066: Unable to open iterator for alias data
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias data
at org.apache.pig.PigServer.openIterator(PigServer.java:892)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:774)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:372)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
at org.apache.pig.Main.run(Main.java:478)
at org.apache.pig.Main.main(Main.java:156)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.io.IOException: Job terminated with anomalous status FAILED
at org.apache.pig.PigServer.openIterator(PigServer.java:884)
... 13 more
================================================================================