
I am having issues running Pig streaming. When I start up an interactive Pig instance with only one machine (FYI, I am doing this on the master node of an interactive Pig AWS EMR instance via SSH/PuTTY), my Pig streaming works perfectly (it also works on my Windows Cloudera VM image). However, when I switch to using more than one computer, it simply stops working and gives various errors.

Note that:

  • I am able to run Pig scripts that don't have any stream commands with no problem on a multi-computer instance.
  • All my Pig work is being done in Pig MapReduce mode rather than -x local mode.
  • My Python script (stream1.py) has this on its first line: #!/usr/bin/env python
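
For context, stream1.py follows the usual Hadoop streaming pattern; below is only a minimal sketch, not my actual script:

#!/usr/bin/env python
# minimal streaming pattern: read tab-separated tuples from stdin,
# process them, and write tab-separated fields back to stdout
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    # (real processing would go here)
    print("\t".join(fields))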

Below is a small sample of the options I have tried so far (all of the commands below are run in the grunt shell on the master/main node, which I am accessing via SSH/PuTTY):

This is how I get the Python file onto the master node so it can be used:

cp s3n://darin.emr-logs/stream1.py stream1.py
copyToLocal stream1.py /home/hadoop/stream1.py
chmod 755 stream1.py

These are my various stream attempts:

cooc = stream ct_pag_ph through `stream1.py`;
dump cooc;
ERROR 2090: Received Error while processing the reduce plan: 'stream1.py ' failed with exit status: 127

cooc = stream ct_pag_ph through `python stream1.py`;
dump cooc;
ERROR 2090: Received Error while processing the reduce plan: 'python stream1.py ' failed with exit status: 2

DEFINE X `stream1.py`; 
cooc = stream ct_bag_ph through X;
dump cooc;
ERROR 2090: Received Error while processing the reduce plan: 'stream1.py ' failed with exit status: 127

DEFINE X `stream1.py`; 
cooc = stream ct_bag_ph through `python X`;
dump cooc;
ERROR 2090: Received Error while processing the reduce plan: 'python X ' failed with exit status: 2

DEFINE X `stream1.py` SHIP('stream1.py');
cooc = STREAM ct_bag_ph THROUGH X;
dump cooc;
ERROR 2017: Internal error creating job configuration.

DEFINE X `stream1.py` SHIP('/stream1.py');
cooc = STREAM ct_bag_ph THROUGH X;
dump cooc;

DEFINE X `stream1.py` SHIP('stream1.py') CACHE('stream1.py');
cooc = STREAM ct_bag_ph THROUGH X;
ERROR 2017: Internal error creating job configuration.

define X 'python /home/hadoop/stream1.py' SHIP('/home/hadoop/stream1.py');
cooc = STREAM ct_bag_ph THROUGH X;
Darin

1 Answer

DEFINE X `stream1.py` SHIP('stream1.py');

This appears valid to me, given your preconditions and provided stream1.py is in your current local directory.

A way to be sure of this:

DEFINE X `python stream1.py` SHIP('/local/path/stream1.py');

The goal of SHIP is to copy the command into the working directory of all the tasks.
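
For instance, a minimal end-to-end sketch (the load path and schema are placeholders I made up; only the DEFINE and STREAM lines matter here):

DEFINE X `python stream1.py` SHIP('/local/path/stream1.py');
-- placeholder input and schema; substitute your own
ct_bag_ph = LOAD 's3n://your-bucket/your-input' AS (f1:chararray, f2:chararray);
cooc = STREAM ct_bag_ph THROUGH X;
DUMP cooc;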

Romain
  • Thanks for the confirmation on this. Perhaps these are working for me but the problem is with the execution of the Python itself (not finding it). Since I am quite sure the Python code doesn't throw any errors and doesn't use any additional libraries, could it have something to do with the first line of the file: #!/usr/bin/env python - I guess I don't fully understand this statement. – Darin Aug 01 '11 at 00:11
  • OK so I tested with both #!/usr/bin/env python and #!/usr/bin/python and with the following DEFINE statements: define X `stream_t.py` SHIP('stream_t.py'); define X `python stream_t.py` SHIP('stream_t.py'); define X `stream_t.py` SHIP('/home/hadoop/stream_t.py'); define X `stream_t.py` SHIP('stream_t.py') CACHE('stream_t.py'); However, it still isn't working, as I continually get ERROR 2017: Internal error creating job configuration. – Darin Aug 02 '11 at 05:44
  • With some help from the AWS EMR folks this has been resolved, and the following command does the trick (see the sketch after these comments): define X `stream_t.py` SHIP('/home/hadoop/stream_t.py'); – Darin Aug 05 '11 at 02:22
  • Cool, so it needed the correct local path. Regarding the "shebang", you can have a look at http://stackoverflow.com/questions/2429511/why-do-people-write-usr-bin-env-python-on-the-first-line-of-a-python-script – Romain Aug 08 '11 at 19:49
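
Putting the resolution from the comments together, a minimal working sketch (relation and file names assumed from the question):

define X `stream_t.py` SHIP('/home/hadoop/stream_t.py');
cooc = STREAM ct_bag_ph THROUGH X;
DUMP cooc;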