Questions tagged [mrjob]

Mrjob is a Python 2.5+ package that assists in creating and running Hadoop Streaming jobs.

Mrjob fully supports Amazon’s Elastic MapReduce (EMR) service, which lets you buy time on a Hadoop cluster on an hourly basis. It also works with personal Hadoop clusters.

Mrjob can be installed with pip:

pip install mrjob
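
For readers new to the model behind the description above, here is a minimal sketch of the map, group, and reduce flow that a streaming word-count job follows. It is plain Python and does not import mrjob. The `mapper` and `reducer` generators only mirror the general shape of the methods seen in the question excerpts on this page, and `run_local` is a hypothetical stand-in for the framework, not part of any library.

```python
from collections import defaultdict


def mapper(_, line):
    """Emit (word, 1) for every whitespace-separated word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def reducer(word, counts):
    """Sum all the counts that were emitted for a single word."""
    yield word, sum(counts)


def run_local(lines):
    """Hypothetical local driver: map every line, group by key, then reduce.

    A real cluster does the grouping (the "shuffle") for you; here it is
    simulated with an in-memory dictionary.
    """
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(None, line):
            grouped[key].append(value)

    results = {}
    for key in sorted(grouped):  # reducers see keys in sorted order
        for out_key, out_value in reducer(key, grouped[key]):
            results[out_key] = out_value
    return results


if __name__ == "__main__":
    for word, count in run_local(["the quick fox", "the lazy dog"]).items():
        print(word, count)
```

In an actual mrjob job, the two generators would be methods on a job class and the framework would take the place of `run_local`.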
307 questions
31
votes
1 answer

Accessing stream output from hdfs of MRjob

I'm trying to use a Python driver to run an iterative MRjob program. The exit criteria depend on a counter. The job itself seems to run. If I run a single iteration from the command line, I can then hadoop fs -cat /user/myname/myhdfsdir/part-00000…
tony_tiger
  • 740
  • 1
  • 9
  • 22
12
votes
2 answers

mrjob: Invalid bootstrap action path, must be a location in Amazon S3

I am on Windows 7. I installed mrjob and when I run the example word_count file from the website, it works fine on the local machine. However, I get the error when attempting to run it on Amazon EMR. I even tested connecting to Amazon S3 with just…
KJW
  • 14,248
  • 44
  • 128
  • 236
11
votes
2 answers

How can python subprocess.Popen see select.poll and then later not? (select 'module' object has no attribute 'poll')

I'm using the (awesome) mrjob library from Yelp to run my Python programs in Amazon's Elastic MapReduce. It depends on subprocess in the standard Python library. From my Mac running Python 2.7.2, everything works as expected. However, when I…
user1181407
  • 295
  • 1
  • 3
  • 8
11
votes
4 answers

Running a job using hadoop streaming and mrjob: PipeMapRed.waitOutputThreads(): subprocess failed with code 1

Hey, I'm fairly new to the world of Big Data. I came across this tutorial: http://musicmachinery.com/2011/09/04/how-to-process-a-million-songs-in-20-minutes/ It describes in detail how to run a MapReduce job using mrjob both locally and on…
Kiran Karanth
  • 113
  • 1
  • 1
  • 8
9
votes
2 answers

Numpy and Scipy with Amazon Elastic MapReduce

Using mrjob to run Python code on Amazon's Elastic MapReduce, I have successfully found a way to upgrade the EMR image's numpy and scipy. Running the following commands from the console works: tar -cvf py_bundle.tar mymain.py Utils.py…
jtman
  • 113
  • 2
  • 6
7
votes
5 answers

Multiple Inputs with MRJob

I'm trying to learn to use Yelp's Python API for MapReduce, MRJob. Their simple word counter example makes sense, but I'm curious how one would handle an application involving multiple inputs. For instance, rather than simply counting the words in a…
follyroof
  • 2,712
  • 1
  • 25
  • 26
7
votes
2 answers

How to get the name of input file in MRjob

I'm writing a map function using mrjob. My input will come from files in a directory on HDFS. The names of the files contain a small but crucial piece of information that is not present in the files. Is there a way to learn (inside a map function) the…
Bolo
  • 10,774
  • 5
  • 38
  • 58
6
votes
2 answers

mrjob: setup logging on EMR

I'm trying to use mrjob to run Hadoop on EMR, and can't figure out how to set up logging (user-generated logs in map/reduce steps) so that I can access them after the cluster is terminated. I have tried to set up logging using the logging…
Beka
  • 605
  • 4
  • 20
6
votes
0 answers

Processing MongoDB data using Mrjob on Amazon EMR

I know that Mrjob uses Hadoop Streaming. I also know that there is a plugin for using MongoDB with Hadoop Streaming. However, I couldn't find any examples on bringing two together. Is this (at least in theory) possible? If so, are there any relevant…
Eser Aygün
  • 6,566
  • 1
  • 17
  • 26
6
votes
2 answers

Python Dependency Management on EMR

I'm sending code to Amazon's EMR via the mrjob/boto modules. I've got some external Python dependencies (e.g. numpy, boto, etc.) and currently have to download the source of the Python packages, and send them over as a tarball in the "python_archives"…
follyroof
  • 2,712
  • 1
  • 25
  • 26
6
votes
1 answer

How can I iteratively process all files under one directory using mrjob

I am using mrjob to process a batch of files and get some statistics. I know I can run a mapreduce job on a single file, like python count.py < some_input_file > output But how can I feed a directory of files to the script? The file directory…
Chunliang Lyu
  • 1,702
  • 19
  • 34
6
votes
1 answer

How does one specify the input file for a runner from Python?

I am writing an external script to run a mapreduce job via the Python mrjob module on my laptop (not on Amazon Elastic Compute Cloud or any large cluster). I read from the mrjob documentation that I should use MRJob.make_runner() to run a mapreduce…
dangerChihuahua007
  • 18,433
  • 28
  • 104
  • 190
5
votes
2 answers

Run MRJob from IPython notebook

I'm trying to run an mrjob example from an IPython notebook: from mrjob.job import MRJob class MRWordFrequencyCount(MRJob): def mapper(self, _, line): yield "chars", len(line) yield "words", len(line.split()) yield "lines", 1 def…
szu
  • 864
  • 7
  • 22
4
votes
1 answer

How do I put a print statement in mrjob code for debugging purposes?

How do I put a debug statement (like print) in reducer or mapper for mrjob. If I try to use print or sys.stderr.write(), I get an error TypeError: a bytes-like object is required, not 'str'
4
votes
2 answers

Python Module Import Error "ImportError: No module named mrjob.job"

System: Mac OSX 10.6.5, Python 2.6 I try to run the python script below: from mrjob.job import MRJob class MRWordCounter(MRJob): def mapper(self, key, line): for word in line.split(): yield word, 1 def reducer(self, word,…
worker1138
  • 1,971
  • 4
  • 26
  • 36