Questions tagged [mrjob]

Mrjob is a Python 2.5+ package that assists with creating and running Hadoop Streaming jobs.

Mrjob fully supports Amazon’s Elastic MapReduce (EMR) service, which allows one to buy time on a Hadoop cluster on an hourly basis. It also works with personal Hadoop clusters.

Mrjob can be installed with pip:

pip install mrjob
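
For readers new to the tag, the kind of job mrjob helps write can be sketched without the library. The example below is a plain-Python illustration of the map, shuffle (sort and group by key), and reduce flow that Hadoop Streaming jobs follow. It does not import mrjob, and the function names (`mapper`, `reducer`, `run_word_count`) are hypothetical, chosen only for this illustration.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Emit (word, 1) for every whitespace-separated word in a line."""
    for word in line.split():
        yield word.lower(), 1


def reducer(word, counts):
    """Sum the counts collected for one word."""
    yield word, sum(counts)


def run_word_count(lines):
    """Map every line, sort and group by key (the shuffle), then reduce."""
    mapped = [pair for line in lines for pair in mapper(line)]
    mapped.sort(key=itemgetter(0))
    results = {}
    for word, group in groupby(mapped, key=itemgetter(0)):
        for key, total in reducer(word, (count for _, count in group)):
            results[key] = total
    return results


if __name__ == "__main__":
    print(run_word_count(["the quick brown fox", "The lazy dog"]))
```

In a real mrjob job, the mapper and reducer would be methods on a job class and the framework would handle the sort and grouping; this sketch only mirrors that shape locally.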

307 questions

1 answer

Accessing stream output from hdfs of MRjob

I'm trying to use a Python driver to run an iterative MRjob program. The exit criteria depend on a counter. The job itself seems to run. If I run a single iteration from the command line, I can then hadoop fs -cat /user/myname/myhdfsdir/part-00000…

python hadoop mapreduce hdfs mrjob

asked Mar 25 '18 at 04:10

tony_tiger

2 answers

mrjob: Invalid bootstrap action path, must be a location in Amazon S3

I am on Windows 7. I installed mrjob and when I run the example word_count file from the website, it works fine on the local machine. However, I get the error when attempting to run it on Amazon EMR. I even tested connecting to Amazon S3 with just…

python amazon-emr mrjob

asked Apr 22 '14 at 07:24

KJW

2 answers

How can python subprocess.Popen see select.poll and then later not? (select 'module' object has no attribute 'poll')

I'm using the (awesome) mrjob library from Yelp to run my Python programs in Amazon's Elastic MapReduce. It depends on subprocess in the standard Python library. From my Mac running Python 2.7.2, everything works as expected. However, when I…

python subprocess mrjob

asked Jan 31 '12 at 21:53

user1181407

4 answers

Running a job using hadoop streaming and mrjob: PipeMapRed.waitOutputThreads(): subprocess failed with code 1

Hey, I'm fairly new to the world of Big Data. I came across this tutorial on http://musicmachinery.com/2011/09/04/how-to-process-a-million-songs-in-20-minutes/ It describes in detail how to run a MapReduce job using mrjob both locally and on…

python hadoop mapreduce hadoop-streaming mrjob

asked Jun 11 '13 at 05:50

Kiran Karanth

2 answers

Numpy and Scipy with Amazon Elastic MapReduce

Using mrjob to run Python code on Amazon's Elastic MapReduce, I have successfully found a way to upgrade the EMR image's numpy and scipy. Running from the console, the following commands work: tar -cvf py_bundle.tar mymain.py Utils.py…

python numpy scipy mrjob

asked Nov 11 '11 at 16:08

jtman

5 answers

Multiple Inputs with MRJob

I'm trying to learn to use Yelp's Python API for MapReduce, MRJob. Their simple word counter example makes sense, but I'm curious how one would handle an application involving multiple inputs. For instance, rather than simply counting the words in a…

python mapreduce mrjob

asked Feb 15 '12 at 22:37

follyroof

2 answers

How to get the name of input file in MRjob

I'm writing a map function using mrjob. My input will come from files in a directory on HDFS. Names of the files contain a small but crucial piece of information that is not present in the files. Is there a way to learn (inside a map function) the…

python hadoop hadoop-streaming mrjob

asked Jul 11 '12 at 14:26

Bolo
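
One approach relevant to the question above, sketched under assumptions: Hadoop Streaming exposes job configuration properties to each task as environment variables, with dots replaced by underscores, so the current input file is typically available as `mapreduce_map_input_file` (or `map_input_file` on older Hadoop versions). The helper name `input_file_from_env` is hypothetical, and whether the variable is set depends on the Hadoop version and input format.

```python
import os

# Hadoop Streaming exports job configuration properties to the task
# environment with dots replaced by underscores. The newer property
# name is checked first, then the older one.
_INPUT_FILE_VARS = ("mapreduce_map_input_file", "map_input_file")


def input_file_from_env(environ=None):
    """Return the current task's input file path, or None if it is not set."""
    environ = os.environ if environ is None else environ
    for name in _INPUT_FILE_VARS:
        value = environ.get(name)
        if value:
            return value
    return None
```

Inside a mapper, calling `input_file_from_env()` with no arguments reads the real environment; passing a dict makes the helper easy to exercise outside a cluster.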

2 answers

mrjob: setup logging on EMR

I'm trying to use mrjob for running Hadoop on EMR, and can't figure out how to set up logging (user-generated logs in map/reduce steps) so I will be able to access them after the cluster is terminated. I have tried to set up logging using the logging…

python hadoop logging mapreduce mrjob

asked Sep 30 '14 at 14:18

Beka

0 answers

Processing MongoDB data using Mrjob on Amazon EMR

I know that Mrjob uses Hadoop Streaming. I also know that there is a plugin for using MongoDB with Hadoop Streaming. However, I couldn't find any examples of bringing the two together. Is this (at least in theory) possible? If so, are there any relevant…

python mongodb hadoop amazon-emr mrjob

asked Dec 06 '13 at 12:15

Eser Aygün

2 answers

Python Dependency Management on EMR

I'm sending code to Amazon's EMR via the mrjob/boto modules. I've got some external Python dependencies (e.g. numpy, boto, etc.) and currently have to download the source of the Python packages, and send them over as a tarball in the "python_archives"…

python virtualenv pip elastic-map-reduce mrjob

asked Jul 09 '13 at 21:24

follyroof

1 answer

How can I iteratively process all files under one directory using mrjob

I am using mrjob to process a batch of files and get some statistics. I know I can run a MapReduce job on a single file, like python count.py < some_input_file > output. But how can I feed a directory of files to the script? The file directory…

python hadoop mrjob

asked Dec 07 '12 at 11:28

Chunliang Lyu

1 answer

How does one specify the input file for a runner from Python?

I am writing an external script to run a mapreduce job via the Python mrjob module on my laptop (not on Amazon Elastic Compute Cloud or any large cluster). I read from the mrjob documentation that I should use MRJob.make_runner() to run a mapreduce…

python mapreduce mrjob

asked Sep 24 '12 at 16:38

dangerChihuahua007

2 answers

Run MRJob from IPython notebook

I'm trying to run the mrjob example from an IPython notebook: from mrjob.job import MRJob class MRWordFrequencyCount(MRJob): def mapper(self, _, line): yield "chars", len(line) yield "words", len(line.split()) yield "lines", 1 def…

python mapreduce ipython-notebook mrjob

asked Jul 11 '14 at 15:17

szu

1 answer

How do I put a print statement in mrjob code for debugging purposes?

How do I put a debug statement (like print) in a reducer or mapper for mrjob? If I try to use print or sys.stderr.write(), I get an error: TypeError: a bytes-like object is required, not 'str'

python mrjob

asked Nov 19 '19 at 03:55

AbhaySamant
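
As general background to the question above: in a streaming-style job, standard output carries the job's records, so debug messages are usually written to standard error instead. The sketch below shows that pattern in plain Python. The helper names (`format_debug`, `debug`) are hypothetical, and the excerpt does not establish whether this resolves the asker's TypeError, which may come from writing a str to a binary stream.

```python
import sys


def format_debug(message):
    """Build one newline-terminated debug line as text (str)."""
    return "DEBUG: {}\n".format(message)


def debug(message):
    """Write a debug line to stderr, leaving stdout free for job output."""
    sys.stderr.write(format_debug(message))
    sys.stderr.flush()


def mapper(line):
    """Emit (word, 1) pairs, reporting the word count of each line to stderr."""
    words = line.split()
    debug("saw {} words".format(len(words)))
    for word in words:
        yield word, 1
```

If the target stream is binary rather than text, the same line would need encoding first, for example `format_debug(msg).encode("utf-8")`.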

2 answers

Python Module Import Error "ImportError: No module named mrjob.job"

System: Mac OS X 10.6.5, Python 2.6. I try to run the Python script below: from mrjob.job import MRJob class MRWordCounter(MRJob): def mapper(self, key, line): for word in line.split(): yield word, 1 def reducer(self, word,…

python module path mrjob

asked Nov 16 '10 at 23:07

worker1138
