I'm trying to modify the word count example from mrjob. My project structure is:

├── input_text.txt
├── store_xml_dir
│   ├── xml_file.xml
│   └── xml_parse.py
└── wordcount.py

and the content of wordcount.py is:

import os
import sys
cwdir = os.path.dirname(__file__)
sys.path.append(cwdir)
sys.path.append(os.path.join(cwdir, "store_xml_dir"))

import xml_parse
# print dir(xml_parse)  <- this works here if I comment out the rest of the code

from mrjob.job import MRJob


class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        getxml = xml_parse.GetXML()
        print '>>>', getxml.get_strings()

        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    MRWordFrequencyCount.run()

When I run it, I get the error `ImportError: No module named xml_parse`. Why can't Python import `xml_parse` in this case?

Trigges
  • You need to put `store_xml_dir` on the other nodes and add it to `PYTHONPATH`. mrjob does not export your code to other nodes. – Mehraban Oct 19 '14 at 10:29
  • @SAM I've already appended it to the path using `sys.path.append(os.path.join(cwdir, "store_xml_dir"))` – Trigges Oct 19 '14 at 10:35
  • It seems mrjob does not execute that line of code on task trackers. Maybe you need to do it manually. – Mehraban Oct 19 '14 at 10:38
  • @SAM But the code doesn't work when I run it locally either. – Trigges Oct 19 '14 at 10:48
  • Copy the XML directory to all hosts, then [add it permanently to PYTHONPATH](http://stackoverflow.com/questions/3402168/permanently-add-a-directory-to-pythonpath). – Mehraban Oct 19 '14 at 10:54
  • @SAM I tried exporting `PYTHONPATH` with the command ``export PYTHONPATH=$PYTHONPATH:`pwd`/store_xml_dir`` and then it works. But I don't know why `sys.path.append` doesn't work in this case. – Trigges Oct 19 '14 at 11:08
  • I updated my answer. I was wrong about it; mrjob can handle what you were trying to do. – Mehraban Oct 19 '14 at 14:33
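
On the last question above (why an absolute `PYTHONPATH` entry helps where the `sys.path.append` lines did not), the thread never settles the cause. One plausible reading is that the task processes start in a different working directory from the one the job was launched in, so a path built from a relative `__file__` no longer points at `store_xml_dir`. The sketch below is plain Python 3, not mrjob, and only illustrates that general effect. `run_task` and `demo` are hypothetical helpers: they create a throwaway `store_xml_dir/xml_parse.py`, then try `import xml_parse` from a child process started in an unrelated directory, once with a relative path entry and once with an absolute one.

```python
import os
import subprocess
import sys
import tempfile


def run_task(path_entry, task_cwd):
    """Try `import xml_parse` in a child Python started in task_cwd.

    path_entry is the only PYTHONPATH entry given to the child.
    Returns True if the import succeeded.
    """
    env = dict(os.environ, PYTHONPATH=path_entry)
    result = subprocess.run(
        [sys.executable, "-c", "import xml_parse"],
        cwd=task_cwd,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    return result.returncode == 0


def demo():
    # Throwaway project layout mirroring the question (assumption:
    # a one-line module is enough to show the effect).
    project = tempfile.mkdtemp()
    module_dir = os.path.join(project, "store_xml_dir")
    os.mkdir(module_dir)
    with open(os.path.join(module_dir, "xml_parse.py"), "w") as f:
        f.write("VALUE = 1\n")

    # Stand-in for the directory a task might be started in.
    task_dir = tempfile.mkdtemp()

    relative_ok = run_task("store_xml_dir", task_dir)  # relative entry
    absolute_ok = run_task(module_dir, task_dir)       # absolute entry
    return relative_ok, absolute_ok
```

Calling `demo()` should return `(False, True)`: the relative entry is resolved against the child's own working directory, while the absolute entry, like the one produced by `pwd`, keeps pointing at the real directory.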

2 Answers

mrjob does not export your non-MapReduce code for you by default. It can do so, but you need to tell it explicitly in its configuration (take a look at the mrjob documentation).

You can run your code by adding this option:

--python-archive=store_xml_dir
Mehraban

I've found that importing the modules within the member function solves this issue for me.

def mapper(self, _, line):
    import xml_parse
    ...
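
The answer does not say why this helps. What a function-level import does change is timing: the `import` statement runs when the function is called, not when the script is first loaded. The sketch below shows only that Python behaviour, outside mrjob. `mapper_like`, `demo` and the module name `xml_parse_demo` are hypothetical and chosen for illustration.

```python
import importlib
import os
import sys
import tempfile


def mapper_like():
    # This import is executed on every call, so it sees whatever
    # sys.path looks like at call time, not at load time.
    import xml_parse_demo
    return xml_parse_demo.VALUE


def demo():
    # First call: the module does not exist anywhere on sys.path yet.
    try:
        mapper_like()
        failed_before = False
    except ImportError:
        failed_before = True

    # Create the module and make its directory importable afterwards.
    module_dir = tempfile.mkdtemp()
    with open(os.path.join(module_dir, "xml_parse_demo.py"), "w") as f:
        f.write("VALUE = 42\n")
    sys.path.append(module_dir)
    importlib.invalidate_caches()

    # Second call: the same function now imports successfully.
    return failed_before, mapper_like()
```

Here `demo()` returns `(True, 42)`: the first call fails because the module is not yet on the path, and the second succeeds once it is. Whether that timing difference is what fixed the job in this answer is an assumption, not something the thread confirms.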
Bill