Import text files to pig through python UDF

Question

I'm trying to load files into Pig using a Python UDF. I've tried two ways:

• (myudf1, sample1.pig): read the file from within Python; the file is located on my client server.

• (myudf2, sample2.pig): load the file from HDFS in the Grunt shell first, then pass it as a parameter to the Python UDF.

myudf1.py

from __future__ import with_statement
def get_words(dir):
    stopwords=set()
    with open(dir) as f1:
        for line1 in f1:
            stopwords.update([line1.decode('ascii','ignore').split("\n")[0]])
    return stopwords

stopwords=get_words("/home/zhge/uwc/mappings/english_stop.txt")

@outputSchema("findit: int")
def findit(stp):
    stp=str(stp)
    if stp in stopwords:
        return 1
    else:
        return 0

sample1.pig:

REGISTER '/home/zhge/uwc/scripts/myudf1.py' USING jython as pyudf;
item_title = load '/user/zhge/data/item_title_sample/000000_0' USING PigStorage(',')  AS (title:chararray);

T = limit item_title 1;
S = FOREACH T GENERATE pyudf.findit(title);
DUMP S;

I get: IOError: (2, 'No such file or directory', '/home/zhge/uwc/mappings/english_stop.txt')

For the second approach:

myudf2.py:

def get_wordlists(wordbag):
    stopwords=set()
    for t in wordbag:
        stopwords.update(t.decode('ascii','ignore'))
    return stopwords


@outputSchema("findit: int")
def findit(stopwordbag, stp):
    stopwords=get_wordlists(stopwordbag)
    stp=str(stp)
    if stp in stopwords:
        return 1
    else:
        return 0

sample2.pig:

REGISTER '/home/zhge/uwc/scripts/myudf2.py' USING jython as pyudf;

stops = load '/user/zhge/uwc/mappings/stopwords.txt' AS (stop_w:chararray);
-- this step works fine and I can see the "stops" object is loaded into Pig
item_title = load '/user/zhge/data/item_title_sample/000000_0' USING PigStorage(',')  AS (title:chararray);
T = limit item_title 1;
S = FOREACH T GENERATE pyudf.findit(stops.stop_w, title);
DUMP S;

Then I got: ERROR org.apache.pig.tools.grunt.Grunt -ERROR 1066: Unable to open iterator for alias S. Backend error : Scalar has more than one row in the output. 1st : (a), 2nd :(as

For people who found this post when looking for [ERROR 1066: Unable to open iterator for alias](http://stackoverflow.com/questions/34495085/error-1066-unable-to-open-iterator-for-alias-in-pig-generic-solution) here is a [generic solution](http://stackoverflow.com/a/34495086/983722). — Dennis Jaheruddin, Dec 28 '15 at 15:18

score 0 · Answer 1 · answered May 28 '15 at 17:51

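Your second example should work, but you applied LIMIT to the wrong relation: it should be on stops, not item_title. A relation used as a scalar (as in stops.stop_w) must contain exactly one row, which is what the "Scalar has more than one row" error is complaining about. Therefore it should be: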

stops = LOAD '/user/zhge/uwc/mappings/stopwords.txt' AS (stop_w:chararray);

item_title = LOAD '/user/zhge/data/item_title_sample/000000_0' USING PigStorage(',') AS (title:chararray);
T = LIMIT stops 1;
S = FOREACH item_title GENERATE pyudf.findit(T.stop_w, title);

However, since it looks like you need to process all of the stop words, this will not be enough. You'll need to do a GROUP ALL and then pass the result to your get_wordlists function instead:

stops = LOAD '/user/zhge/uwc/mappings/stopwords.txt' AS (stop_w:chararray);

item_title = LOAD '/user/zhge/data/item_title_sample/000000_0' USING PigStorage(',') AS (title:chararray);
T = FOREACH (GROUP stops ALL) GENERATE pyudf.get_wordlists(stops) AS ready;
S = FOREACH item_title GENERATE pyudf.findit(T.ready, title);

You'll have to update your UDF to accept a bag, though, for this method to work. In a Jython UDF a bag arrives as a list of tuples rather than a list of plain strings.
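As a rough illustration of that last point, here is a minimal sketch of myudf2.py adapted to take a bag. It assumes the bag reaches the UDF as a list of one-field tuples, so each word is read from t[0]. The fallback outputSchema definition is only a stand-in so the file can be run outside Pig, and the sample values at the bottom are made up.

```python
# Sketch only: assumes each element of the bag is a one-field tuple like ("the",).
try:
    outputSchema  # provided by Pig when the script is REGISTERed USING jython
except NameError:
    # Hypothetical stand-in so the module can also be run outside Pig.
    def outputSchema(schema):
        def decorator(func):
            return func
        return decorator


def get_wordlists(wordbag):
    """Collect the first field of every tuple in the bag into a set."""
    stopwords = set()
    for t in wordbag:
        word = t[0]
        if word is not None:
            # add() stores the whole word; update() on a string would add
            # its individual characters instead.
            stopwords.add(word.strip())
    return stopwords


@outputSchema("findit: int")
def findit(stopwordbag, stp):
    """Return 1 if stp is one of the stop words in the bag, else 0."""
    stopwords = get_wordlists(stopwordbag)
    if stp is not None and stp in stopwords:
        return 1
    return 0


# Made-up sample data for a quick local check outside Pig.
sample_bag = [("a",), ("as",), ("the",)]
hits = [findit(sample_bag, w) for w in ["the", "pig", "as"]]
```

Rebuilding the set on every call keeps the sketch simple, but it repeats the same work for each title, so it is only reasonable for a short stop-word list.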
