Pig: Python UDF to search text for a list of keywords/strings

Question

I have two files, one with a list of keywords/strings:

blue fox
the
lazy dog
orange
of
file

Another, with text:

The blue fox jumped
over the lazy dog
this file has nothing important
lines repeat
this line does not match

I want to take the list of strings in the first file and find the lines in the second file that match any of them. So I wrote a Pig script with a Python UDF:

register match.py using jython as match;
A = LOAD 'words.txt' AS (word:chararray);
B = LOAD 'text.txt' AS (line:chararray);
C = GROUP A ALL;
D = FOREACH B generate match.match(C.$1,line);
dump D;

#match.py
@outputSchema("str:chararray")
def match(wordlist,line):
    linestr = str(line)
    for word in wordlist:
            wordstr = str(word)
            if re.search(wordstr,linestr):
                    return line

Ends in error:

"2014-04-01 06:22:34,775 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias D. Backend error : Error executing function"

Detailed Error log:

Backend error message
---------------------
org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error executing function
        at org.apache.pig.scripting.jython.JythonFunction.exec(JythonFunction.java:120)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:337)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:434)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:340)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:372)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:297)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278)
        at o

Pig Stack Trace
---------------
ERROR 1066: Unable to open iterator for alias D. Backend error : Error executing function

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias D. Backend error : Error executing function
        at org.apache.pig.PigServer.openIterator(PigServer.java:828)
        at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:696)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:320)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:194)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170)
        at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
        at org.apache.pig.Main.run(Main.java:538)
        at org.apache.pig.Main.main(Main.java:157)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error executing function
        at org.apache.pig.scripting.jython.JythonFunction.exec(JythonFunction.java:120)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:337)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:434)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:340)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:372)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:297)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278)
================================================================================

For people who found this post when looking for [ERROR 1066: Unable to open iterator for alias](http://stackoverflow.com/questions/34495085/error-1066-unable-to-open-iterator-for-alias-in-pig-generic-solution) here is a [generic solution](http://stackoverflow.com/a/34495086/983722). — Dennis Jaheruddin, Dec 28 '15 at 14:49

score 1 · Accepted Answer · answered Apr 02 '14 at 21:53

I suspect the "re" module isn't available to Jython on my CDH4.x cluster, but I did not spend much time on the Python UDF and solved it by writing a Java UDF instead. Pardon my Java, I am a n00b: this may not be the most efficient or prettiest code, and I am sure there are some bugs in it:

package pigext;

import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

// Returns "line,regex" for the first regex in the bag that matches the whole
// line (ignoring case), or an empty string when nothing matches.
public class matchList extends EvalFunc<String> {

    @Override
    public String exec(Tuple input) throws IOException {
        try {
            String line = (String) input.get(0);   // the text line to test
            DataBag bag = (DataBag) input.get(1);  // bag of one-field tuples, one regex each
            if (line == null) {
                return "";
            }
            for (Tuple t : bag) {
                if (t != null && t.size() > 0 && t.get(0) != null) {
                    String cmd = t.get(0).toString();
                    // Compile with CASE_INSENSITIVE rather than lowercasing the regex,
                    // which would turn escapes such as \S into \s.
                    if (Pattern.compile(cmd, Pattern.CASE_INSENSITIVE).matcher(line).matches()) {
                        return line + "," + cmd;
                    }
                }
            }
            return "";
        } catch (Exception e) {
            throw new IOException("Failed to process row", e);
        }
    }
}

To use it, you need a file of regexes, one per line, that you want to search for, plus your target text file. For example, a regex file "wordstest.txt" containing:

.*?this +blah.*?

And your text file, text.txt, is:

this blah starts with blah
this    blah has way too many spaces
that won't match
thisblahshouldnotmatch
thisblah should not match either
the line here is this blah
line here has this blah in the middle
line here has this    blah with extra spaces
only has blah
only has this

The pig script would be:

REGISTER pigext.jar;
A = LOAD 'wordstest.txt' AS (cmd:chararray);
B = LOAD 'text.txt' AS (line:chararray);
C = GROUP A ALL;
D = FOREACH B generate pigext.matchList(line,C.$1);
dump D;
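
To preview which of the sample lines the UDF should report before running the job, the matching rule can be sketched locally in Python. This is only an approximation under stated assumptions: `re.fullmatch` with `re.IGNORECASE` stands in for Java's whole-line, case-insensitive match, and the helper name `first_match` is made up for this illustration. Python's `re` and `java.util.regex` differ in some details, so treat it as a sanity check rather than a faithful port.

```python
import re

# The single regex from the sample regex file above
patterns = [r".*?this +blah.*?"]

# The sample contents of text.txt
lines = [
    "this blah starts with blah",
    "this    blah has way too many spaces",
    "that won't match",
    "thisblahshouldnotmatch",
    "thisblah should not match either",
    "the line here is this blah",
    "line here has this blah in the middle",
    "line here has this    blah with extra spaces",
    "only has blah",
    "only has this",
]


def first_match(line, patterns):
    """Hypothetical helper: return 'line,regex' for the first regex that
    matches the entire line (ignoring case), else an empty string."""
    for pat in patterns:
        if re.fullmatch(pat, line, re.IGNORECASE):
            return line + "," + pat
    return ""


results = [first_match(line, patterns) for line in lines]
matched = [r for r in results if r]
for r in matched:
    print(r)
```

Because the whole line has to match, the leading and trailing `.*?` in the regex are what allow "this blah" to appear anywhere in the line; the ` +` is what rejects "thisblah".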

The UDF would probably be better written as a FilterFunc so you don't have to do a generate (especially if you have a long list of fields). Left as an exercise to the reader, or maybe for when I find time to redo the UDF :) — Joe Nate, Apr 02 '14 at 22:01
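
For anyone who would rather stay with the Python UDF from the question, here is a minimal sketch of how match.py might be adjusted. It rests on two assumptions that the truncated log cannot confirm: that the script failed because it never imports `re`, and that the bag `C.$1` reaches Jython as a list of one-field tuples, so `str(word)` produced something like `('blue fox',)` instead of the keyword. The `try/except` fallback for `outputSchema` is not needed inside Pig; it is only there so the file also runs under plain Python.

```python
import re

try:
    outputSchema  # provided by Pig when the script is registered with jython
except NameError:
    def outputSchema(schema):
        # No-op stand-in so the module can be exercised outside Pig
        def wrap(func):
            return func
        return wrap


@outputSchema("str:chararray")
def match(wordlist, line):
    """Return the line if any keyword occurs in it, otherwise None."""
    if line is None or wordlist is None:
        return None
    for item in wordlist:
        # Assumption: each bag element is a one-field tuple such as ('blue fox',)
        word = item[0] if isinstance(item, tuple) else item
        if word is None:
            continue
        # re.escape makes the keyword match literally; drop it if the
        # entries in words.txt are meant to be regular expressions
        if re.search(re.escape(word), line):
            return line
    return None
```

Called outside Pig with the question's keywords, `match([('blue fox',), ('file',)], 'The blue fox jumped')` returns the line, and a line containing none of the keywords returns None, which Pig would treat as null.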
