What exactly am I doing wrong with my wordcount program? (pig)

Question

I'm very unfamiliar with Pig. I wanted to make a sorted word count that ignores punctuation. I can DUMP D just fine; the issue comes when I attempt to DUMP E and get this error:

[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias E

A = load './SherlockHolmes.txt' using PigStorage(' ');
B = foreach A generate FLATTEN(REGEX_EXTRACT_ALL(LOWER((chararray)$0),'([A-Za-z]+)')) as word;
C = group B by word;
D = foreach C generate COUNT(B) AS counts, group AS word;
E = ORDER D BY counts DESC;
DUMP E;

What am I doing wrong?

What version of pig are you using? I just tried this in version 0.10, and it worked, but it gave incorrect output. — mr2ert, Jul 30 '13 at 17:48
For people who found this post when looking for [ERROR 1066: Unable to open iterator for alias](http://stackoverflow.com/questions/34495085/error-1066-unable-to-open-iterator-for-alias-in-pig-generic-solution) here is a [generic solution](http://stackoverflow.com/a/34495086/983722). — Dennis Jaheruddin, Dec 28 '15 at 14:56

score 3 · Accepted Answer · answered Jul 30 '13 at 18:46

For this answer I'll be using this as my sample input:

Hello, my ;name is Holmes.
This is a test, of a question on SO.
Holmes, again.

When I'm writing a script for the first time, I find it really helpful to DESCRIBE and DUMP each step with some sample data so I know exactly what is happening. Doing that with your script shows:

A = load './SherlockHolmes.txt' using PigStorage(' ');
-- Schema for A unknown.
-- (Hello,,my,;name,is,Holmes.)
-- (This,is,a,test,,of,a,question,on,SO.)
-- (Holmes,,again.)

So the output from A is a 'tuple' (really it is a schema) with an unknown number of values. Generally, if you don't know how many values are in a tuple, you should use a bag instead.

B = foreach A generate FLATTEN(REGEX_EXTRACT_ALL(LOWER((chararray)$0),'([A-Za-z]+)')) as word;
-- B: {word: bytearray}
-- ()
-- (this)
-- ()

When you use $0 you are referring not to all of the words in the schema, but only to the first one, so LOWER and REGEX_EXTRACT_ALL are applied to just the first word of each line. REGEX_EXTRACT_ALL also has to match the entire string, which is why first words with attached punctuation (Hello, and Holmes,) come back empty. Finally, note that FLATTEN is being applied to a tuple, which does not produce the output that you want. You want to FLATTEN a bag.
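
One way to check what $0 actually contains is to project it on its own and inspect the result. This is only a minimal sketch: it assumes the same input file as the question, and the alias first_field and the field name first_word are hypothetical names introduced for illustration.

```pig
-- Load exactly as in the question: each line is split on spaces into fields.
A = LOAD './SherlockHolmes.txt' USING PigStorage(' ');

-- first_field (hypothetical alias) keeps only the first
-- space-delimited field of every line.
first_field = FOREACH A GENERATE (chararray)$0 AS first_word;

DESCRIBE first_field;
DUMP first_field;
```

The dump should contain just one token per input line, which shows that the remaining words never reach LOWER or REGEX_EXTRACT_ALL.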

C, D, and E should all work as you expect, so it is all about massaging the data into a format that they can use.

Knowing this, you can do it like this:

-- Load in the line as a chararray so that TOKENIZE can convert it into a bag
A = load './tests/sh.txt' AS (foo:chararray);

B1 = FOREACH A GENERATE TOKENIZE(foo, ' ') AS tokens: {T:(word: chararray)} ;
-- Output from B1:
-- B1: {tokens: {T: (word: chararray)}}
-- ({(Hello,),(my),(;name),(is),(Holmes.)})
-- ({(This),(is),(a),(test,),(of),(a),(question),(on),(SO.)})
-- ({(Holmes,),(again.)})

-- Now inside a nested FOREACH we apply the appropriate transformations.
B2 = FOREACH B1 {

    -- Inside a nested FOREACH you can go over the contents of a bag
    cleaned = FOREACH tokens GENERATE 
              -- The .*? on each side are needed to capture the leading and trailing punctuation.
              FLATTEN(REGEX_EXTRACT_ALL(LOWER(word),'.*?([a-z]+).*?')) as word ;

    -- Cleaned is a bag, so when we FLATTEN it we get one word per line
    GENERATE FLATTEN(cleaned) ;
}

So now the output of B2 is:

B2: {cleaned::word: bytearray}
(hello)
(my)
(name)
(is)
(holmes)
(this)
(is)
(a)
(test)
(of)
(a)
(question)
(on)
(so)
(holmes)
(again)

Which, when fed into C, D, and E, will give the desired output.
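
Putting the pieces together, the whole script could look roughly like the sketch below. It is not a definitive version: it assumes the question's file path, the field name line and the alias words are hypothetical, and the FILTER step is an addition that is not part of the steps above. Note that COUNT has to reference the alias that was grouped, so the COUNT(B) from the question changes accordingly.

```pig
-- Load each line as a single chararray (path taken from the question).
A = LOAD './SherlockHolmes.txt' AS (line:chararray);

-- Split each line on spaces into a bag of one-word tuples.
B1 = FOREACH A GENERATE TOKENIZE(line, ' ') AS tokens: {T:(word: chararray)};

-- Lowercase each token and keep only its first run of letters.
B2 = FOREACH B1 {
    cleaned = FOREACH tokens GENERATE
              FLATTEN(REGEX_EXTRACT_ALL(LOWER(word), '.*?([a-z]+).*?')) AS word;
    GENERATE FLATTEN(cleaned);
}

-- Optional guard (an assumption, not from the steps above): a token with no
-- letters does not match the pattern and can yield a null word, so drop those.
words = FILTER B2 BY word IS NOT NULL;

-- Same C, D, and E as in the question, but grouping and counting "words".
C = GROUP words BY word;
D = FOREACH C GENERATE COUNT(words) AS counts, group AS word;
E = ORDER D BY counts DESC;
DUMP E;
```

Running DESCRIBE on each alias while building this up is a quick way to confirm that counts is numeric before the ORDER step.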

Let me know if you need me to clarify anything.

Anyway thank you for the help... I really would have never realized all of this — Chenab, Jul 30 '13 at 19:17
Similar-ish question for a different problem, why does this refuse to be ordered? — Chenab, Jul 30 '13 at 19:28
You mean the output of `E`? For me the output of `E` is sorted in descending order, so I'm not really sure. The only thing I can think of off the top of my head is that `counts` is being treated like a chararray instead of a long/int. — mr2ert, Jul 30 '13 at 19:40
After seeing the output of the `DESCRIBE C;`, the type shouldn't be causing the issue. This seems like it needs its own question. Make sure to provide the relevant `DESCRIBE`s and `DUMP`s for it! — mr2ert, Jul 30 '13 at 19:47
