For this answer I'll be using this as my sample input:
Hello, my ;name is Holmes.
This is a test, of a question on SO.
Holmes, again.
When I'm writing a script for the first time, I find it really helpful to DESCRIBE and DUMP each step with some sample data so I know exactly what is happening. Doing that with your script shows:
A = load './SherlockHolmes.txt' using PigStorage(' ');
-- Schema for A unknown.
-- (Hello,,my,;name,is,Holmes.)
-- (This,is,a,test,,of,a,question,on,SO.)
-- (Holmes,,again.)
So the output from A is a 'tuple' (really it is a schema) with an unknown number of values. Generally, if you don't know how many values are in a tuple, you should use a bag instead.
B = foreach A generate FLATTEN(REGEX_EXTRACT_ALL(LOWER((chararray)$0),'([A-Za-z]+)')) as word;
-- B: {word: bytearray}
-- ()
-- (this)
-- ()
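When you use $0 you are referring not to all of the words in the line, but only to the first one, so LOWER and REGEX_EXTRACT_ALL are applied only to the first word. The empty results for the first and third lines are consistent with REGEX_EXTRACT_ALL needing the pattern to match the entire string: '([A-Za-z]+)' matches 'this' but not 'hello,' or 'holmes,' because of the trailing comma. Also, note that FLATTEN is being applied to a tuple, which does not produce the output that you want. You want to FLATTEN a bag.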
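To see why B comes out as (), (this), (), here is a rough sketch in Python rather than Pig. It assumes that REGEX_EXTRACT_ALL only yields groups when the pattern matches the whole input, so Python's re.fullmatch stands in for it. The helper name first_field_match is made up for illustration.

```python
import re

SAMPLE = [
    "Hello, my ;name is Holmes.",
    "This is a test, of a question on SO.",
    "Holmes, again.",
]

def first_field_match(line, pattern=r"([A-Za-z]+)"):
    """Mimic $0 after PigStorage(' '): keep only the first space-separated field."""
    first = line.split(" ")[0].lower()
    # Assumption: the pattern has to match the whole field, as re.fullmatch does.
    m = re.fullmatch(pattern, first)
    return m.group(1) if m else None

results = [first_field_match(line) for line in SAMPLE]
print(results)  # [None, 'this', None]
```

Only the second line starts with a field made up purely of letters, which is why it is the only one that survives.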
C, D, and E should all work as you expect, so it is all about massaging the data into a format that they can use.
Knowing this, you can do it like this:
-- Load in the line as a chararray so that TOKENIZE can convert it into a bag
A = load './tests/sh.txt' AS (foo:chararray);
B1 = FOREACH A GENERATE TOKENIZE(foo, ' ') AS tokens: {T:(word: chararray)} ;
-- Output from B1:
-- B1: {tokens: {T: (word: chararray)}}
-- ({(Hello,),(my),(;name),(is),(Holmes.)})
-- ({(This),(is),(a),(test,),(of),(a),(question),(on),(SO.)})
-- ({(Holmes,),(again.)})
-- Now inside a nested FOREACH we apply the appropriate transformations.
B2 = FOREACH B1 {
    -- Inside a nested FOREACH you can go over the contents of a bag
    cleaned = FOREACH tokens GENERATE
        -- The pattern has to match the whole token, so the .*? on each side
        -- absorb any leading and trailing punctuation.
        FLATTEN(REGEX_EXTRACT_ALL(LOWER(word),'.*?([a-z]+).*?')) as word ;
    -- cleaned is a bag, so when we FLATTEN it we get one word per line
    GENERATE FLATTEN(cleaned) ;
}
So now the output of B2 is:
B2: {cleaned::word: bytearray}
(hello)
(my)
(name)
(is)
(holmes)
(this)
(is)
(a)
(test)
(of)
(a)
(question)
(on)
(so)
(holmes)
(again)
Which, when fed into C, D, and E, will give the desired output.
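If you want to sanity-check the regex outside Pig, the tokenize-then-clean step can be mimicked with a small Python sketch. This is only an approximation: str.split(' ') stands in for TOKENIZE(foo, ' '), re.fullmatch stands in for REGEX_EXTRACT_ALL (assuming it needs a whole-string match), and clean_words is a hypothetical helper name.

```python
import re

SAMPLE = [
    "Hello, my ;name is Holmes.",
    "This is a test, of a question on SO.",
    "Holmes, again.",
]

def clean_words(lines, pattern=r".*?([a-z]+).*?"):
    """Split each line on spaces, lowercase each token, keep its first run of letters."""
    words = []
    for line in lines:
        for token in line.split(" "):  # stands in for TOKENIZE(foo, ' ')
            m = re.fullmatch(pattern, token.lower())
            if m:  # tokens without any letters are simply skipped in this sketch
                words.append(m.group(1))
    return words

words = clean_words(SAMPLE)
print(words)
```

One thing this makes easy to spot: the pattern keeps only the first run of letters in a token, so something like "don't" would come out as "don". If that matters for your text, the pattern would need adjusting.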
Let me know if you need me to clarify anything.