
I have the below data.

•   PRT_Edit & Set Shopping Cart in Retail

•   PRT_Confirm Shopping Cart for Goods

o   PRT-Ret_Process Supplier Invoice

o   PRT-Web_Overview of Orders

o   PRT_Update Outfirst Agreement

PRT_Axn_-Purchase and Requisition

The data contains special symbols (bullets), tabs, and spaces. I want to extract only the text part of each line, like this:

PRT_Edit & Set Shopping Cart in Retail

PRT_Confirm Shopping Cart for Goods

PRT-Ret_Process Supplier Invoice

PRT-Web_Overview of Orders

PRT_Update Outfirst Agreement

I tried using REGEX_EXTRACT_ALL in a Pig script as shown below, but it does not work.

PRT = LOAD '/DATA' USING TEXTLOADER() AS (LINE:CHARARRAY);

Cleansed = FOREACH PRT GENERATE REGEX_EXTRACT_ALL(LINE,'[A-Z]*') AS DATA;

When I dump Cleansed, it does not show any data. Can anyone please help?

Jahar tyagi
  • Try `Cleansed = FOREACH PRT GENERATE FLATTEN(REGEX_EXTRACT_ALL(LINE, '^[^a-zA-Z]*([a-zA-Z].*[a-zA-Z])[^a-zA-Z]*$')) AS (FIELD1:chararray), LINE;` – Wiktor Stribiżew Oct 27 '15 at 11:00
  • Thanks for your comment, Stribizhev. But this only removes the special symbols from the input data; the bullets are still there in the output. – Jahar tyagi Oct 27 '15 at 12:26
  • Thanks, Stribizhev. I have figured it out. Your script was correct; the format of the bullets actually got changed when I transferred the text file from a Windows machine to CentOS. Thanks for the support. Can you please suggest some good reference material or a website for learning regex in detail? I want to understand the use of pattern symbols like ^, * and $. – Jahar tyagi Oct 27 '15 at 12:43

1 Answer


You can use

Cleansed = FOREACH PRT GENERATE FLATTEN(
      REGEX_EXTRACT_ALL(LINE, '^[^a-zA-Z]*([a-zA-Z].*[a-zA-Z])[^a-zA-Z]*$'))
       AS (FIELD1:chararray), LINE;

The regex matches the following:

  • ^ - start of string
  • [^a-zA-Z]* - 0 or more characters other than ASCII Latin letters (a negated character class)
  • ([a-zA-Z].*[a-zA-Z]) - a capturing group that we'll refer to as FIELD1 later, matching:
    • [a-zA-Z].*[a-zA-Z] - a Latin letter, then any characters other than line breaks, as many as possible (the greedy * is used, not the lazy *?), and then the last Latin letter on the line
  • [^a-zA-Z]* - 0 or more characters other than the Latin letters
  • $ - end of string
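
To see how the pieces above behave on the sample lines, here is a minimal Python sketch. It is not Pig code: it assumes that Python's `re.fullmatch` is a fair stand-in for the whole-string match that REGEX_EXTRACT_ALL performs, and `extract_text` is a hypothetical helper name used only for illustration.

```python
import re

# Same pattern as in the Pig snippet above.
PATTERN = re.compile(r'^[^a-zA-Z]*([a-zA-Z].*[a-zA-Z])[^a-zA-Z]*$')


def extract_text(line):
    """Return the text captured by group 1, or None if the line does not match.

    fullmatch is used here on the assumption that it mirrors the
    whole-string match done by Pig's REGEX_EXTRACT_ALL.
    """
    m = PATTERN.fullmatch(line)
    return m.group(1) if m else None


samples = [
    "\u2022   PRT_Edit & Set Shopping Cart in Retail",   # real bullet character
    "\t PRT_Confirm Shopping Cart for Goods  ",          # tab and trailing spaces
    "o   PRT-Web_Overview of Orders",                    # "bullet" typed as the letter o
    "\u2022 \t",                                         # no letters at all
]

for line in samples:
    print(repr(extract_text(line)))
```

The first two samples come back with the leading bullet and surrounding whitespace removed, and the last one yields None because the pattern needs at least two letters. Note the third sample: a bullet that arrives as a literal letter `o` is itself a Latin letter, so the capture starts there and the bullet is kept. That would be consistent with bullets surviving when their encoding changes between machines.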
Wiktor Stribiżew
  • The thing is that the string is passed to the method that requires a full string match, so I guess you can even omit `^` and `$` here. To learn more about regex, I can suggest doing all lessons at [regexone.com](http://regexone.com/), reading through [regular-expressions.info](http://www.regular-expressions.info), [regex SO tag description](http://stackoverflow.com/tags/regex/info) (with many other links to great online resources), and the community SO post called [What does the regex mean](http://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean). – Wiktor Stribiżew Oct 27 '15 at 12:47
  • Thank you so much stribizhev. It was very helpful. – Jahar tyagi Oct 27 '15 at 13:16