
Let's suppose there is a tab-delimited text file (datetemp.txt). I want to load this text file into Pig for processing, but when I run the lines below it gives me the following error:

grunt> inputfile= load '/training/pig/datetemp.txt' using PigStorage() As (EventID: chararray,eventdate: chararray,count:int);

grunt> dump inputfile;

2014-09-06 08:41:23,527 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
2014-09-06 08:41:23,544 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2014-09-06 08:41:23,548 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2014-09-06 08:41:23,548 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2014-09-06 08:41:23,551 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2014-09-06 08:41:23,551 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2014-09-06 08:41:23,552 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job2739171785773930333.jar
2014-09-06 08:42:39,608 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job2739171785773930333.jar created
2014-09-06 08:42:39,612 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2014-09-06 08:42:39,619 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2014-09-06 08:42:39,630 [Thread-12] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2014-09-06 08:42:39,891 [Thread-12] INFO org.apache.hadoop.mapred.JobClient - Cleaning up the staging area hdfs://192.168.195.130:8020/var/lib/hadoop-hdfs/cache/mapred/mapred/staging/training/.staging/job_201408292336_0009
2014-09-06 08:42:39,891 [Thread-12] ERROR org.apache.hadoop.security.UserGroupInformation - PriviledgedActionException as:training (auth:SIMPLE) cause:org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Input path does not exist: hdfs://192.168.195.130:8020/training/pig/datetemp.txt
2014-09-06 08:42:40,119 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2014-09-06 08:42:40,125 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job null has failed! Stop running all dependent jobs
2014-09-06 08:42:40,125 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2014-09-06 08:42:40,131 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backend error: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Input path does not exist: hdfs://192.168.195.130:8020/training/pig/datetemp.txt
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:285)
    at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:1014)
    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1031)
    at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:172)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:943)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:896)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:896)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:531)
    at org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob.submit(ControlledJob.java:318)
    at org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl.startReadyJobs(JobControl.java:238)
    at org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl.run(JobControl.java:269)
    at java.lang.Thread.run(Thread.java:662)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$1.run(MapReduceLauncher.java:260)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://192.168.195.130:8020/training/pig/datetemp.txt
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:231)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:248)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:273)
    ... 15 more

2014-09-06 08:42:40,131 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2014-09-06 08:42:40,135 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:

HadoopVersion    PigVersion       UserId    StartedAt            FinishedAt           Features
2.0.0-cdh4.1.1   0.10.0-cdh4.1.1  training  2014-09-06 08:41:23  2014-09-06 08:42:40  UNKNOWN

Failed!

Failed Jobs:
JobId   Alias       Feature     Message     Outputs
N/A     inputfile   MAP_ONLY    Message: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Input path does not exist: hdfs://192.168.195.130:8020/training/pig/datetemp.txt
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:285)
    at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:1014)
    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1031)
    at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:172)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:943)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:896)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:896)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:531)
    at org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob.submit(ControlledJob.java:318)
    at org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl.startReadyJobs(JobControl.java:238)
    at org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl.run(JobControl.java:269)
    at java.lang.Thread.run(Thread.java:662)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$1.run(MapReduceLauncher.java:260)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://192.168.195.130:8020/training/pig/datetemp.txt
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:231)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:248)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:273)
    ... 15 more
hdfs://192.168.195.130:8020/tmp/temp-1004538676/tmp1582688785,

Input(s): Failed to read data from "/training/pig/datetemp.txt"

Output(s): Failed to produce result in "hdfs://192.168.195.130:8020/tmp/temp-1004538676/tmp1582688785"

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG: null

2014-09-06 08:42:40,135 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2014-09-06 08:42:40,142 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias inputfile
Details at logfile: /home/training/pig_1410006833865.log

Please help me here!

Prix
  • For people who found this post when looking for [ERROR 1066: Unable to open iterator for alias](http://stackoverflow.com/questions/34495085/error-1066-unable-to-open-iterator-for-alias-in-pig-generic-solution) here is a [generic solution](http://stackoverflow.com/a/34495086/983722). – Dennis Jaheruddin Dec 28 '15 at 15:06

7 Answers


PigStorage is case sensitive. Use PigStorage and not pigstorage.
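For example, a minimal sketch using the same path and schema as in the question; only the capitalization of the loader name differs between the two lines:

grunt> inputfile = LOAD '/training/pig/datetemp.txt' USING PigStorage() AS (EventID:chararray, eventdate:chararray, count:int); -- resolves: the built-in loader is spelled PigStorage
grunt> inputfile = LOAD '/training/pig/datetemp.txt' USING pigstorage() AS (EventID:chararray, eventdate:chararray, count:int); -- would fail: Pig cannot resolve a loader named pigstorage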

Gaurav Phapale

Your question title said you were trying to load a CSV file. For that, I've had good luck using org.apache.pig.piggybank.storage.CSVExcelStorage() in my LOAD statements, as demonstrated at https://martin.atlassian.net/wiki/x/WYBmAQ.
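A minimal sketch of that approach; the jar location and the CSV file name below are assumptions and will differ per installation:

grunt> REGISTER /usr/lib/pig/piggybank.jar; -- assumed location of the piggybank jar; adjust for your cluster
grunt> csvdata = LOAD '/training/pig/somedata.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage() AS (EventID:chararray, eventdate:chararray, count:int); -- hypothetical CSV file
grunt> dump csvdata;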


Why don't you write PigStorage('\t') instead of PigStorage(), since you have already mentioned that your file is tab-delimited?

The code you mentioned:

grunt> inputfile= load '/training/pig/datetemp.txt' using PigStorage() As (EventID: chararray,eventdate: chararray,count:int);
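With the tab delimiter spelled out, the line would read (a sketch of the suggestion above):

grunt> inputfile = LOAD '/training/pig/datetemp.txt' USING PigStorage('\t') AS (EventID:chararray, eventdate:chararray, count:int);
grunt> dump inputfile;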

Maybe this will solve your problem.

Let me know if it is something else.

Indrajeet Gour
hdfs://192.168.195.130:8020/training/pig/datetemp.txt 

The file was not found in your HDFS. Make sure the input file is placed at the above location.
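A sketch of one way to put it there from the OS shell, assuming datetemp.txt sits in your current local directory:

hadoop fs -mkdir -p /training/pig    # create the target directory if it does not exist (-p may be unneeded on older releases)
hadoop fs -put datetemp.txt /training/pig/datetemp.txt
hadoop fs -ls /training/pig/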

karthik

Have you checked whether the input path exists?

Try:

fs -ls /training/pig/ in the Grunt shell.

If it lists datetemp.txt, the load will work; otherwise, provide the correct input path.
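A sketch of that check in the Grunt shell, plus copying the file up if it is missing (the local path /home/training/datetemp.txt is an assumption):

grunt> fs -ls /training/pig/
grunt> fs -copyFromLocal /home/training/datetemp.txt /training/pig/datetemp.txt
grunt> fs -ls /training/pig/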

Naga

The log states the error clearly:

org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Input path does not exist: hdfs://192.168.195.130:8020/training/pig/datetemp.txt

Can you check whether the file exists in HDFS? You should also check whether Pig is running in MapReduce mode or in local mode.
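As a sketch, the execution mode is chosen when the Grunt shell is started; in local mode the LOAD path refers to the local filesystem rather than HDFS:

pig -x mapreduce    # default mode: paths such as /training/pig/datetemp.txt resolve against HDFS
pig -x local        # local mode: the same path would be read from the local filesystem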

Naga
Narasimha

You can specify ',' in the PigStorage class to read a CSV file.

The query looks like:

grunt> inputfile= load '/training/pig/datetemp.txt' using PigStorage(',') As (EventID: chararray,eventdate: chararray,count:int);

grunt> dump inputfile;

And make sure that the file '/training/pig/datetemp.txt' exists on HDFS. To test, run: hadoop fs -ls /training/pig/datetemp.txt
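Once the file is confirmed to be there, a quick sanity check before the full dump is describe, which only prints the schema of the relation (a sketch of the same session):

grunt> inputfile= load '/training/pig/datetemp.txt' using PigStorage(',') As (EventID: chararray,eventdate: chararray,count:int);
grunt> describe inputfile;
grunt> dump inputfile;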

pradeep