Questions tagged [parquet]

Apache Parquet is a columnar storage format for Hadoop.

Parquet was created to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.

2777 questions
152 votes, 4 answers

What are the pros and cons of parquet format compared to other formats?

Characteristics of Apache Parquet are: self-describing, columnar format, language-independent. In comparison to Avro, Sequence Files, RC File, etc., I want an overview of the formats. I have already read: How Impala Works with Hadoop File Formats, …
Ani Menon • 23,084 • 13 • 81 • 107
108 votes, 1 answer

What are the differences between feather and parquet?

Both are columnar (disk-)storage formats for use in data analysis systems. Both are integrated within Apache Arrow (the pyarrow package for Python) and are designed to correspond with Arrow as a columnar in-memory analytics layer. How do both formats…
Darkonaut • 14,188 • 6 • 32 • 48
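
A minimal sketch of the contrast, assuming pyarrow is installed; both formats are written from the same Arrow table, which is the in-memory layer they are designed around. File names are placeholders.

    import pyarrow as pa
    import pyarrow.feather as feather
    import pyarrow.parquet as pq

    # One in-memory Arrow table, two on-disk representations.
    table = pa.table({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})
    feather.write_feather(table, "data.feather")   # Feather: fast local reads and writes
    pq.write_table(table, "data.parquet")          # Parquet: compact, long-term, ecosystem-wide

    # Both round-trip back to the same Arrow table.
    assert feather.read_table("data.feather").equals(pq.read_table("data.parquet"))
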
100 votes, 7 answers

Avro vs. Parquet

I'm planning to use one of the Hadoop file formats for my Hadoop-related project. I understand Parquet is efficient for column-based queries and Avro for full scans or when we need all the columns' data. Before I proceed and choose one of the file…
Abhishek • 6,114 • 12 • 46 • 78
89 votes, 5 answers

Parquet vs ORC vs ORC with Snappy

I am running a few tests on the storage formats available with Hive, using Parquet and ORC as the major options. I included ORC once with default compression and once with Snappy. I have read many documents that state Parquet to be better in…
Rahul • 2,084 • 3 • 18 • 29
87 votes, 6 answers

How to read a Parquet file into Pandas DataFrame?

How to read a modestly sized Parquet dataset into an in-memory Pandas DataFrame without setting up a cluster computing infrastructure such as Hadoop or Spark? This is only a moderate amount of data that I would like to read in memory with a simple…
Daniel Mahler • 6,213 • 3 • 40 • 82
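
A minimal sketch for the question above, assuming pyarrow (or fastparquet) is installed as the engine; the file name and column names are placeholders. No Hadoop or Spark cluster is involved.

    import pandas as pd

    # Read the whole file into memory.
    df = pd.read_parquet("data.parquet")

    # Or read only the columns you need, which is where Parquet shines.
    subset = pd.read_parquet("data.parquet", columns=["user_id", "amount"])
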
69 votes, 11 answers

Inspect Parquet from command line

How do I inspect the content of a Parquet file from the command line? The only option I see now is
$ hadoop fs -get my-path local-file
$ parquet-tools head local-file | less
I would like to avoid creating the local-file and view the file content…
sds • 52,616 • 20 • 134 • 226
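
A hedged sketch of one way to peek at the file in place with pyarrow rather than copying it to a local file first; it assumes pyarrow's HDFS bindings (libhdfs and the Hadoop client libraries) are available on the machine, and the path is a placeholder.

    import pyarrow.fs as fs
    import pyarrow.parquet as pq

    hdfs = fs.HadoopFileSystem(host="default")          # picks up the cluster's HDFS configuration
    with hdfs.open_input_file("/my-path/part-00000.parquet") as f:
        pf = pq.ParquetFile(f)
        print(pf.schema_arrow)                          # column names and types
        print(pf.metadata)                              # row groups, sizes, encodings
        print(pf.read_row_group(0).to_pandas().head())  # first few rows
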
44 votes, 1 answer

Difference between Apache parquet and arrow

I'm looking into a way to speed up my memory-intensive frontend visualization app. I saw some people recommend Apache Arrow; while I'm looking into it, I'm confused about the difference between Parquet and Arrow. They are both columnar data structures…
Audrey • 698 • 5 • 9
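
A minimal sketch of how the two relate, assuming pyarrow is installed: Arrow is the in-memory columnar representation, Parquet is the on-disk columnar file format, and pyarrow converts between them. The file name is illustrative.

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"x": [1, 2, 3], "y": ["a", "b", "c"]})  # Arrow: lives in memory
    pq.write_table(table, "example.parquet")                  # Parquet: lives on disk
    round_tripped = pq.read_table("example.parquet")          # back into Arrow
    assert round_tripped.equals(table)
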
43 votes, 4 answers

How to read partitioned parquet files from S3 using pyarrow in python

I am looking for ways to read data from multiple partitioned directories on S3 using…
stormfield • 1,294 • 1 • 11 • 24
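
A hedged sketch using the pyarrow.dataset API available in newer pyarrow releases; the bucket, prefix, and partition column are placeholders, and S3 access assumes pyarrow's built-in S3 filesystem plus the usual AWS credentials.

    import pyarrow.dataset as ds

    dataset = ds.dataset(
        "s3://my-bucket/tables/events/",   # hypothetical hive-style layout: dt=2020-01-01/...
        format="parquet",
        partitioning="hive",
    )
    table = dataset.to_table(filter=ds.field("dt") == "2020-01-01")  # only matching partitions are read
    df = table.to_pandas()
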
41 votes, 9 answers

How do I read a Parquet in R and convert it to an R DataFrame?

I'd like to process Apache Parquet files (in my case, generated in Spark) in the R programming language. Is an R reader available? Or is work being done on one? If not, what would be the most expedient way to get there? Note: There are Java and C++…
metasim • 4,424 • 3 • 42 • 68
40 votes, 2 answers

Schema evolution in parquet format

Currently we are using the Avro data format in production. Among Avro's several good points, we know that it is good at schema evolution. Now we are evaluating the Parquet format because of its efficiency when reading random columns. So before moving…
ToBeSparkShark • 452 • 2 • 5 • 10
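
A hedged PySpark sketch of Parquet schema merging, mirroring the pattern in Spark's documentation; paths and column names are placeholders. mergeSchema asks Spark to reconcile files written with different but compatible schemas into one DataFrame schema.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df_v1 = spark.createDataFrame([(1, "a")], ["id", "name"])
    df_v2 = spark.createDataFrame([(2, "b", 3.14)], ["id", "name", "score"])
    df_v1.write.parquet("/tmp/evolving_table/key=1")
    df_v2.write.parquet("/tmp/evolving_table/key=2")

    merged = spark.read.option("mergeSchema", "true").parquet("/tmp/evolving_table")
    merged.printSchema()   # id, name, score, key — score is null for the older rows
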
37 votes, 5 answers

A comparison between fastparquet and pyarrow?

After some searching I failed to find a thorough comparison of fastparquet and pyarrow. I found this blog post (a basic comparison of speeds) and a GitHub discussion that claims that files created with fastparquet do not support AWS Athena (btw…
moshevi • 2,860 • 2 • 17 • 35
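
A minimal sketch for running such a comparison yourself, assuming both engines are installed; pandas lets you choose the Parquet engine explicitly, so the same DataFrame can be written and read through each. File names are placeholders.

    import pandas as pd

    df = pd.DataFrame({"a": range(1_000_000), "b": ["x"] * 1_000_000})

    df.to_parquet("via_pyarrow.parquet", engine="pyarrow")
    df.to_parquet("via_fastparquet.parquet", engine="fastparquet")

    pd.read_parquet("via_pyarrow.parquet", engine="pyarrow")
    pd.read_parquet("via_fastparquet.parquet", engine="fastparquet")
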
36 votes, 8 answers

How do I get schema / column names from parquet file?

I have a file stored in HDFS as part-m-00000.gz.parquet. I've tried to run hdfs dfs -text dir/part-m-00000.gz.parquet, but it's compressed, so I ran gunzip part-m-00000.gz.parquet, but it doesn't uncompress the file since it doesn't recognise the…
Super_John • 1,325 • 2 • 12 • 25
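
A hedged sketch of reading just the schema with pyarrow once the file is reachable locally (or through a filesystem object, as in the earlier sketch); the file name matches the question.

    import pyarrow.parquet as pq

    schema = pq.read_schema("part-m-00000.gz.parquet")  # reads only the footer, not the data
    print(schema.names)   # column names
    print(schema)         # names plus types
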
36 votes, 2 answers

Index in Parquet

I would like to be able to do a fast range query on a Parquet table. The amount of data to be returned is very small compared to the total size, but because a full column scan has to be performed, it is too slow for my use case. Using an index would…
Sjoerd van Hagen • 363 • 1 • 3 • 5
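
A hedged pyarrow sketch of the usual workaround: Parquet has no secondary index, but row-group and page statistics let a reader skip data, so if the file is sorted on the query column a pushed-down filter approximates an indexed range scan. Names and values are illustrative.

    import pyarrow.parquet as pq

    table = pq.read_table(
        "events.parquet",
        filters=[("event_time", ">=", 1_000_000), ("event_time", "<", 1_001_000)],
    )
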
35 votes, 3 answers

How to partition and write DataFrame in Spark without deleting partitions with no new data?

I am trying to save a DataFrame to HDFS in Parquet format using DataFrameWriter, partitioned by three column values, like this:
dataFrame.write.mode(SaveMode.Overwrite).partitionBy("eventdate", "hour", "processtime").parquet(path)
As mentioned in…
jaywilson • 351 • 1 • 3 • 5
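
A hedged PySpark sketch (the question uses the Scala API) of the dynamic partition overwrite behaviour added in Spark 2.3, which replaces only the partitions present in the incoming DataFrame instead of deleting the whole output path; the data and output path are placeholders, and the partition columns mirror the question.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    df = spark.createDataFrame(
        [("20200101", "10", "1200", 42)],
        ["eventdate", "hour", "processtime", "value"],
    )

    (df.write
       .mode("overwrite")
       .partitionBy("eventdate", "hour", "processtime")
       .parquet("hdfs:///path/to/output"))
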
35 votes, 3 answers

Reading DataFrame from partitioned parquet file

How do I read partitioned Parquet with a condition as a DataFrame? This works fine:
val dataframe = sqlContext.read.parquet("file:///home/msoproj/dev_data/dev_output/aln/partitions/data=jDD/year=2015/month=10/day=25/*")
Partitions are there for day=1 to…
WoodChopper • 3,639 • 5 • 24 • 45
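
A hedged PySpark sketch (the question uses the Scala API) of one common approach: read from the table's base path so Spark discovers the partition columns, then filter on them and let partition pruning skip the directories you don't need. The path mirrors the question; the filter values are illustrative.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = (spark.read
          .parquet("file:///home/msoproj/dev_data/dev_output/aln/partitions")
          .where("data = 'jDD' AND year = 2015 AND month = 10 AND day BETWEEN 1 AND 25"))
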