Questions tagged [parquet]

Apache Parquet is a columnar storage format for Hadoop.

Parquet was created to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.

2777 questions
152 votes, 4 answers

What are the pros and cons of parquet format compared to other formats?

Characteristics of Apache Parquet are: self-describing, columnar format, language-independent. In comparison to Avro, Sequence Files, RC File, etc., I want an overview of the formats. I have already read: How Impala Works with Hadoop File Formats, …
Ani Menon • 23,084 • 13 • 81 • 107
108 votes, 1 answer

What are the differences between feather and parquet?

Both are columnar (disk-)storage formats for use in data analysis systems. Both are integrated within Apache Arrow (the pyarrow package for Python) and are designed to correspond with Arrow as a columnar in-memory analytics layer. How do both formats…
Darkonaut • 14,188 • 6 • 32 • 48
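
A minimal sketch of the contrast, assuming pyarrow is installed; both formats are written from the same Arrow table, which is the in-memory layer they are designed around. File names are placeholders.

    import pyarrow as pa
    import pyarrow.feather as feather
    import pyarrow.parquet as pq

    # One in-memory Arrow table, two on-disk representations.
    table = pa.table({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})
    feather.write_feather(table, "data.feather")   # Feather: fast local reads and writes
    pq.write_table(table, "data.parquet")          # Parquet: compact, long-term, ecosystem-wide

    # Both round-trip back to the same Arrow table.
    assert feather.read_table("data.feather").equals(pq.read_table("data.parquet"))
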
100 votes, 7 answers

Avro vs. Parquet

I'm planning to use one of the Hadoop file formats for my Hadoop-related project. I understand Parquet is efficient for column-based queries and Avro for full scans or when we need all the columns' data. Before I proceed and choose one of the file…
Abhishek • 6,114 • 12 • 46 • 78
89 votes, 5 answers

Parquet vs ORC vs ORC with Snappy

I am running a few tests on the storage formats available with Hive, using Parquet and ORC as the major options. I included ORC once with default compression and once with Snappy. I have read many documents that state Parquet to be better in…
Rahul • 2,084 • 3 • 18 • 29
87 votes, 6 answers

How to read a Parquet file into Pandas DataFrame?

How to read a modestly sized Parquet dataset into an in-memory Pandas DataFrame without setting up a cluster computing infrastructure such as Hadoop or Spark? This is only a moderate amount of data that I would like to read in memory with a simple…
Daniel Mahler • 6,213 • 3 • 40 • 82
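
A minimal sketch for the question above, assuming pyarrow (or fastparquet) is installed as the engine; the file name and column names are placeholders. No Hadoop or Spark cluster is involved.

    import pandas as pd

    # Read the whole file into memory.
    df = pd.read_parquet("data.parquet")

    # Or read only the columns you need, which is where Parquet shines.
    subset = pd.read_parquet("data.parquet", columns=["user_id", "amount"])
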
69 votes, 11 answers

Inspect Parquet from command line

How do I inspect the content of a Parquet file from the command line? The only option I see now is
$ hadoop fs -get my-path local-file
$ parquet-tools head local-file | less
I would like to avoid creating the local-file and view the file content…
sds • 52,616 • 20 • 134 • 226
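
A hedged sketch of one way to peek at the file in place with pyarrow rather than copying it to a local file first; it assumes pyarrow's HDFS bindings (libhdfs and the Hadoop client libraries) are available on the machine, and the path is a placeholder.

    import pyarrow.fs as fs
    import pyarrow.parquet as pq

    hdfs = fs.HadoopFileSystem(host="default")          # picks up the cluster's HDFS configuration
    with hdfs.open_input_file("/my-path/part-00000.parquet") as f:
        pf = pq.ParquetFile(f)
        print(pf.schema_arrow)                          # column names and types
        print(pf.metadata)                              # row groups, sizes, encodings
        print(pf.read_row_group(0).to_pandas().head())  # first few rows
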
44 votes, 1 answer

Difference between Apache parquet and arrow

I'm looking into a way to speed up my memory-intensive frontend visualization app. I saw some people recommend Apache Arrow; while I'm looking into it, I'm confused about the difference between Parquet and Arrow. They are both columnar data structures…
Audrey • 698 • 5 • 9
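
A minimal sketch of how the two relate, assuming pyarrow is installed: Arrow is the in-memory columnar representation, Parquet is the on-disk columnar file format, and pyarrow converts between them. The file name is illustrative.

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"x": [1, 2, 3], "y": ["a", "b", "c"]})  # Arrow: lives in memory
    pq.write_table(table, "example.parquet")                  # Parquet: lives on disk
    round_tripped = pq.read_table("example.parquet")          # back into Arrow
    assert round_tripped.equals(table)
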
43 votes, 4 answers

How to read partitioned parquet files from S3 using pyarrow in python

I am looking for ways to read data from multiple partitioned directories on S3 using…
stormfield • 1,294 • 1 • 11 • 24
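
A hedged sketch using the pyarrow.dataset API available in newer pyarrow releases; the bucket, prefix, and partition column are placeholders, and S3 access assumes pyarrow's built-in S3 filesystem plus the usual AWS credentials.

    import pyarrow.dataset as ds

    dataset = ds.dataset(
        "s3://my-bucket/tables/events/",   # hypothetical hive-style layout: dt=2020-01-01/...
        format="parquet",
        partitioning="hive",
    )
    table = dataset.to_table(filter=ds.field("dt") == "2020-01-01")  # only matching partitions are read
    df = table.to_pandas()
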
41 votes, 9 answers

How do I read a Parquet in R and convert it to an R DataFrame?

I'd like to process Apache Parquet files (in my case, generated in Spark) in the R programming language. Is an R reader available? Or is work being done on one? If not, what would be the most expedient way to get there? Note: There are Java and C++…
metasim • 4,424 • 3 • 42 • 68
40 votes, 2 answers

Schema evolution in parquet format

Currently we are using the Avro data format in production. Among Avro's several good points, we know that it is good at schema evolution. Now we are evaluating the Parquet format because of its efficiency when reading random columns. So before moving…
ToBeSparkShark • 452 • 2 • 5 • 10
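
A hedged PySpark sketch of Parquet schema merging, mirroring the pattern in Spark's documentation; paths and column names are placeholders. mergeSchema asks Spark to reconcile files written with different but compatible schemas into one DataFrame schema.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df_v1 = spark.createDataFrame([(1, "a")], ["id", "name"])
    df_v2 = spark.createDataFrame([(2, "b", 3.14)], ["id", "name", "score"])
    df_v1.write.parquet("/tmp/evolving_table/key=1")
    df_v2.write.parquet("/tmp/evolving_table/key=2")

    merged = spark.read.option("mergeSchema", "true").parquet("/tmp/evolving_table")
    merged.printSchema()   # id, name, score, key — score is null for the older rows
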
37 votes, 5 answers

A comparison between fastparquet and pyarrow?

After some searching I failed to find a thorough comparison of fastparquet and pyarrow. I found this blog post (a basic comparison of speeds) and a GitHub discussion that claims that files created with fastparquet do not support AWS Athena (btw…
moshevi • 2,860 • 2 • 17 • 35
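
A minimal sketch for running such a comparison yourself, assuming both engines are installed; pandas lets you choose the Parquet engine explicitly, so the same DataFrame can be written and read through each. File names are placeholders.

    import pandas as pd

    df = pd.DataFrame({"a": range(1_000_000), "b": ["x"] * 1_000_000})

    df.to_parquet("via_pyarrow.parquet", engine="pyarrow")
    df.to_parquet("via_fastparquet.parquet", engine="fastparquet")

    pd.read_parquet("via_pyarrow.parquet", engine="pyarrow")
    pd.read_parquet("via_fastparquet.parquet", engine="fastparquet")
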
36 votes, 8 answers

How do I get schema / column names from parquet file?

I have a file stored in HDFS as part-m-00000.gz.parquet. I've tried to run hdfs dfs -text dir/part-m-00000.gz.parquet, but it's compressed, so I ran gunzip part-m-00000.gz.parquet, but it doesn't uncompress the file since it doesn't recognise the…
Super_John • 1,325 • 2 • 12 • 25
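
A hedged sketch of reading just the schema with pyarrow once the file is reachable locally (or through a filesystem object, as in the earlier sketch); the file name matches the question.

    import pyarrow.parquet as pq

    schema = pq.read_schema("part-m-00000.gz.parquet")  # reads only the footer, not the data
    print(schema.names)   # column names
    print(schema)         # names plus types
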
36 votes, 2 answers

Index in Parquet

I would like to be able to do a fast range query on a Parquet table. The amount of data to be returned is very small compared to the total size, but because a full column scan has to be performed, it is too slow for my use case. Using an index would…
Sjoerd van Hagen • 363 • 1 • 3 • 5
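
A hedged pyarrow sketch of the usual workaround: Parquet has no secondary index, but row-group and page statistics let a reader skip data, so if the file is sorted on the query column a pushed-down filter approximates an indexed range scan. Names and values are illustrative.

    import pyarrow.parquet as pq

    table = pq.read_table(
        "events.parquet",
        filters=[("event_time", ">=", 1_000_000), ("event_time", "<", 1_001_000)],
    )
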
35 votes, 3 answers

How to partition and write DataFrame in Spark without deleting partitions with no new data?

I am trying to save a DataFrame to HDFS in Parquet format using DataFrameWriter, partitioned by three column values, like this:
dataFrame.write.mode(SaveMode.Overwrite).partitionBy("eventdate", "hour", "processtime").parquet(path)
As mentioned in…
jaywilson • 351 • 1 • 3 • 5
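
A hedged PySpark sketch (the question uses the Scala API) of the dynamic partition overwrite behaviour added in Spark 2.3, which replaces only the partitions present in the incoming DataFrame instead of deleting the whole output path; the data and output path are placeholders, and the partition columns mirror the question.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    df = spark.createDataFrame(
        [("20200101", "10", "1200", 42)],
        ["eventdate", "hour", "processtime", "value"],
    )

    (df.write
       .mode("overwrite")
       .partitionBy("eventdate", "hour", "processtime")
       .parquet("hdfs:///path/to/output"))
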
35 votes, 3 answers

Reading DataFrame from partitioned parquet file

How do I read partitioned Parquet with a condition as a DataFrame? This works fine:
val dataframe = sqlContext.read.parquet("file:///home/msoproj/dev_data/dev_output/aln/partitions/data=jDD/year=2015/month=10/day=25/*")
Partitions are there for day=1 to…
WoodChopper • 3,639 • 5 • 24 • 45
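
A hedged PySpark sketch (the question uses the Scala API) of one common approach: read from the table's base path so Spark discovers the partition columns, then filter on them and let partition pruning skip the directories you don't need. The path mirrors the question; the filter values are illustrative.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = (spark.read
          .parquet("file:///home/msoproj/dev_data/dev_output/aln/partitions")
          .where("data = 'jDD' AND year = 2015 AND month = 10 AND day BETWEEN 1 AND 25"))
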