I would like to convert CSV data files that are currently stored in Amazon S3 to Parquet format using Amazon Athena and write them back to Amazon S3, without using Amazon EMR. Is this possible? Has anyone done something similar?

Teja

1 Answer

At the time of writing, Amazon Athena can query data but cannot convert data formats.

You can use Amazon EMR with Hive to convert the data to a columnar format. The steps are:

  • Create an external table pointing to the source data
  • Create a destination external table with STORED AS PARQUET
  • INSERT OVERWRITE <destination_table> SELECT * FROM <source_table>
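
The three steps above might look roughly like the following in Hive. This is only a sketch: the table names (`source_csv`, `dest_parquet`), the columns, and the `s3://example-bucket/...` paths are hypothetical placeholders, and it assumes simple comma-delimited files with one header row and no quoted fields.

```sql
-- All table names, columns, and S3 paths below are hypothetical; adjust to your data.

-- 1. External table pointing to the existing CSV files.
CREATE EXTERNAL TABLE source_csv (
  id         BIGINT,
  event_time STRING,
  amount     DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://example-bucket/csv/'
-- Assumes each file starts with a header row; remove this if yours do not.
TBLPROPERTIES ('skip.header.line.count' = '1');

-- 2. External destination table with the same columns, stored as Parquet.
CREATE EXTERNAL TABLE dest_parquet (
  id         BIGINT,
  event_time STRING,
  amount     DOUBLE
)
STORED AS PARQUET
LOCATION 's3://example-bucket/parquet/';

-- 3. Read every CSV row and rewrite it as Parquet under the destination location.
INSERT OVERWRITE TABLE dest_parquet
SELECT * FROM source_csv;
```

If the CSV files contain quoted fields or embedded commas, plain `FIELDS TERMINATED BY ','` will split them incorrectly; a CSV SerDe on the source table is probably a better fit in that case.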
John Rotenstein
  • Hi John, I am trying to do this using Amazon EMR (Hive steps option), but my cluster always remains in the Starting status. – Teja Feb 20 '18 at 17:29
  • What is the status of the slave nodes in the EMR console? Check that your [EC2 Service Limits](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-resource-limits.html) allow you to launch your desired Instance Type. – John Rotenstein Feb 21 '18 at 00:05
  • The issue was with my VPC and DHCP options set: they don't map to each other correctly. I am still working with my admin to resolve this, and then I need to try creating the cluster again. – Teja Feb 21 '18 at 14:56
  • I just wanted to confirm: is it possible to create a single file per partition using this solution? I have around 3 TB of data from the last 3 years, all stored at the hourly level in S3; the largest file is 10 GB (for a single iteration). I want to generate one Parquet file for each hour rather than multiple parts. – Shivkumar Mallesappa Nov 07 '19 at 10:47
  • @ShivkumarMallesappa Please create a new Question rather than asking via a comment on an old question. – John Rotenstein Nov 07 '19 at 11:43