0

I know redshift and redshift spectrum doesn't support nested type, but I want to know is there any trick that we can bypass that limitation and query our nested data in S3 with Redshift Spectrum? In this post the guy shows how we can do it for JSON files, but it's not the same for Parquet. Do we have any other trick that can be applied on Parquet file?

The actual Schema is something like this: (extracted by AWS-Glue crawler)

CREATE EXTERNAL TABLE `parquet_nested`(
  `event_time` string, 
  `event_id` string, 
  `user` struct<ip_address:string,id:string,country:string>, 
  `device` struct<platform:string,device_id:string,user_agent:string>
  )
PARTITIONED BY ( 
  `partition_0` string, 
  `partition_1` string, 
  `partition_2` string, 
  `partition_3` string, 
  `partition_4` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://...'
Am1rr3zA
  • 5,777
  • 15
  • 66
  • 112

1 Answers1

1

@Am1rr3zA Now, RedShift spectrum supports querying nested data set. It supports not only JSON but also compression formats, like parquet, orc. Here, is the reference sample from AWS

I have created external tables pointing to parquet files in my s3 bucket. So it's possible.

Give this script a try

CREATE EXTERNAL TABLE spectrum.parquet_nested (
   event_time varchar(20),
   event_id varchar(20),
   user 
 struct<ip_address:varchar(20),id:varchar(20),country:varchar(20)>,
   device 
 struct<platform:varchar(20),device_id:varchar(20),user_agent:varchar(20)>
    )
    STORED AS PARQUET
    LOCATION 's3://BUCKETNAME/parquetFolder/';

Hope, this saves your trick adventure :)

SunSmiles
  • 186
  • 9