5

Currently I am using Amazon Redshift as well as Amazon S3 to store data. Now I want to use Spectrum to improve performance but confused in how to use it properly.

If I am using SQL workbench can I create external schema from same or I need to create it from AWS console or Athena.?

Do I need to have Athena for a specific region.? Is it possible to use spectrum without Athena.?

Now if I try to create external schema through SQL workbench it was throwing an error "CREATE EXTERNAL SCHEMA is not enabled" How can enable this..?

Please help if someone had used Spectrum and let me know detailed steps to use spectrum.

John Rotenstein
  • 165,783
  • 13
  • 223
  • 298
Pratik Rawlekar
  • 277
  • 4
  • 12
  • 1
    Have you read the "Getting Started with Redshift Spectrum" documentation? http://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum.html – Nathan Griffiths Jun 20 '17 at 07:48
  • Yes @Nathan, I read this document but I am confused in whether to use Athena or can we do the same with Redshift..? – Pratik Rawlekar Jun 20 '17 at 09:10

1 Answers1

5

Redshift Spectrum requires an external data catalog that contains the definition of the table. It is this data catalog that contains the reference to the files in S3, rather than the external table definition in Redshift. This data catalog can be defined in Elastic MapReduce as a Hive Catalog (good if you have an existing EMR deployment) or in Athena (good if you don't have EMR or don't want to get into managing Hadoop). The Athena route can be managed fully by Redshift, if you wish.

It looks to me like your issue is one of four things. Either:

  1. Your Redshift cluster is not in an AWS region that currently supports Athena and Spectrum.
  2. Your Redshift cluster version doesn't support Spectrum yet (1.0.1294 or later).
  3. Your IAM policies don't allow Redshift control over Athena.
  4. You're not using the CREATE EXTERNAL DATABASE IF NOT EXISTS parameter on your CREATE EXTERNAL SCHEMA statement.

To allow Redshift to manage Athena you'll need to attach an IAM policy to your Redshift cluster that allows it Full Control over Athena, as well as Read access to the S3 bucket containing your data.

Once that's in place, you can create your external schema as you have been already, ensuring that the CREATE EXTERNAL DATABASE IF NOT EXISTS argument is also passed. This makes sure that the external database is created in Athena if you don't have a pre-existing configuration: http://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum-create-external-table.html

Finally, run your CREATE EXTERNAL TABLE statement, which will transparently create the table metadata in the Athena data catalog: http://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-tables.html

GShenanigan
  • 5,119
  • 5
  • 36
  • 46
  • Thanks @GShenanigan. Is it possible to create External schema through my SQL workbench with Redshift credentials and having ARN while querying..? or is it necessary to do Athena configuration separately for this..? – Pratik Rawlekar Jun 20 '17 at 10:13
  • Sorry, I'm not quite following - do you mean the ARN for S3? And defining this when you query the data instead of using the CREATE EXTERNAL TABLE statement? – GShenanigan Jun 20 '17 at 10:16
  • To create External schema we will execute following command with ARN url for athorization : create external schema spectrum from data catalog database 'spectrumdb' iam_role 'arn:aws:iam::636663244196:user/username' create external database if not exists; – Pratik Rawlekar Jun 20 '17 at 10:25
  • Okay, that should work fine. Does your IAM user have Full control over Athena and read access to S3? What AWS region are you in? – GShenanigan Jun 20 '17 at 15:23
  • I am able to create external schema but If I want to pass .csv file while creating external table, how can I achieve this .? What should be the parameter values for below option, row format delimited , fields terminated by '\t' , stored as textfile , table properties ('numRows'='172000'); – Pratik Rawlekar Jun 29 '17 at 08:04
  • It depends how your CSV file is set up: are your fields quoted, is it actually commas that you use as your delimiters? Table properties are optional hints to help Redshift optimise your queries that involve this table. Check out the documentation here for the full details: http://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_TABLE.html – GShenanigan Jun 30 '17 at 15:39