Questions tagged [aws-glue]

AWS Glue is a fully managed ETL (extract, transform, and load) service that can categorize your data, clean it, enrich it, and move it between various data stores. AWS Glue consists of a central data repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python code, and a scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue is serverless, so there's no infrastructure to manage.

AWS Glue consists of a number of components:

A data catalog (implementing the functionality of a Hive Metastore) across AWS data sources, primarily S3, but also any JDBC data source on AWS, including Amazon RDS and Amazon Redshift
Crawlers, which perform data classification and schema discovery across S3 data and register data with the data catalog
A distributed data processing framework which extends PySpark with functionality for increased schema flexibility.
Code generation tools to template and bootstrap data processing scripts
Scheduling for crawlers and data processing scripts
Serverless development and execution of scripts in an Apache Spark (2.x) environment.

Data registered in the AWS Glue Data Catalog is available to many AWS services, including:

Amazon Redshift Spectrum
EMR (Hadoop, Hive, HBase, Presto, Spark, Impala, etc.)
Amazon Athena
AWS Glue scripts

2337 questions

4 answers

How to create AWS Glue table where partitions have different columns? ('HIVE_PARTITION_SCHEMA_MISMATCH')

As per this AWS Forum Thread, does anyone know how to use AWS Glue to create an AWS Athena table whose partitions contain different schemas (in this case different subsets of columns from the table schema)? At the moment, when I run the crawler over…

asked Sep 15 '17 at 13:44

rjmurt

9 answers

Can I test AWS Glue code locally?

After reading Amazon docs, my understanding is that the only way to run/test a Glue script is to deploy it to a dev endpoint and debug remotely if necessary. At the same time, if the (Python) code consists of multiple files and packages, all except…

python amazon-web-services aws-glue

asked Jan 18 '18 at 05:15

lfk

7 answers

AWS Glue Crawler Not Creating Table

I have a crawler I created in AWS Glue that does not create a table in the Data Catalog after it successfully completes. The crawler takes roughly 20 seconds to run and the logs show it successfully completed. CloudWatch log shows: Benchmark:…

amazon-web-services aws-glue

asked Nov 01 '17 at 17:02

Vince

6 answers

How do I write messages to the output log on AWS Glue?

AWS Glue jobs log output and errors to two different CloudWatch logs, /aws-glue/jobs/error and /aws-glue/jobs/output by default. When I include print() statements in my scripts for debugging, they get written to the error log (/aws-glue/jobs/error).…

pyspark aws-glue

asked Feb 21 '18 at 19:51

Jesse Clark

6 answers

AWS Glue to Redshift: Is it possible to replace, update or delete data?

Here are some bullet points in terms of how I have things set up: I have CSV files uploaded to S3 and a Glue crawler set up to create the table and schema. I have a Glue job set up that writes the data from the Glue table to our Amazon Redshift…

amazon-web-services jdbc pyspark aws-glue

asked Sep 14 '17 at 21:08

krchun

6 answers

Can we consider AWS Glue as a replacement for EMR?

Just a quick question for the experts to clarify: since AWS Glue, as an ETL tool, can provide companies with benefits such as minimal or no server maintenance, cost savings by avoiding over-provisioning or under-provisioning resources, besides running…

amazon-web-services etl amazon-emr aws-glue

asked Jan 12 '18 at 09:09

Yuva

5 answers

AWS Glue: How to handle nested JSON with varying schemas

Objective: We're hoping to use the AWS Glue Data Catalog to create a single table for JSON data residing in an S3 bucket, which we would then query and parse via Redshift Spectrum. Background: The JSON data is from DynamoDB Streams and is deeply…

amazon-redshift aws-glue amazon-dynamodb-streams amazon-redshift-spectrum

asked Mar 23 '18 at 21:09

ehelander

4 answers

DynamicFrame vs DataFrame

What is the difference? I know that DynamicFrame was created for AWS Glue, but AWS Glue also supports DataFrame. When should DynamicFrame be used in AWS Glue?

amazon-web-services apache-spark pyspark aws-glue

asked Oct 15 '18 at 18:12

Alex Oh

1 answer

AWS Glue Job Input Parameters

I am relatively new to AWS and this may be a somewhat less technical question, but at present AWS Glue notes a maximum of 25 jobs permitted to be created. We are loading in a series of tables that each have their own job that subsequently appends audit…

amazon-web-services aws-glue

asked Sep 13 '18 at 15:08

Sauron

1 answer

AWS Athena concurrency limits: Number of submitted queries VS number of running queries

According to AWS Athena limitations you can submit up to 20 queries of the same type at a time, but it is a soft limit and can be increased on request. I use boto3 to interact with Athena and my script submits 16 CTAS queries each of which takes…

concurrency limit amazon-emr amazon-athena aws-glue

asked Jul 22 '19 at 12:22

Ilya Kisil

3 answers

Overwrite parquet files from dynamic frame in AWS Glue

I use dynamic frames to write a parquet file in S3, but if a file already exists my program appends a new file instead of replacing it. The statement that I use is this: glueContext.write_dynamic_frame.from_options(frame = table, …

amazon-web-services parquet aws-glue

asked Aug 24 '18 at 09:47

Mateo Rod

2 answers

AWS Glue issue with double quote and commas

I have this CSV file: reference,address V7T452F4H9,"12410 W 62TH ST, AA D" The following options are being used in the table definition ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ( 'quoteChar'='\"', …

hadoop hive presto amazon-athena aws-glue

asked May 15 '18 at 15:35

ln9187

4 answers

AWS Glue pricing against AWS EMR

I am doing a pricing comparison between AWS Glue and AWS EMR so as to choose between EMR & Glue. I have considered 6 DPUs (4 vCPUs + 16 GB Memory) with an ETL job running for 10 minutes for 30 days. Expected crawler requests are assumed to be 1…

amazon-web-services amazon-emr aws-glue cost-management

asked Feb 07 '18 at 11:32

Yuva

12 answers

Use AWS Glue Python with NumPy and Pandas Python Packages

What is the easiest way to use packages such as NumPy and Pandas within the new ETL tool on AWS called Glue? I have a completed script within Python I would like to run in AWS Glue that utilizes NumPy and Pandas.

python pandas amazon-web-services aws-lambda aws-glue

asked Sep 20 '17 at 18:42

jumpman23

3 answers

AWS Glue takes a long time to finish

I just ran a very simple job as follows: glueContext = GlueContext(SparkContext.getOrCreate()) l_table = glueContext.create_dynamic_frame.from_catalog( database="gluecatalog", table_name="fctable") l_table =…

amazon-web-services aws-glue

asked Aug 29 '17 at 19:36

Shawn
