AWS Glue is a fully managed ETL (extract, transform, and load) service that can categorize your data, clean it, enrich it, and move it between various data stores. AWS Glue consists of a central data repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python code, and a scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue is serverless, so there's no infrastructure to manage.
AWS Glue consists of a number of components:
- A data catalog (implementing the functionality of a Hive Metastore) across AWS data sources, primarily S3, but also any JDBC data source on AWS, including Amazon RDS and Amazon Redshift
- Crawlers, which perform data classification and schema discovery across S3 data and register data with the data catalog
- A distributed data processing framework that extends PySpark with functionality for increased schema flexibility
- Code generation tools to template and bootstrap data processing scripts
- Scheduling for crawlers and data processing scripts
- Serverless development and execution of scripts in an Apache Spark (2.x) environment
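
As a rough illustration of how the crawler and scheduling components fit together, the sketch below assembles the arguments for the Glue `CreateCrawler` API as exposed by boto3's `create_crawler`. The crawler name, IAM role ARN, database name, bucket path, and region are all hypothetical placeholders, and the helper functions are illustrative rather than part of AWS Glue. The request is built as a plain dictionary so it can be inspected without AWS credentials; only `create_crawler` actually contacts AWS.

```python
def build_crawler_request(name, role_arn, database, s3_path, schedule=None):
    """Assemble keyword arguments for the Glue CreateCrawler API call."""
    request = {
        "Name": name,
        "Role": role_arn,
        # Tables discovered by the crawler are registered in this catalog database.
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if schedule is not None:
        # Glue schedules use a cron(...) expression evaluated in UTC.
        request["Schedule"] = schedule
    return request


def create_crawler(request, region_name="us-east-1"):
    """Send the request to AWS Glue (requires boto3 and valid credentials)."""
    import boto3  # imported lazily so the builder above works without the SDK

    glue = boto3.client("glue", region_name=region_name)
    glue.create_crawler(**request)


# All values below are hypothetical placeholders.
request = build_crawler_request(
    name="example-sales-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
    database="example_db",
    s3_path="s3://example-bucket/sales/",
    schedule="cron(0 2 * * ? *)",  # daily at 02:00 UTC
)
```

Calling `create_crawler(request)` would register the crawler. Omitting `schedule` leaves it to be started on demand.
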
Data registered in the AWS Glue Data Catalog is available to many AWS services, including:
- Amazon Redshift Spectrum
- Amazon EMR (Hadoop, Hive, HBase, Presto, Spark, Impala, etc.)
- Amazon Athena
- AWS Glue scripts
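
To make the catalog sharing concrete, here is a minimal sketch of querying a catalog table from Amazon Athena through boto3's `start_query_execution`. It assumes a crawler has already registered a table; the database name `example_db`, the table name `sales`, the results bucket, and the region are hypothetical. As before, the request is built separately from the call that needs AWS credentials.

```python
def build_athena_query(database, table, output_location, limit=10):
    """Assemble keyword arguments for Athena's StartQueryExecution call."""
    return {
        "QueryString": f'SELECT * FROM "{table}" LIMIT {int(limit)}',
        # The database name is resolved against the AWS Glue Data Catalog.
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def submit_query(request, region_name="us-east-1"):
    """Submit the query (requires boto3 and valid credentials)."""
    import boto3  # imported lazily so the builder above works without the SDK

    athena = boto3.client("athena", region_name=region_name)
    response = athena.start_query_execution(**request)
    return response["QueryExecutionId"]


# All values below are hypothetical placeholders.
query = build_athena_query(
    database="example_db",
    table="sales",
    output_location="s3://example-bucket/athena-results/",
)
```

`submit_query(query)` returns a query execution ID, which can then be polled for results.
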