Questions tagged [data-lake]

112 questions
15 votes, 6 answers

Hadoop Vs Data Lake

I heard a new term, Data Lake. I googled it and found that a data lake is a large-scale storage repository and processing engine. A data lake provides "massive storage for any kind of data, enormous processing power and the ability to handle virtually…
Kishore
7 votes, 3 answers

Are Data Lake and Big Data the same?

I am trying to understand whether there is a real difference between a data lake and big data. If you check the concepts, both are like a big repository that stores information until it becomes necessary. So, when can we say that we are using big…
user3342209
7 votes, 2 answers

AWS Glue Data Catalog as Metastore for external services like Databricks

Let's say the data lake is on AWS, using S3 as storage and Glue as the data catalog. We can then easily use Athena, Redshift or EMR to query data on S3 using Glue as the metastore. My question is: is it possible to expose the Glue Data Catalog as a metastore for…
Obaid
4 votes, 3 answers

Data Governance solution for Databricks, Synapse and ADLS gen2

I'm new to data governance, so forgive me if the question lacks some information. Objective: We're building a data lake and enterprise data warehouse from scratch for a mid-size telecom company on the Azure platform. We're using ADLS Gen2, Databricks and Synapse for…
VB_
4 votes, 1 answer

Flatten JSON with array using AWS Glue crawler / classifier / ETL job

I'm crawling the following JSON file (it is valid JSON) from an S3 data lake. It contains two fields (device, timestamp) and an array of objects called "data". Each object in the data array differs from the others. { "device": "0013374838793C8", …
3 votes, 1 answer

AWS Data Lake Dynamo vs ElasticSearch

I am really struggling to understand how Dynamo / ElasticSearch should be used to support AWS data lake efforts (Metadata / Catalogs). It seems as though you would log the individual S3 locations of your zip archives for your sources in Dynamo and…
3 votes, 2 answers

Metadata management for (Azure) data-lake

To my understanding, a data lake is used for storing everything from raw data in its original format to processed data. However, I have not been able to understand the concept of metadata management in the (Azure) data lake. What are…
3 votes, 2 answers

Does the ROWCOUNT hint work for EXTRACT in U-SQL?

I want to allocate more vertices to the extraction job. I tried using the ROWCOUNT hint, but it doesn't seem to work: no matter what value I use for ROWCOUNT, U-SQL always allocates the same number of vertices. EXTRACT xxxx FROM @"Path" USING new…
lidong
3 votes, 2 answers

Powershell -recursive in Azure Data Lake Store

Does anyone know how to list every file in a directory inside Data Lake Store, including its subdirectories? Apparently the -recursive option does not work as it does in a normal environment. I need to run this script in Azure Data Lake Store, (which runs…
Rafa
3 votes, 2 answers

Is DynamoDB suitable as an S3 Metadata index?

I would like to store and query a large quantity of raw event data. The architecture I would like to use is the 'data lake' architecture where S3 holds the actual event data, and DynamoDB is used to index it and provide metadata. This is an…
Alex Spurling
2 votes, 1 answer

On-premise delta lake

Is it possible to implement a Delta Lake on-premises? If yes, what software/tools need to be installed? I'm trying to implement a Delta Lake on-premises to analyze some log files and database tables. My current machine is loaded with Ubuntu, Apache…
Ajoy
2 votes, 2 answers

Can the raw data layer of a Data Lake contain a Table?

All the Data Lake articles I have read on the web say that the landing area contains raw data in the form of files. But let us say I am ingesting streaming data from some IoT devices. Can I then put this data directly into a table (for example, a…
MetallicPriest
2 votes, 2 answers

Data Lake: fix corrupted files on Ingestion vs ETL

Objective: I'm building a data lake. The general flow looks like NiFi -> Storage -> ETL -> Storage -> Data Warehouse. The general rule for a data lake seems to be no pre-processing at the ingestion stage. All ongoing processing should happen at ETL, so you…
VB_
2 votes, 1 answer

Database vs DataMart vs Data Warehouse vs Data Lake

Looking for the high-level differences/comparison among: Database, Data Mart (top-down approach), Data Warehouse, Data Lake. Please use relative comparison when specifics are not available.
Ashok Goli
2 votes, 2 answers

AWS Glue Spark Job Fails to Support Upper case Column Name with Double Quotes

Problem Statement/Root Cause: We are using AWS Glue to load data from a production Postgres DB into an AWS data lake. Glue internally uses a Spark job to move the data. Our ETL process is, however, failing as Spark only supports lowercase table column…