Highest Voted 'data-lake' Questions

15

votes

6 answers

Hadoop Vs Data Lake

I heard a new term, Data Lake. I googled it and found that a data lake is a large-scale storage repository and processing engine. A data lake provides "massive storage for any kind of data, enormous processing power and the ability to handle virtually…

hadoop data-warehouse data-lake

asked Mar 14 '16 at 12:24

Kishore

7

votes

3 answers

Are Data Lake and Big Data the same?

I am trying to understand whether there is a real difference between a data lake and big data. If you check the concepts, both are like a big repository that saves information until it becomes necessary. So, when can we say that we are using big…

bigdata data-lake

asked Sep 18 '18 at 15:30

user3342209

7

votes

2 answers

AWS Glue Data Catalog as Metastore for external services like Databricks

Let's say the data lake is on AWS, using S3 as storage and Glue as the data catalog. So, we can easily use Athena, Redshift or EMR to query data on S3 using Glue as the metastore. My question is: is it possible to expose the Glue data catalog as a metastore for…

amazon-s3 databricks aws-glue data-lake hive-metastore

asked Apr 16 '18 at 02:36

Obaid

4

votes

3 answers

Data Governance solution for Databricks, Synapse and ADLS gen2

I'm new to data governance, so forgive me if the question lacks some information. Objective: We're building a data lake & enterprise data warehouse from scratch for a mid-size telecom company on the Azure platform. We're using ADLS gen2, Databricks and Synapse for…

azure architecture databricks data-lake azure-data-catalog

asked May 11 '20 at 22:20

VB_

4

votes

1 answer

Flatten JSON with array using AWS Glue crawler / classifier / ETL job

I'm crawling the following JSON file (it's valid JSON) from an S3 data lake. Inside there are 2 fields (device, timestamp) and an array of objects called "data". Each object in the data array differs from the others. { "device": "0013374838793C8", …

json amazon-web-services amazon-athena aws-glue data-lake

asked Mar 19 '19 at 11:47

Maciej Malak

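For the record shape described in this question (two top-level fields plus a "data" array whose objects differ from one another), the flattening step can be sketched in plain Python as one output row per array element. This is only an illustration, not the Glue crawler or classifier behaviour: the function name `flatten_record`, the `data_` column prefix, and every value in `sample` are made up here, since the original JSON is truncated.

```python
from typing import Any, Dict, Iterator


def flatten_record(record: Dict[str, Any], array_key: str = "data") -> Iterator[Dict[str, Any]]:
    """Yield one flat row per element of record[array_key]."""
    # Top-level fields (e.g. device, timestamp) are copied onto every row.
    parent = {k: v for k, v in record.items() if k != array_key}
    for item in record.get(array_key, []):
        row = dict(parent)
        # Prefix nested keys so they cannot collide with the parent fields.
        for key, value in item.items():
            row[f"{array_key}_{key}"] = value
        yield row


# Hypothetical sample; the values are NOT taken from the question.
sample = {
    "device": "dev-1",
    "timestamp": "2019-03-19T11:47:00Z",
    "data": [{"temperature": 21.5}, {"humidity": 40, "unit": "%"}],
}
rows = list(flatten_record(sample))
```

Because the array objects differ, the rows end up with different column sets; a downstream table would have to treat the missing columns as nulls.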

3

votes

1 answer

AWS Data Lake Dynamo vs ElasticSearch

I am really struggling to understand how Dynamo / ElasticSearch should be used to support AWS data lake efforts (Metadata / Catalogs). It seems as though you would log the individual S3 locations of your zip archives for your sources in Dynamo and…

amazon-web-services elasticsearch amazon-s3 amazon-dynamodb data-lake

asked Oct 09 '17 at 18:38

scarpacci

3

votes

2 answers

Metadata management for (Azure) data-lake

To my understanding, a data-lake solution is used for storing everything from raw data in its original format to processed data. I have not been able to understand the concept of metadata management in the (Azure) data lake, though. What are…

azure metadata azure-data-lake database-metadata data-lake

asked Mar 27 '17 at 06:08

AlexGuevara

3

votes

2 answers

Does the ROWCOUNT hint work for EXTRACT in U-SQL?

I want to allocate more vertexes to the extraction job. I tried using the ROWCOUNT hint, but it doesn't seem to work: no matter what value I use for ROWCOUNT, U-SQL always allocates the same number of vertexes. EXTRACT xxxx FROM @"Path" USING new…

azure-data-lake u-sql data-lake

asked Mar 07 '17 at 21:30

lidong

3

votes

2 answers

PowerShell -recursive in Azure Data Lake Store

Does anyone know how to list every file in a directory inside Data Lake Store, including its subdirectories? Apparently the -recursive instruction does not work as it does in a normal environment. I need to run this script in Azure Data Lake Store (which runs…

powershell azure recursion azure-data-lake data-lake

asked Dec 22 '16 at 02:16

Rafa

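When a store only offers a one-level directory listing, the recursive walk this question asks about can be done by hand. Below is a minimal, language-neutral sketch in Python rather than PowerShell. It assumes a hypothetical callable `list_children(path)` that returns `(name, is_directory)` pairs for a single directory; that callable and the in-memory `fake_store` are stand-ins, not part of any Azure API.

```python
from typing import Callable, Iterable, Iterator, Tuple

Entry = Tuple[str, bool]  # (name, is_directory)


def walk_files(root: str, list_children: Callable[[str], Iterable[Entry]]) -> Iterator[str]:
    """Yield the full path of every file under root, descending into subdirectories."""
    stack = [root]  # directories still to be listed
    while stack:
        current = stack.pop()
        for name, is_dir in list_children(current):
            path = f"{current.rstrip('/')}/{name}"
            if is_dir:
                stack.append(path)  # visit later instead of recursing
            else:
                yield path


# Hypothetical in-memory stand-in for a one-level listing call.
fake_store = {
    "/logs": [("2016", True), ("readme.txt", False)],
    "/logs/2016": [("12", True)],
    "/logs/2016/12": [("a.csv", False), ("b.csv", False)],
}
files = sorted(walk_files("/logs", lambda p: fake_store.get(p, [])))
```

An explicit stack is used instead of recursion so that deep directory trees cannot hit a recursion limit.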

3

votes

2 answers

Is DynamoDB suitable as an S3 Metadata index?

I would like to store and query a large quantity of raw event data. The architecture I would like to use is the 'data lake' architecture where S3 holds the actual event data, and DynamoDB is used to index it and provide metadata. This is an…

amazon-s3 amazon-dynamodb data-lake

asked Nov 10 '16 at 15:05

Alex Spurling

2

votes

1 answer

On-premise delta lake

Is it possible to implement a delta lake on-premise? If yes, what software/tools need to be installed? I'm trying to implement a delta lake on-premise to analyze some log files and database tables. My current machine is loaded with Ubuntu, Apache…

delta-lake data-lake

asked Feb 09 '21 at 19:36

Ajoy

2

votes

2 answers

Can the raw data layer of a Data Lake contain a Table?

All the Data Lake articles I have read on the web say that the landing area contains raw data in the form of files. But let us say I am ingesting streaming data from some IoT devices. Can I then put this data directly into a table (for example, a…

hadoop hive data-lake

asked Jun 03 '20 at 22:13

MetallicPriest

2

votes

2 answers

Data Lake: fix corrupted files on Ingestion vs ETL

Objective: I'm building a data lake; the general flow looks like NiFi -> Storage -> ETL -> Storage -> Data Warehouse. The general rule for a data lake is no pre-processing at the ingestion stage. All ongoing processing should happen at ETL, so you…

architecture etl data-ingestion data-lake

asked May 14 '20 at 11:29

VB_

2

votes

1 answer

Database vs DataMart vs Data Warehouse vs Data Lake

Looking for the high-level differences/comparison among: Database, Data Mart (top-down approach), Data Warehouse, Data Lake. Please use relative comparison when specifics are not available.

database comparison data-warehouse data-lake datamart

asked May 12 '20 at 12:23

Ashok Goli

2

votes

2 answers

AWS Glue Spark Job Fails to Support Upper case Column Name with Double Quotes

Problem Statement/Root Cause: We are using AWS Glue to load data from a production PostgreSQL DB into an AWS data lake. Glue internally uses a Spark job to move the data. Our ETL process is, however, failing as Spark only supports lowercase table column…

pyspark aws-glue aws-glue-data-catalog data-lake

asked Sep 25 '19 at 07:20

rajmohan k

