Questions tagged [google-cloud-dataprep]

An intelligent cloud data service to visually explore, clean, and prepare data for analysis.

Dataprep (more precisely, Cloud Dataprep by Trifacta) is a visual data transformation tool built by Trifacta and offered as part of Google Cloud Platform.

It is capable of ingesting data from and writing data to several other Google services (BigQuery, Cloud Storage).

Data is transformed using recipes which are shown alongside a visual representation of the data. This allows the user to preview changes, profile columns and spot outliers and type mismatches.

When a Dataprep flow is run (either manually or on a schedule), a Dataflow job is created to execute it. Dataflow is Google's managed Apache Beam service.
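For orientation, below is a minimal hand-written Apache Beam pipeline of the kind Dataflow executes. This is only a sketch: Dataprep generates far more elaborate pipelines automatically, and the bucket paths and the cleaning step here are made-up placeholders.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder paths; replace with your own bucket.
INPUT = "gs://my-bucket/raw/orders.csv"
OUTPUT = "gs://my-bucket/clean/orders"

def clean_row(line):
    """Trim whitespace and lowercase each field: a stand-in for a recipe step."""
    return ",".join(field.strip().lower() for field in line.split(","))

# Pass --runner=DataflowRunner (plus --project, --region and --temp_location)
# on the command line to execute this on Dataflow instead of locally.
options = PipelineOptions()
with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | "Read" >> beam.io.ReadFromText(INPUT)
     | "Clean" >> beam.Map(clean_row)
     | "Write" >> beam.io.WriteToText(OUTPUT))
```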

196 questions
6 votes · 1 answer

Can Google Data Fusion do the same data cleaning as DataPrep?

I want to run a machine learning model with some data. Before training the model with this data I need to process it, so I have been reading about ways to do it. First of all, create a Dataflow pipeline to upload it to BigQuery or Google Cloud Storage,…
6 votes · 2 answers

Can Google Cloud Dataprep monitor a GCS path for new files?

Google Cloud Dataprep seems great and we've used it to manually import static datasets; however, I would like to execute it more than once so that it can consume new files uploaded to a GCS path. I can see that you can set up a schedule for Dataprep,…
Matt Byrne
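Dataprep itself has no built-in file watcher, so one possible approach (a sketch under assumptions, not an official feature) is a Cloud Function on the bucket's object-finalize event that starts a Dataprep job through the Dataprep API. The environment variables and the recipe id below are hypothetical.

```python
import os
import requests

# Hypothetical configuration, supplied via environment variables.
DATAPREP_TOKEN = os.environ["DATAPREP_TOKEN"]
RECIPE_ID = int(os.environ["DATAPREP_RECIPE_ID"])

def on_new_file(event, context):
    """Entry point for a google.storage.object.finalize trigger."""
    print(f"New object: gs://{event['bucket']}/{event['name']}")
    # Run the recipe; the flow's dataset must already point at the GCS path.
    resp = requests.post(
        "https://api.clouddataprep.com/v4/jobGroups",
        headers={"Authorization": f"Bearer {DATAPREP_TOKEN}"},
        json={"wrangledDataset": {"id": RECIPE_ID}},
    )
    resp.raise_for_status()
```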
5 votes · 0 answers

Job Fails with odd message

I have a job that is failing at the very start with the message: "@*" and "@N" are reserved sharding specs. Filepattern must not contain any of them. I have altered the destination location to be something other than the default (an email address)…
williamvicary
4 votes · 3 answers

Dataprep vs Dataflow vs Dataproc

To perform source data preparation, data transformation or data cleansing, in what scenario should we use Dataprep vs Dataflow vs Dataproc?
4 votes · 1 answer

Dataprep - Scheduling Jobs

To anyone on the Dataprep beta: is it possible to schedule jobs to be run? If so, is it via the App Engine cron service? I can't quite follow the App Engine cron instructions, but I want to make sure it's not a dead end before I try. Thanks
Aaron Harris
3 votes · 2 answers

How do I run Google Dataprep jobs automatically?

Is there a way to trigger a Google Dataprep flow over an API? I need to run around 30 different flows every day. Every day the source dataset changes, and the result has to be appended to a Google BigQuery table. Is there a way to automate this process?…
stkvtflw
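For illustration, here is a sketch of daily automation against the Dataprep API, assuming an access token and per-flow recipe ids (the ids below are made up). Each POST to /v4/jobGroups runs one recipe; whether the output appends to a BigQuery table is configured in the flow's output settings, not in the API call.

```python
import time
import requests

API = "https://api.clouddataprep.com/v4"
HEADERS = {"Authorization": "Bearer <access-token>"}  # placeholder token
RECIPE_IDS = [101, 102, 103]  # placeholder: one id per flow to run

def run_recipe(recipe_id):
    """Start a job group for one recipe and return its id."""
    resp = requests.post(f"{API}/jobGroups", headers=HEADERS,
                         json={"wrangledDataset": {"id": recipe_id}})
    resp.raise_for_status()
    return resp.json()["id"]

def wait_for(job_group_id, poll_seconds=30):
    """Poll the status endpoint (which returns a JSON string) until done."""
    while True:
        resp = requests.get(f"{API}/jobGroups/{job_group_id}/status",
                            headers=HEADERS)
        resp.raise_for_status()
        status = resp.json()
        if status in ("Complete", "Failed", "Canceled"):
            return status
        time.sleep(poll_seconds)

for rid in RECIPE_IDS:
    print(rid, wait_for(run_recipe(rid)))
```

Triggered from cron or Cloud Scheduler, a script like this would cover the daily runs.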
3 votes · 1 answer

How to use the Google Dataprep API with Python

Google just launched the new API (link is here). I want to know what the host is in this case, as they are using example.com and port 3005. I am also following this article, but it does not provide example code.
3 votes · 2 answers

Add dataset parameters into a column to use them later in BigQuery with DataPrep

I am importing several files from Google Cloud Storage (GCS) through Google DataPrep and store the results in tables of Google BigQuery. The structure on GCS looks something like…
3 votes · 1 answer

How do I chain multiple Google Cloud DataPrep flows?

I've created two Flows in Cloud DataPrep - the first outputs to a BigQuery table and also creates a reference dataset. The second flow takes the reference dataset and processes it further before outputting to a second BigQuery table. Is it possible…
angusham
3 votes · 1 answer

Executing a Dataflow job with multiple inputs/outputs using gcloud cli

I've designed a data transformation in Dataprep and am now attempting to run it by using the template in Dataflow. My flow has several inputs and outputs - the Dataflow template provides them as a JSON object with key/value pairs for each input &…
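For comparison, here is a sketch of launching such a template through the Dataflow REST API (projects.templates.launch) from Python instead of the gcloud CLI. The project id, template path and parameter names are placeholders; the keys under "parameters" must match the input/output names defined in the exported template.

```python
from googleapiclient.discovery import build

PROJECT = "my-project"                          # placeholder project id
TEMPLATE = "gs://my-bucket/templates/my-flow"   # placeholder template path

# Uses Application Default Credentials.
dataflow = build("dataflow", "v1b3")
response = dataflow.projects().templates().launch(
    projectId=PROJECT,
    gcsPath=TEMPLATE,
    body={
        "jobName": "dataprep-flow-run",
        "parameters": {
            # One entry per input/output the template defines, e.g.:
            "input1": "gs://my-bucket/in/a.csv",
            "input2": "gs://my-bucket/in/b.csv",
            "output1": "my_dataset.table_a",
        },
    },
).execute()
print(response["job"]["id"])
```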
3 votes · 1 answer

Google Cloud Dataprep Import Recipes

I can see that it's possible to download a recipe, but I can't see any option to import one. Does anyone know if this option exists?
3 votes · 1 answer

DataPrep: access to source filename

Is there a way to create a column with the filename of the source that created each row? Use case: I would like to track which file in a GCS bucket resulted in the creation of which row in the resulting dataset. I would like a scheduled transformation…
jldupont
3 votes · 1 answer

How to export file with headers in Google Dataprep?

I am trying to export the results of a Google Dataprep job. As you can see in the following screenshot, the columns have names or headers. However, the exported file does not include them. How can I keep those column headers in the exported CSV…
Milton
2 votes · 1 answer

Combine multiple rows into single row in Google Data Prep

I have a table which has multiple payload values in separate rows. I want to combine those rows into a single row to have all the data together. The table looks something like this: +------------+--------------+------+----+----+----+----+ | Date |…
VSR
2 votes · 1 answer

Cloud Dataprep BigQuery Upsert

Is there a way to update rows in Google BigQuery when publishing from Cloud Dataprep? I can't find anything in the documentation. I have a dataset I'm preprocessing with Dataprep that contains new rows and updated rows on every (daily) run. I would…
Andii