
Google Cloud Dataprep seems great, and we've used it to manually import static datasets; however, I would like to execute it more than once so that it can consume new files uploaded to a GCS path. I can see that you can set up a schedule for Dataprep, but I cannot see anywhere in the import setup how it would process new files.

Is this possible? It seems like an obvious need, so hopefully I've just missed something.

Matt Byrne

2 Answers


An update on this: since my question, a new release of Dataprep on Jan 23, 2018 added the ability to re-run Dataflow jobs independently of Dataprep.

When you execute a Dataprep job, it generates a Dataflow template that you can use to trigger jobs yourself in the future, and the template allows certain parameters to be passed in.

Steps to be able to trigger on new files (please note this is Beta, so Google may change the exact process):

  1. Create your flow and run your relevant flow/recipe. Iterate/repeat manually until the recipe is how you want it. When you are happy with it, run the job again (it should be a job that appends data rather than replaces it, since you likely want to append new content). It's probably a good idea to uncheck "Profile results" (a new feature) to reduce overhead, since this will be a repeatable job.
  2. Once complete, go to the Job details page and click the Export Results button; there you should see a link to the Dataflow template. Copy the text. Note that the Dataflow template path will only be available for jobs executed after the Jan 23, 2018 release, since it was a new feature.
  3. You can then see how to trigger a Dataflow job by going to Dataflow, selecting CREATE JOB FROM TEMPLATE, selecting Custom template, and pasting in your template path. There you will see the parameters you can supply, such as your GCS input path.
  4. Write a Google Cloud Function that is triggered by a GCS write and, using the details of the event, executes the template with your file path as per step 3 above (see the sketch after this list).
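A minimal sketch of step 4, assuming a background (GCS-triggered) Cloud Function in Python and the Dataflow v1b3 templates API via google-api-python-client. The project ID, template path, job-name scheme, and the `inputLocations` parameter name are placeholders; use the exact parameter names your template showed on the CREATE JOB FROM TEMPLATE screen in step 3:

```python
# main.py -- background Cloud Function deployed with a
# google.storage.object.finalize trigger on the upload bucket.
# requirements.txt: google-api-python-client
from googleapiclient.discovery import build

PROJECT_ID = "my-project"                                    # placeholder
TEMPLATE_PATH = "gs://my-bucket/dataprep/templates/my-job"   # template path copied in step 2

def launch_dataprep_template(event, context):
    """Launch the Dataprep-generated Dataflow template for a newly written GCS object."""
    input_path = f"gs://{event['bucket']}/{event['name']}"

    dataflow = build("dataflow", "v1b3")
    request = dataflow.projects().templates().launch(
        projectId=PROJECT_ID,
        gcsPath=TEMPLATE_PATH,
        body={
            # The GCS event ID keeps job names distinct and valid.
            "jobName": f"dataprep-{context.event_id}",
            # Parameter names depend on your template; "inputLocations" is an
            # assumption -- copy the exact names shown in step 3.
            "parameters": {"inputLocations": input_path},
        },
    )
    response = request.execute()
    print(f"Launched Dataflow job {response['job']['id']} for {input_path}")
```

Deployed with the `google.storage.object.finalize` trigger event on the upload bucket, every new file then results in one templated Dataflow run with that file as the input.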
Matt Byrne
  • I'd add that the March 20, 2019 update further improves this with the introduction of the `$filepath` metadata reference, meaning that if your files or directories have meaningful date or other key data in them, you could choose to drop rows that don't match certain criteria and only import a subset of the data (see https://cloud.google.com/dataprep/docs/html/Source-Metadata-References_136155148). – justbeez Mar 27 '19 at 20:14

You can add a GCS directory as a dataset by clicking on the + icon to the left of the folder during dataset import (see screenshot). When you set up a scheduled job for a flow that uses this dataset, all files in that directory (including new files) will be picked up on each scheduled job run.

[screenshot: importing a GCS directory as a dataset via the + icon next to the folder]

Lars Grammel
  • Thanks @Lars. Do you know what the general approach is to avoid dupes and append data from new files to a BigQuery table? Move/remove processed files, or do you need to do a "De-duplicate transform" step on each run (which will only get larger over time)? – Matt Byrne Dec 04 '17 at 19:36
  • Moving processed files to a different directory would be one approach. You could then set up the job to append to the BQ table. – Lars Grammel Dec 05 '17 at 08:45
  • Unfortunately not all of this can happen within Dataprep itself ... a bit limiting. I guess you could write some other cron task somewhere to use `gsutil` to clear the relevant folder at set times (a sketch of that cleanup follows below), but that assumes everything went well. I guess these are all just signs of a beta product. – Matt Byrne Dec 14 '17 at 02:23
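A minimal sketch of the "move processed files" cleanup discussed in the comments above, using the google-cloud-storage client; the bucket name and the `incoming/` / `processed/` prefixes are illustrative, and it could run from a small cron task or another Cloud Function once a job has finished successfully:

```python
# Move already-processed files out of the watched prefix so the next
# scheduled Dataprep run (or appending job) only sees new uploads.
# requirements.txt: google-cloud-storage
from google.cloud import storage

BUCKET = "my-dataprep-bucket"   # placeholder
SRC_PREFIX = "incoming/"        # the prefix the Dataprep dataset points at
DST_PREFIX = "processed/"       # where handled files are parked

def archive_processed_files():
    client = storage.Client()
    bucket = client.bucket(BUCKET)
    for blob in client.list_blobs(BUCKET, prefix=SRC_PREFIX):
        if blob.name == SRC_PREFIX:
            continue  # skip the zero-byte "folder" placeholder object, if present
        new_name = DST_PREFIX + blob.name[len(SRC_PREFIX):]
        # GCS has no true "move": copy to the new name, then delete the original.
        bucket.copy_blob(blob, bucket, new_name)
        blob.delete()
        print(f"Moved gs://{BUCKET}/{blob.name} -> gs://{BUCKET}/{new_name}")

if __name__ == "__main__":
    archive_processed_files()
```

As the comments note, this has to happen outside Dataprep itself; running it only after the corresponding job succeeds avoids archiving files that were never actually processed.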