
Using Python, is there any way to add an extra field while loading a CSV file into BigQuery? I'd like to add a date_loaded field with the current date.

The Google code example I have used:

# from google.cloud import bigquery
# client = bigquery.Client()
# dataset_id = 'my_dataset'

dataset_ref = client.dataset(dataset_id)
job_config = bigquery.LoadJobConfig()
job_config.schema = [
    bigquery.SchemaField('name', 'STRING'),
    bigquery.SchemaField('post_abbr', 'STRING')
]
job_config.skip_leading_rows = 1    
# The source format defaults to CSV, so the line below is optional.
job_config.source_format = bigquery.SourceFormat.CSV
uri = 'gs://cloud-samples-data/bigquery/us-states/us-states.csv'
load_job = client.load_table_from_uri(
    uri,
    dataset_ref.table('us_states'),
    job_config=job_config)  # API request
print('Starting job {}'.format(load_job.job_id))

load_job.result()  # Waits for table load to complete.
print('Job finished.')

destination_table = client.get_table(dataset_ref.table('us_states'))
print('Loaded {} rows.'.format(destination_table.num_rows))
mez63
  • Do [date partitioned tables](https://cloud.google.com/bigquery/docs/partitioned-tables#ingestion-time_partitioned_tables) work for you (a minimal sketch follows these comments)? If not, maybe a better approach would be to use [apache beam](https://beam.apache.org/documentation/sdks/python/) instead. If that still doesn't work, then the only way out I see is to bring the data down locally, iterate over it and add the date field. If you are working with lots of data this is not recommended, though. – Willian Fuks Nov 21 '18 at 15:16
  • ...or load it into a staging/tmp table in BigQuery, then hit it with SQL and add the `date_loaded` field as part of that SQL transform. Write the results to your main table. If you use an ingestion-based partitioned table, just be aware that it's in UTC unless you address the partition directly (https://cloud.google.com/bigquery/docs/querying-partitioned-tables#querying_ingestion-time_partitioned_tables_using_time_zones) – Graham Polley Nov 22 '18 at 12:15
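As a minimal sketch of the ingestion-time partitioning idea from the first comment (placeholder dataset and table names; it assumes a google-cloud-bigquery version that exposes TimePartitioning), the load job can be marked as day-partitioned and the load date read back later from the _PARTITIONTIME pseudo-column:

from google.cloud import bigquery

client = bigquery.Client()
dataset_ref = client.dataset('my_dataset')         # placeholder dataset
table_ref = dataset_ref.table('us_states_part')    # placeholder table

job_config = bigquery.LoadJobConfig()
job_config.schema = [
    bigquery.SchemaField('name', 'STRING'),
    bigquery.SchemaField('post_abbr', 'STRING')
]
job_config.skip_leading_rows = 1
job_config.source_format = bigquery.SourceFormat.CSV
# Make the destination an ingestion-time partitioned table (day granularity)
job_config.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY)

uri = 'gs://cloud-samples-data/bigquery/us-states/us-states.csv'
load_job = client.load_table_from_uri(uri, table_ref, job_config=job_config)
load_job.result()  # Waits for table load to complete.

# The load date is available from the _PARTITIONTIME pseudo-column,
# so no explicit date_loaded column has to be stored.
query = (
    'SELECT name, post_abbr, DATE(_PARTITIONTIME) AS date_loaded '
    'FROM `my_dataset.us_states_part`')
for row in client.query(query).result():
    print(row.name, row.post_abbr, row.date_loaded)

As the second comment points out, ingestion-time partitions are in UTC, so the derived date_loaded can differ from your local date around midnight.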

2 Answers


By modifying this Python example to fit your issue, you open and read the original CSV file from your local PC, then edit it by adding a column and appending a timestamp to the end of each line so that the new column is not empty. This link explains how to get a timestamp in Python with a custom date and time format.

Then you write the resulting data to an output file and load it to Google Cloud Storage. Here you can find information on how to run external commands from a Python file.

I hope this helps.

#Import the dependencies
import csv,datetime,subprocess
from google.cloud import bigquery

#Replace the values for variables with the appropriate ones
#Name of the input csv file
csv_in_name = 'us-states.csv'
#Name of the output csv file to avoid messing up the original
csv_out_name = 'out_file_us-states.csv'
#Name of the NEW COLUMN NAME to be added
new_col_name = 'date_loaded'
#Type of the new column
col_type = 'DATETIME'
#Name of your bucket
bucket_id = 'YOUR BUCKET ID'
#Your dataset name
ds_id = 'YOUR DATASET ID'
#The destination table name
destination_table_name = 'TABLE NAME'


# read and write csv files
with open(csv_in_name,'r') as r_csvfile:
    with open(csv_out_name,'w') as w_csvfile:

        dict_reader = csv.DictReader(r_csvfile,delimiter=',')
        #add new column with existing
        fieldnames = dict_reader.fieldnames + [new_col_name]
        writer_csv = csv.DictWriter(w_csvfile,fieldnames,delimiter=',')
        writer_csv.writeheader()


        for row in dict_reader:
            #Put the timestamp after the last comma so that the column is not empty
            row[new_col_name] = datetime.datetime.now()
            writer_csv.writerow(row)

#Copy the file to your Google Storage bucket
subprocess.call('gsutil cp ' + csv_out_name + ' gs://' + bucket_id , shell=True)


client = bigquery.Client()

dataset_ref = client.dataset(ds_id)
job_config = bigquery.LoadJobConfig()
#Add a new column to the schema!
job_config.schema = [
    bigquery.SchemaField('name', 'STRING'),
    bigquery.SchemaField('post_abbr', 'STRING'),
    bigquery.SchemaField(new_col_name, col_type)
]
job_config.skip_leading_rows = 1
# The source format defaults to CSV, so the line below is optional.
job_config.source_format = bigquery.SourceFormat.CSV
#Address string of the output csv file
uri = 'gs://' + bucket_id + '/' + csv_out_name
load_job = client.load_table_from_uri(uri,dataset_ref.table(destination_table_name),job_config=job_config)  # API request
print('Starting job {}'.format(load_job.job_id))

load_job.result()  # Waits for table load to complete.
print('Job finished.')

destination_table = client.get_table(dataset_ref.table(destination_table_name))
print('Loaded {} rows.'.format(destination_table.num_rows))
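A small aside on the timestamp value: str(datetime.datetime.now()) produces a value in the form YYYY-MM-DD HH:MM:SS.ffffff, which BigQuery's CSV loader accepts for a DATETIME column. If you want to control the format (for example, to drop the microseconds), you could format it explicitly before writing the row:

#Format the timestamp yourself before writing it to the CSV row
row[new_col_name] = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')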
Fim

You can keep loading your data the way you are loading it now, but into a table called old_table.

Once loaded, you can run something like:

bq --location=US query --destination_table mydataset.newtable --use_legacy_sql=false --replace=true 'select *, current_date() as date_loaded from mydataset.old_table'

This basically loads the content of the old table, with a new date_loaded column appended at the end, into the new table. This way you get the new column without downloading the data locally or any of the mess.
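If you would rather stay in Python than shell out to the bq CLI, roughly the same transform can be expressed with the BigQuery client library (a sketch reusing the dataset and table names from the command above):

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.QueryJobConfig()
# Equivalent of --destination_table and --replace=true in the bq command above
job_config.destination = client.dataset('mydataset').table('newtable')
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE

sql = 'SELECT *, CURRENT_DATE() AS date_loaded FROM `mydataset.old_table`'
query_job = client.query(sql, job_config=job_config)  # Standard SQL by default
query_job.result()  # Waits for the query to finish.
print('Wrote query results to mydataset.newtable')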

khan