
As per the title, I am trying to store pickled Python objects in Snowflake, to get them back again at a later date. Help on this would be much appreciated:

Snowflake table definition:

CREATE OR REPLACE TABLE <db>.<schema>.TESTING_MEMORY (
    MODEL_DATETIME DATETIME,
    SCALARS VARIANT
);

Python code:

import numpy as np
import pandas as pd
import pickle
from datetime import datetime
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas
from sklearn.preprocessing import StandardScaler

def create_snowflake_connection():
    conn = snowflake.connector.connect(
        user='<username>',
        account='<account>',
        password = '<password>',
        warehouse='<wh>',
        database='<db>',
        role='<role>',
        schema='<schema>'
    )
    
    return conn

memory = {}

np.random.seed(78)
df = pd.DataFrame({
    'x1': np.random.normal(0, 2, 10000),
    'x2': np.random.normal(5, 3, 10000),
    'x3': np.random.normal(-5, 5, 10000)
})

scaler = StandardScaler()
scaler.fit(df)

scaled_df = scaler.transform(df)
scaled_df = pd.DataFrame(scaled_df, columns=['x1', 'x2', 'x3'])


memory['SCALARS'] = pickle.dumps(scaler)
    

ctx = create_snowflake_connection()


# Write to snowflake
db_dat = pd.DataFrame([list(memory.values())], columns=list(memory.keys()))
db_dat.insert(0, 'MODEL_DATETIME', datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f"))
success, nchunks, nrows, _ = write_pandas(conn=ctx, df = db_dat, table_name = 'TESTING_MEMORY')
  
# retrieve from snowflake
cur = ctx.cursor()
sql = """
        SELECT hex_encode(SCALARS)
        FROM <db>.<schema>.TESTING_MEMORY
        QUALIFY ROW_NUMBER() OVER (ORDER BY MODEL_DATETIME DESC) = 1
"""
cur.execute(sql) 
returned = cur.fetch_pandas_all() 


cur.close()
ctx.close()
user1420372

2 Answers


There is probably a better way to do this (disclaimer: I am new to Python), but this seems to work and is based on the answer here: How can I pickle a python object into a csv file?

  1. Change the SQL table definition

    CREATE OR REPLACE TABLE db.schema.TESTING_MEMORY (
    MODEL_DATETIME DATETIME,
    SCALARS VARCHAR
    );
    

  2. Changes to Python code - general

import base64

  3. Changes to Python code (write to Snowflake section above)

# Write to snowflake
db_dat = pd.DataFrame([list(memory.values())], columns=list(memory.keys()))
db_dat.insert(0, 'MODEL_DATETIME', datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f"))

# base64-encode the pickled bytes so they can be stored in a VARCHAR column
pickled_columns = ['SCALARS']
for column in pickled_columns:
    b64_bytes = base64.b64encode(db_dat[column].values[0])
    db_dat[column] = b64_bytes.decode('utf8')

success, nchunks, nrows, _ = write_pandas(conn=ctx, df=db_dat, table_name='TESTING_MEMORY')
  4. Changes to Python code - retrieve from Snowflake

    cur = ctx.cursor() 
    
    sql = """
         SELECT *
         FROM db.schema.TESTING_MEMORY
         QUALIFY ROW_NUMBER() OVER (ORDER BY MODEL_DATETIME DESC) = 1
    """
    
    cur.execute(sql) 
    returned = cur.fetch_pandas_all() 
    
    # decode back to the original pickled bytes
    for column in pickled_columns:
        returned[column] = base64.b64decode(returned[column].values[0])
    
    
    new_dict = returned.to_dict('list')
    for key,val in new_dict.items():
        new_dict[key] = val[0]
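Then, to actually get the scaler back, unpickle the decoded bytes (a short follow-up sketch, not part of the original answer, assuming the write step above succeeded):

    # new_dict['SCALARS'] now holds the original pickled bytes,
    # so unpickling restores the fitted StandardScaler
    scaler = pickle.loads(new_dict['SCALARS'])
    print(scaler.mean_)  # the means learned from df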
    
user1420372
    This option works but, in the answer you linked to, the reason they use base64 is that it is a csv file which only accepts text like utf-8 / ascii etc. In Snowflake, you can use a `binary` datatype instead which is more efficient. – Simon Darr Sep 21 '20 at 17:11

Seems like you're trying to put a Python bytes object into a Snowflake VARIANT, which won't work for you.

This answer is similar to what the other answer here suggests except, rather than using a VARCHAR field to store base64-encoded binary, it uses a BINARY type instead. Base64 encoding is around 30% larger than raw binary from what I've read somewhere.
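As a quick sanity check on that size claim (a standalone sketch, not part of the original answer):

import base64
import os

raw = os.urandom(1_000_000)        # stand-in for a pickled model
b64 = base64.b64encode(raw)

print(len(b64) / len(raw))         # ~1.33: base64 is about a third larger than binary
print(len(raw.hex()) / len(raw))   # 2.0: the hex text doubles it, but TO_BINARY(..., 'HEX') stores raw bytes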

Create the table with binary data type:

create or replace table testdb.public.test_table (obj binary);

Hex encode the pickled object, write it, read it back and call a method on it:

import pickle
import snowflake.connector

# This is the object we're going to store in Snowflake as binary
class PickleMe:
    def __init__(self, first_name, last_name):
        self.first_name = first_name
        self.last_name = last_name

    def say_hello(self):
        print(f'Hi there, {self.first_name} {self.last_name}')

# Create the object and store it as hex in the 'hex_person' variable
person = PickleMe('John', 'Doe')
hex_person = pickle.dumps(person).hex()

with snowflake.connector.connect(
    user="username",
    password="password",
    account="snowflake_account_deets",
    warehouse="warehouse_name",
) as con:
    # Write pickled object into table as binary
    con.cursor().execute(f"INSERT INTO testdb.public.test_table values(to_binary('{hex_person}', 'HEX'))")

    # Now get the object back and put it into the 'obj' variable
    # (the connector returns the BINARY column as raw bytes)
    (obj,) = con.cursor().execute("select obj from testdb.public.test_table").fetchone()

    # Deserialise object and call method on it
    person_obj = pickle.loads(obj)
    person_obj.say_hello()

The output of the above is

Hi there, John Doe

Simon Darr
  • Hi @SimonD - on a larger set of data, I am getting an error that the binary value is too long (it was a birch model from sklearn)... the idea was to store the serialised model versions in the database. Do you have any workarounds for this? I just posted as another question here: https://stackoverflow.com/questions/64056846/efficient-way-to-store-sklearn-birch-model-in-snowflake-and-bypass-8mb-binary-l – user1420372 Sep 25 '20 at 02:17
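For what it's worth, one common mitigation (my own suggestion, not from this thread) is to compress the pickle before hex-encoding it; a minimal sketch building on the code above:

import pickle
import zlib

# Compressing before storing often shrinks sklearn pickles substantially,
# though a very large Birch model may still exceed Snowflake's 8 MB BINARY limit
compressed = zlib.compress(pickle.dumps(person), level=9)
hex_payload = compressed.hex()
# ... INSERT ... to_binary('<hex_payload>', 'HEX') as in the answer above ...

# when reading back, decompress before unpickling
restored = pickle.loads(zlib.decompress(compressed))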