
I have created a database using sqlite3 in python that has thousands of tables. Each of these tables contains thousands of rows and ten columns. One of the columns is the date and time of an event: it is a string that is formatted as YYYY-mm-dd HH:MM:SS, which I have defined to be the primary key for each table. Every so often, I collect some new data (hundreds of rows) for each of these tables. Each new dataset is pulled from a server and loaded in directly as a pandas data frame or is stored as a CSV file. The new data contains the same ten columns as my original data. I need to update the tables in my database using this new data in the following way:

  1. Given a table in my database, for each row in the new dataset, if the date and time of the row matches the date and time of an existing row in my database, update the remaining columns of that row using the values in the new dataset.
  2. If the date and time does not yet exist, create a new row and insert it into my database.

Below are my questions:

  1. I've done some searching on Google and it looks like I should be using the UPSERT (merge) functionality of sqlite but I can't seem to find any examples showing how to use it. Is there an actual UPSERT command, and if so, could someone please provide an example (preferably with sqlite3 in Python) or point me to a helpful resource?
  2. Also, is there a way to do this in bulk so that I can UPSERT each new dataset into my database without having to go row by row? (I found this link, which suggests that it is possible, but I'm new to using databases and am not sure how to actually run the UPSERT command.)
  3. Can UPSERT also be performed directly using pandas.DataFrame.to_sql?

My backup solution is loading the table to be UPSERTed using pd.read_sql_query("SELECT * from table", con), performing a pandas.DataFrame.merge, deleting that table from the database, and then writing the updated table back using pd.DataFrame.to_sql (but this would be inefficient).
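
Roughly, that fallback would look something like the sketch below (the table, column, and file names are placeholders, and the frames are combined with concat/drop_duplicates rather than merge, but the idea is the same):

import sqlite3
import pandas as pd

con = sqlite3.connect("my_database.db")                   # placeholder path

# Existing rows for one table, keyed by the datetime string
old_df = pd.read_sql_query("SELECT * FROM my_table", con)

# Freshly pulled dataset with the same ten columns
new_df = pd.read_csv("new_data.csv")

# Keep the new value whenever a datetime appears in both frames,
# and append rows whose datetime is not yet in the table
merged = (pd.concat([old_df, new_df])
            .drop_duplicates(subset="event_time", keep="last"))

# Drop the old table and write the merged result back
# (note: "replace" recreates the table, so the PRIMARY KEY definition would be lost)
merged.to_sql("my_table", con, if_exists="replace", index=False)
con.close()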

Vivek Subramanian
  • Does this answer your question? [SQLite - UPSERT \*not\* INSERT or REPLACE](https://stackoverflow.com/questions/418898/sqlite-upsert-not-insert-or-replace) – sophros Feb 14 '20 at 06:16
  • There are examples in the documentation you linked to... – Shawn Feb 14 '20 at 09:12
  • @sophros, I don't think the question you linked answers my question because (1) I'm still curious about whether there is an explicit `upsert` / `merge` command I should be using and (2) the answer shows an iterative solution but I'm looking for a way to do `upsert` in bulk. – Vivek Subramanian Feb 14 '20 at 14:07
  • @Shawn, the documentation doesn't feature a specific call to an `upsert` command. Is this something that you just implement as they have done with an `insert` and `on conflict` clause? – Vivek Subramanian Feb 14 '20 at 14:07
  • There's no actual statement called upsert.... – Shawn Feb 14 '20 at 16:59

2 Answers


Instead of going through an UPSERT command, why don't you write your own routine that looks up each date and time, updates the row if it is found, and otherwise inserts a new row? Check out the code I wrote for you below, and let me know if you are still confused. You can even do this for hundreds of tables by replacing the table name in the routine with a variable and looping over your list of table names (see the sketch after the code).

import sqlite3
import pandas as pd

connection_str = "my_database.db"               # Path to your SQLite database file
csv_data = pd.read_csv("my_CSV_file.csv")       # Path to your CSV data

def manual_upsert():
    con = sqlite3.connect(connection_str)
    cur = con.cursor()
    cur.execute("SELECT * FROM my_CSV_data")    # Read the existing rows from the database table
    data = cur.fetchall()

    old_data_list = []                          # Collection of all dates already in the database table
    for line in data:
        old_data_list.append(line[0])           # Assumes your date column is at index 0

    for new_data in csv_data.itertuples(index=False):   # Iterate over the rows of the CSV
        if new_data[0] in old_data_list:
            # Update the remaining columns of the row whose date already exists
            cur.execute("UPDATE my_CSV_data SET column1=?, column2=?, column3=? WHERE my_date_column=?",
                        (new_data[1], new_data[2], new_data[3], new_data[0]))
        else:
            # Insert a new row if the date is not found
            cur.execute("INSERT INTO my_CSV_data VALUES(?,?,?,?)",
                        (new_data[0], new_data[1], new_data[2], new_data[3]))
    con.commit()
    con.close()


manual_upsert()
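
To handle many tables, a rough sketch of the "table name as a variable" idea (the table-name list and date column below are placeholders) could look like this, with the same UPDATE / INSERT checks applied inside the loop:

import sqlite3

con = sqlite3.connect("my_database.db")
cur = con.cursor()

my_table_names = ["table_001", "table_002", "table_003"]    # Placeholder list of your table names

for table_name in my_table_names:
    # Table names cannot be bound as ? parameters, so substitute them into the SQL text
    cur.execute(f"SELECT my_date_column FROM {table_name}")
    existing_dates = {row[0] for row in cur.fetchall()}
    # ...then run the same UPDATE / INSERT logic as above against this table...

con.commit()
con.close()
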
  • Thanks for your answer. The only thing is, this checks `new_data` line by line rather than performing the upsert for all lines in bulk. I'm looking into `transaction`s, as suggested by @sophros to see if this might be a good path. Do you have any other suggestions? – Vivek Subramanian Feb 14 '20 at 13:52
  • Sir, both situations are the same. In every case, either SQL or Python has to go through every row and check whether there is a conflict. I don't think there is any method to do this in bulk. Open the connection once, check / update / insert all of your data, and then commit and close the connection. This is super fast with sqlite3 and won't take much time. – Zeeshan Qureshi Feb 14 '20 at 16:47
  • Thanks for the comment. Check out the [SQLite FAQ](https://sqlite.org/faq.html#q19). There, they state that performing multiple insertions is inefficient compared to grouping them into a single transaction. In the former case, each insertion is treated as a separate transaction while in the latter, they are all treated as one transaction. SQLite can handle tens of thousands of insertions per second if grouped into a transaction while it can only do tens of insertions per second if they are executed individually. Also see [this](https://youtu.be/azh2FpZfSUs?t=295). – Vivek Subramanian Feb 15 '20 at 21:27
  • What do you mean by a transaction in sqlite3? It is all about opening a connection once, running thousands of queries, then committing your changes and finally closing the connection. That is another way to define a transaction. Go through my code; this is what I did there: opened the connection once, performed all of my tasks (updating and inserting), and then committed the changes. Sir, please at least try out my code, and then tell me whether it is inefficient. I use it all the time. – Zeeshan Qureshi Feb 16 '20 at 05:29
  • You're right. I was mistaken. I misread your code and thought each `execute()` command was followed by a `commit()`. To clarify, are all queries that are between two `commit()`s part of the same transaction? Is there no need for a `BEGIN` statement when using `sqlite3` in Python? – Vivek Subramanian Feb 16 '20 at 05:40
  • BEGIN is needed when you are executing SQLite commands through the console or some other program. There is no need for it when you go through the Python library: it is enough to create a cursor, run all of your queries with that same cursor, and commit the changes at the end. It works well. I personally compared sqlite3 and MS SQL for inserting and updating over a billion rows; sqlite3 is far more efficient than MS SQL when everything runs in a single transaction. So go ahead with closed eyes. There won't be any problem, and if there is, let me know. – Zeeshan Qureshi Feb 16 '20 at 05:47
  • Thank you for your explanations and your patience. – Vivek Subramanian Feb 16 '20 at 05:51

First, even though the questions are related, ask them separately in the future.

  1. There is documentation on UPSERT handling in SQLite that describes how to use it, but it is a bit abstract. You can check examples and discussion here: [SQLite - UPSERT *not* INSERT or REPLACE](https://stackoverflow.com/questions/418898/sqlite-upsert-not-insert-or-replace)

  2. Use a transaction and the statements will be executed in bulk (see the sketch after this list).

  3. As the presence of this library suggests, to_sql does not create UPSERT commands (only INSERT).
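
For illustration, here is a minimal sketch of points 1 and 2 combined, assuming SQLite 3.24+ (required for ON CONFLICT ... DO UPDATE) and a table named my_table whose primary-key column is event_time with three other columns col1, col2, col3 (all names are placeholders):

import sqlite3
import pandas as pd

new_df = pd.read_csv("new_data.csv")     # the new rows pulled from the server

con = sqlite3.connect("my_database.db")
with con:                                # one transaction: commits on success, rolls back on error
    con.executemany(
        """
        INSERT INTO my_table (event_time, col1, col2, col3)
        VALUES (?, ?, ?, ?)
        ON CONFLICT(event_time) DO UPDATE SET
            col1 = excluded.col1,
            col2 = excluded.col2,
            col3 = excluded.col3
        """,
        new_df.itertuples(index=False, name=None),
    )
con.close()

Here executemany runs all of the statements on one connection and the surrounding with block commits them as a single transaction, so no row-by-row commits are needed.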

sophros
  • Thanks, @sophros. From what I gather, it looks like `upsert` is just a combination of `insert` and `on conflict`. In the documentation, they give an example `INSERT INTO vocabulary(word) VALUES('jovial') ON CONFLICT(word) DO UPDATE SET count=count+1;`. Is this how you would use it? – Vivek Subramanian Feb 14 '20 at 14:11
  • If so, I will look into how the `transaction` function works to see how I can use the above statement in bulk. – Vivek Subramanian Feb 14 '20 at 14:11
  • Yes, precisely. And transactions are not functions. You start a transaction, run as many SQL commands as you wish (almost), and commit the transaction (which makes the actual changes in the database). Otherwise, each individual command is a transaction, which makes it much less efficient. – sophros Feb 14 '20 at 14:18
  • @VivekSubramanian: as a token of appreciation please consider accepting one of the answers (mine or any other). There is a tick sign on the left of the answer to accept it. You can also upvote it if you haven't yet. – sophros Feb 15 '20 at 05:01
  • Thank you. Your answer and clarifying comment were both very clear and helpful. – Vivek Subramanian Feb 15 '20 at 21:29
  • Happy to help! Thanks! – sophros Feb 15 '20 at 22:06