
I'm trying to do a bulk update but it fails. This is what my table looks like:

class Hashes(db.Model):
    __tablename__ = 'Hashes'
    id = db.Column(db.Integer, primary_key=True)
    hash_val = db.Column(db.String(1024), unique=True)
    hash_salt = db.Column(db.String(256))
    hash_plain = db.Column(db.String(256))

The objects are represented as a list of dictionaries:

[
    {'hash_val': '40350254ba198f1efcc9f8dc042fd15b', 'hash_plain': '287742velornesjo'},
    {'hash_val': 'a75b1ef3e16f0a5cae736e48137d7c8b', 'hash_plain': 'Mister King'},
    ...
]

And this is how I'm trying to save them to the DB:

dbobjects = [
    Hashes(hash_val=x['hash_val'], hash_plain=x['hash_plain']) for x in hash_good
]
db.session.bulk_insert_objects(dbobjects, update_changed_only=True)
db.session.commit()

The error I'm getting is an IntegrityError, and from the full traceback it seems that I'm not generating any UPDATE statements, only INSERTs. How can I fix that?

Traceback (most recent call last):
  File "/home/user/envs/project/lib/python2.7/site-packages/flask/app.py", line 1997, in __call__
    return self.wsgi_app(environ, start_response)
  File "/home/user/envs/project/lib/python2.7/site-packages/flask/app.py", line 1985, in wsgi_app
    response = self.handle_exception(e)
  File "/home/user/envs/project/lib/python2.7/site-packages/flask/app.py", line 1540, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/home/user/envs/project/lib/python2.7/site-packages/flask/app.py", line 1982, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/user/envs/project/lib/python2.7/site-packages/flask/app.py", line 1614, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/home/user/envs/project/lib/python2.7/site-packages/flask/app.py", line 1517, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/home/user/envs/project/lib/python2.7/site-packages/flask/app.py", line 1612, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/user/envs/project/lib/python2.7/site-packages/flask_debugtoolbar/__init__.py", line 125, in dispatch_request
    return view_func(**req.view_args)
  File "/app/views.py", line 317, in upload
    db.session.bulk_save_objects(dbobjects, update_changed_only=True)
  File "/home/user/envs/project/lib/python2.7/site-packages/sqlalchemy/orm/scoping.py", line 153, in do
    return getattr(self.registry(), name)(*args, **kwargs)
  File "/home/user/envs/project/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 2461, in bulk_save_objects
    return_defaults, update_changed_only, False)
  File "/home/user/envs/project/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 2625, in _bulk_save_mappings
    transaction.rollback(_capture_exception=True)
  File "/home/user/envs/project/lib/python2.7/site-packages/sqlalchemy/util/langhelpers.py", line 66, in __exit__
    compat.reraise(exc_type, exc_value, exc_tb)
  File "/home/user/envs/project/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 2620, in _bulk_save_mappings
    isstates, return_defaults, render_nulls)
  File "/home/user/envs/project/lib/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 69, in _bulk_insert
    bookkeeping=return_defaults)
  File "/home/user/envs/project/lib/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 830, in _emit_insert_statements
    execute(statement, multiparams)
  File "/home/user/envs/project/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 948, in execute
    return meth(self, multiparams, params)
  File "/home/user/envs/project/lib/python2.7/site-packages/sqlalchemy/sql/elements.py", line 269, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/home/user/envs/project/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1060, in _execute_clauseelement
    compiled_sql, distilled_params
  File "/home/user/envs/project/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1200, in _execute_context
    context)
  File "/home/user/envs/project/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1413, in _handle_dbapi_exception
    exc_info
  File "/home/user/envs/project/lib/python2.7/site-packages/sqlalchemy/util/compat.py", line 203, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/home/user/envs/project/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1170, in _execute_context
    context)
  File "/home/user/envs/project/lib/python2.7/site-packages/sqlalchemy/dialects/mysql/mysqldb.py", line 105, in do_executemany
    rowcount = cursor.executemany(statement, parameters)
  File "/home/user/envs/project/lib/python2.7/site-packages/pymysql/cursors.py", line 192, in executemany
    self._get_db().encoding)
  File "/home/user/envs/project/lib/python2.7/site-packages/pymysql/cursors.py", line 229, in _do_execute_many
    rows += self.execute(sql + postfix)
  File "/home/user/envs/project/lib/python2.7/site-packages/pymysql/cursors.py", line 165, in execute
    result = self._query(query)
  File "/home/user/envs/project/lib/python2.7/site-packages/pymysql/cursors.py", line 321, in _query
    conn.query(q)
  File "/home/user/envs/project/lib/python2.7/site-packages/pymysql/connections.py", line 860, in query
    self._affected_rows = self._read_query_result(unbuffered=unbuffered)
  File "/home/user/envs/project/lib/python2.7/site-packages/pymysql/connections.py", line 1061, in _read_query_result
    result.read()
  File "/home/user/envs/project/lib/python2.7/site-packages/pymysql/connections.py", line 1349, in read
    first_packet = self.connection._read_packet()
  File "/home/user/envs/project/lib/python2.7/site-packages/pymysql/connections.py", line 1018, in _read_packet
    packet.check_error()
  File "/home/user/envs/project/lib/python2.7/site-packages/pymysql/connections.py", line 384, in check_error
    err.raise_mysql_exception(self._data)
  File "/home/user/envs/project/lib/python2.7/site-packages/pymysql/err.py", line 107, in raise_mysql_exception
    raise errorclass(errno, errval)
IntegrityError: (pymysql.err.IntegrityError) (1062, u"Duplicate entry '40350254ba198f1efcc9f8dc042fd15b' for key 'hash_val'") [SQL: u'INSERT INTO `Hashes` (hash_val, hash_plain) VALUES (%(hash_val)s, %(hash_plain)s)'] [parameters: ({'hash_val': '40350254ba198f1efcc9f8dc042fd15b', 'hash_plain': '287742velornesjo'}, {'hash_val': 'a75b1ef3e16f0a5cae736e48137d7c8b', 'hash_plain': 'Mister King'}, {'hash_val': '8ecd03e1fe66c8ee543ff048298af20c', 'hash_plain': 'Farringdon456@'}, {'hash_val': '932d08bfbf87d40ed903f7629a8b3afe', 'hash_plain': '2chilledwater'}, {'hash_val': '8eb47a40e5fdc0e7818d32e0c7fba5b9', 'hash_plain': '1867327El10'}, {'hash_val': '06516de2e5a707621fa6d847e667ce84', 'hash_plain': 'wisky_chocolat$'}, {'hash_val': '7fd1a5943d5b47d98190fc9888131669', 'hash_plain': 'apolloniaime'}, {'hash_val': '93060afe8e7d49dcab6f12c3a9dd146a', 'hash_plain': 'maronmilan89'} ... displaying 10 of 33 total bound parameter sets ... {'hash_val': '14a356c3c4c3775ab183ea24c6a527ce', 'hash_plain': 'dik14424905'}, {'hash_val': 'a2abb9bdff9c730888e64129b89d36aa', 'hash_plain': 'imago2798'})] (Background on this error at: http://sqlalche.me/e/gkpj)
zerocool
  • From the looks of it you're saving new objects only, so why do you expect to see UPDATEs? – Ilja Everilä Apr 03 '18 at 08:26
  • I thought this *update_changed_only=True* option would do it – zerocool Apr 03 '18 at 08:37
  • It controls whether, for [*persistent objects*](http://docs.sqlalchemy.org/en/latest/orm/session_state_management.html), only changed attributes should be set. Also there is no `Session.bulk_insert_objects()` method. The unique constrained column does not matter: only the primary key is used to determine if an object / mapping might have a row in the DB. – Ilja Everilä Apr 03 '18 at 08:45
  • So I have to perform a SELECT for the IDs and update my dictionaries to contain hash_val, hash_plain and id from the Hashes table first? – zerocool Apr 03 '18 at 09:09
  • You'd query the existing objects, update them, create what's left and bulk save the combination of those, if using `bulk_save_objects`. – Ilja Everilä Apr 03 '18 at 09:28
  • OK, but I know upfront that there will be no new rows, only existing ones whose hash_plain should be updated. And because there will be a lot of them (above 200,000), I think a bulk update would be the fastest option here; I just can't figure this out properly. Could you share any example code? – zerocool Apr 03 '18 at 09:45
  • You'll find an example of a parameterized executemany UPDATE here: http://docs.sqlalchemy.org/en/latest/core/tutorial.html#inserts-updates-and-deletes – Ilja Everilä Apr 03 '18 at 10:30

2 Answers


Following the suggestions given by Ilja E. in the comments, I used the following solution:

from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

engine = create_engine(app.config['SQLALCHEMY_DATABASE_URI'], echo=False)
DBSession = scoped_session(sessionmaker())
DBSession.remove()
DBSession.configure(bind=engine, autoflush=False, expire_on_commit=False)
DBSession.bulk_update_mappings(Hashes, z)
DBSession.commit()

Here `z` is my new list containing the merged dictionaries. One problem I had was that SQLAlchemy needs the primary key in each mapping for the bulk update, so I had to prepare two lists of dictionaries and merge them. The first one is known upfront:

[
    {'hash_val': '40350254ba198f1efcc9f8dc042fd15b', 'hash_plain': '287742velornesjo'},
    {'hash_val': 'a75b1ef3e16f0a5cae736e48137d7c8b', 'hash_plain': 'Mister King'},
]

The second one, produced by

ids = db.session.query(Hashes.id, Hashes.hash_val).filter(
    Hashes.hash_val.in_(x['hash_val'] for x in hash_good)
).all()

gave the following values:

[
    {'hash_val': '40350254ba198f1efcc9f8dc042fd15b', 'id': '10'},
    {'hash_val': 'a75b1ef3e16f0a5cae736e48137d7c8b', 'id': '11'},
]

Then the merge:

def merge_lists(l1, l2, key):
    """Return a new list with the dictionaries from l1 and l2 merged on the common key."""
    # Based on: https://mmxgroup.net/2012/04/12/merging-python-list-of-dictionaries-based-on-specific-key/
    merged = {}
    for item in l1 + l2:
        if item[key] in merged:
            merged[item[key]].update(item)
        else:
            merged[item[key]] = item
    return [val for (_, val) in merged.items()]
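For completeness, the rows returned by the query still have to be turned into dictionaries before merging. A minimal sketch of that glue code (the row-to-dict conversion and the variable names are illustrative assumptions, not the original code):

# Hypothetical glue code: convert the (id, hash_val) rows from the query
# into dictionaries, then merge them with hash_good on the shared key.
ids_as_dicts = [{'id': row.id, 'hash_val': row.hash_val} for row in ids]
z = merge_lists(hash_good, ids_as_dicts, 'hash_val')  # passed to bulk_update_mappings above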

That's it. The UPDATE statements work; perhaps this can be improved further performance-wise.

EDIT: again based on the comments and great support from Ilja, here is a much cleaner and faster solution which does not require any querying for IDs or merging of dictionaries. Here is the list of dictionaries:

[
    {'b_hash_val': '40350254ba198f1efcc9f8dc042fd15b', 'b_hash_plain': '287742velornesjo'},
    {'b_hash_val': 'a75b1ef3e16f0a5cae736e48137d7c8b', 'b_hash_plain': 'Mister King'},
]

and here is the clean UPDATE statement:

from sqlalchemy import create_engine, bindparam

engine = create_engine(app.config['SQLALCHEMY_DATABASE_URI'], echo=False)
conn = engine.connect()
stmt = Hashes.__table__.update().\
    where(Hashes.hash_val == bindparam('b_hash_val')).\
    values({
        'hash_plain': bindparam('b_hash_plain')
    })
conn.execute(stmt, hash_good)
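Note that, as Ilja points out in the comments below, the keys in hash_good have to be renamed to the b_-prefixed names before the executemany call, because SQLAlchemy reserves the plain column names for the value binds in the SET clause. A minimal sketch of that renaming step, assuming the original hash_val/hash_plain keys:

# Rename keys so the bind parameter names don't collide with the column
# names that SQLAlchemy reserves for the values being SET.
hash_good = [
    {'b_hash_val': d['hash_val'], 'b_hash_plain': d['hash_plain']}
    for d in hash_good
]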
zerocool
  • Have a look at https://stackoverflow.com/questions/25694234/bulk-update-in-sqlalchemy-core-using-where. Using a Core `update()` construct with bindparams you could skip fetching the ids and merging entirely. Instead you'd simply construct an UPDATE statement with a WHERE clause using the `hash_val`. – Ilja Everilä Apr 04 '18 at 19:03
  • I can't update tables defined the way I did it. I'm getting the error: `AttributeError: type object 'Hashes' has no attribute 'update'` – zerocool Apr 05 '18 at 07:54
  • You need to access the underlying `Table` object for working with Core constructs: `Hashes.__table__.update().where(Hashes.hash_val == bindparam('b_hash_val'))`. Note that you have to rename the key "hash_val" in your dictionaries before performing the *executemany* operation, because of the way SQLAlchemy automatically reserves the column names for value binds. – Ilja Everilä Apr 05 '18 at 07:58
  • It worked, thanks! Is it fine to add another answer to my own question updated with your solution? – zerocool Apr 05 '18 at 08:17
  • I don't see why not. – Ilja Everilä Apr 05 '18 at 08:22

From what I see in the SQLAlchemy docs, bulk_insert_objects() will not try to differentiate between existing and new objects. Therefore you are getting an error for the duplicate hash keys in your mapping, since your schema says you can't have duplicate hash_val values in the table.

Possibly you can use bulk_save_objects() instead which, as far as I can tell from the docs, will generate proper INSERT/UPDATE statements where needed:

http://docs.sqlalchemy.org/en/latest/orm/session_api.html#sqlalchemy.orm.session.Session.bulk_save_objects

Note also that the initial problem seems to be with your mapping itself: there is no sense in it having duplicate entries. If the hash_vals are equal, it means either the hash_plains are equal (and you shouldn't be updating the hash at all) or the hashing algorithm is bad and yields equal hash_vals for different hash_plains.
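Ilja's comments under the question outline what making this work with bulk_save_objects would look like: query the existing objects, update them, create what's left, and bulk-save the combination. A rough sketch of that approach, reusing the question's names (untested, illustrative only):

# Load existing rows keyed by hash_val; these become persistent objects,
# so bulk_save_objects will emit UPDATEs for them.
existing = {
    h.hash_val: h
    for h in db.session.query(Hashes).filter(
        Hashes.hash_val.in_([x['hash_val'] for x in hash_good])
    )
}
objects = []
for x in hash_good:
    obj = existing.get(x['hash_val'])
    if obj is None:
        obj = Hashes(hash_val=x['hash_val'])  # new row -> INSERT
    obj.hash_plain = x['hash_plain']          # existing row -> UPDATE
    objects.append(obj)
db.session.bulk_save_objects(objects, update_changed_only=True)
db.session.commit()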

Gnudiff
  • To be honest I tried both the bulk_insert and bulk_save statements and they both fail with a duplicate key error. As for the hashes - they are unique and could be treated as primary keys, even though I have an 'id' field which is unique as well. My initial thought was that using *hash_val* would be enough to update the *hash_plain* values – zerocool Apr 03 '18 at 08:41
  • @zerocool Whatever your initial thought was, you are trying to INSERT a duplicate hash_val, which will fail because of the unique=True constraint you have. Unfortunately, I am not an expert on SQLAlchemy, but if it does not contain functions that are able to use the underlying SQL's features, you would have to either (a) separate existing hash_vals from new ones in the mapping, then UPDATE the first and INSERT the second, or (b) use SQL directly and use MySQL's INSERT ... ON DUPLICATE KEY UPDATE functionality https://stackoverflow.com/questions/4205181/insert-into-a-mysql-table-or-update-if-exists – Gnudiff Apr 03 '18 at 09:00
  • ON DUPLICATE KEY is supported at Core level: https://stackoverflow.com/questions/6611563/sqlalchemy-on-duplicate-key-update/48373874#48373874 – Ilja Everilä Apr 03 '18 at 09:03
  • @IljaEverilä should this work also for bulk inserts then? – Gnudiff Apr 03 '18 at 09:17
  • It compiles to a MySQL INSERT ... ON DUPLICATE KEY UPDATE statement, so yes. – Ilja Everilä Apr 03 '18 at 09:33
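For reference, the Core-level ON DUPLICATE KEY UPDATE support mentioned above (added in SQLAlchemy 1.2) could look roughly like the sketch below; the reuse of `conn` and of the original, un-renamed `hash_good` dictionaries is an assumption for illustration:

from sqlalchemy.dialects.mysql import insert

# Hypothetical sketch: compiles to INSERT ... ON DUPLICATE KEY UPDATE
# hash_plain = VALUES(hash_plain), executed as a single executemany call.
stmt = insert(Hashes.__table__)
stmt = stmt.on_duplicate_key_update(hash_plain=stmt.inserted.hash_plain)
conn.execute(stmt, hash_good)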