I am faced with a dilemma: write two lines of code that empty my table, or 50 lines of code that check for duplicates every time I run my spiders. I am very tempted to do the former, but I remember reading somewhere that it is bad practice. I was wondering what other, more experienced programmers think of this dilemma.

Tank
  • Is there any particular database you're working with? From my point of view, if emptying the table gives you better performance, an easier implementation, and logic that is easier to understand, why not? Logging can potentially be a downside, but it can be avoided with `TRUNCATE` in MySQL, for example. – ydoow Dec 08 '16 at 05:05
  • I am using SQLite. The database is filled with events, and every day the spiders I have created scrape data from event websites. Without emptying the table I obviously get loads of duplicates. – Tank Dec 08 '16 at 05:10

1 Answer

I think it depends on how many columns are written each time a record is inserted, compared with how many columns you would need to change with a diff-check update.

For example, if your table has 20 columns and a diff-check update only has to change 1 of them, and the table has a good index for finding the row to update, then you should go for the diff-check update.

However, if the update touches as many columns as an insert would, or if it takes a very complicated condition to find the particular row to update, I think deleting and inserting new rows is not a bad choice.
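
To give an idea of how small the duplicate check can be in SQLite, here is a minimal sketch using Python's built-in `sqlite3` module. The `events` table, its column names, and the choice of `(name, day, venue)` as the key that identifies a duplicate are all assumptions made for illustration; the real schema is not shown in the question. The `UNIQUE` constraint gives SQLite an index to check against, and `INSERT OR IGNORE` skips any row that would violate it.

```python
import sqlite3


def save_events(conn, events):
    """Insert scraped events, skipping rows that are already stored.

    Returns the number of rows that were actually added.
    """
    # Hypothetical schema: the column names and the uniqueness key are assumptions.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS events (
            name  TEXT NOT NULL,
            day   TEXT NOT NULL,
            venue TEXT NOT NULL,
            url   TEXT,
            price TEXT,
            UNIQUE (name, day, venue)
        )
        """
    )
    before = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
    with conn:  # commits on success, rolls back on error
        # Rows that violate the UNIQUE constraint are silently skipped.
        conn.executemany(
            "INSERT OR IGNORE INTO events (name, day, venue, url, price) "
            "VALUES (?, ?, ?, ?, ?)",
            events,
        )
    after = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
    return after - before
```

Note that this only skips duplicates; it does not update a stored row whose other columns have changed, so it fits best when a scraped event does not change after it is first seen.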

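Also, for emptying the entire table, you are better off using a dedicated mechanism instead of a row-by-row delete. For example, SQL Server has `TRUNCATE TABLE`, which is relatively fast. SQLite has no `TRUNCATE` statement, but a `DELETE` without a `WHERE` clause is already optimized there, and `VACUUM` can reclaim the freed space afterwards (I don't use SQLite often, so I can't be sure how much it helps).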
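
For the empty-and-refill route in SQLite, a rough sketch with Python's built-in `sqlite3` module could look like the following. The `events` table and its five columns are hypothetical stand-ins, since the actual schema is not given. The delete and the inserts share one transaction, so a failed scrape does not leave the table empty, and `VACUUM` is run afterwards because SQLite does not allow it inside a transaction.

```python
import sqlite3


def replace_all_events(conn, events):
    """Empty the events table and refill it with freshly scraped rows."""
    # Hypothetical schema: table and column names are assumptions.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS events (
            name  TEXT,
            day   TEXT,
            venue TEXT,
            url   TEXT,
            price TEXT
        )
        """
    )
    with conn:  # one transaction: rolled back as a whole if an insert fails
        conn.execute("DELETE FROM events")  # no WHERE clause: removes every row
        conn.executemany(
            "INSERT INTO events (name, day, venue, url, price) "
            "VALUES (?, ?, ?, ?, ?)",
            events,
        )
    # VACUUM must run outside a transaction; it rebuilds the database file
    # so that the space freed by the delete is reclaimed.
    conn.execute("VACUUM")
```

With only about a thousand rows per run, the `VACUUM` step is optional and mostly matters if the file size is a concern.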

SKLTFZ
  • At the moment I am emptying the table -> VACUUM and refilling it. There are 5 columns and about 1000 rows, of which 800 would not have needed to be added again since they had already been scraped the day before. – Tank Dec 08 '16 at 05:15
  • Do you mean that only 200 rows need updating on each run, rather than deleting and inserting 1000 rows? If so, and assuming the existing records can easily be looked up by index to find the rows that need updating, then I think the diff-check update is better. I would use delete and insert when the update is really complicated for the database (no good index) or when almost all the records need to be updated. – SKLTFZ Dec 08 '16 at 05:26
  • I only need to update around 200 events. It depends on how many events were on the day that just passed and how many events are on the day that is supposed to be added. The thing is, though, that new events might be created, and therefore just skipping some days would be inaccurate. – Tank Dec 08 '16 at 05:35
  • I think data integrity depends on how you implement your program. As for database performance: in your situation, you can estimate the average number of rows that need updating from your previous runs and decide whether or not to implement the diff-check update. The truth is that an update is just a hidden record insertion and deletion, so a large number of updates can cause performance issues for your database. You may want to refer to this too: http://stackoverflow.com/questions/1271641/in-sql-is-update-always-faster-than-deleteinsert – SKLTFZ Dec 08 '16 at 05:46