
This is a follow-up to the answer provided here on using sqldf():

https://stackoverflow.com/a/1820610

In my particular case, I have a tab-delimited file with over 110 million rows. I'd like to select the rows that match 4.6 million tag IDs.

In the following code, the tag IDs are in tag.query.

While this approach works for a smaller query, it does not handle a query of this size:

sql.query <- paste('select * from f where v2 in (', tag.query, ')', sep = '')
selected.df <- sqldf(sql.query, dbname = tempfile(),
                     file.format = list(header = FALSE, row.names = FALSE,
                                        sep = "\t", skip = line.where.header.is))

Any suggestions on alternative approaches?

andrewj

2 Answers


If the problem is speed, try creating an index on v2; see example 4i on the sqldf home page. If that is still not fast enough, you could also try a different database: in addition to the default SQLite, sqldf supports H2, MySQL and PostgreSQL.
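
For example, sqldf accepts a vector of SQL statements and returns the result of the last one, so the index can be built right after the file is loaded and before the select runs. A rough sketch along those lines, reusing f, tag.query and line.where.header.is from the question (the index name is just a placeholder):

library(sqldf)

# f is a file() connection to the tab-delimited data, as in the question.
sql.query <- paste("select * from f where v2 in (", tag.query, ")", sep = "")

# The statements run in order, so the index on v2 exists before the
# big IN (...) query is executed; only the select's result is returned.
selected.df <- sqldf(c("create index iv2 on f(v2)", sql.query),
                     dbname = tempfile(),
                     file.format = list(header = FALSE, row.names = FALSE,
                                        sep = "\t", skip = line.where.header.is))

If you need to run several queries against the same data, example 4i on the sqldf home page shows how to keep the loaded, indexed table in a persistent database file instead of a tempfile().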

G. Grothendieck

You would indeed need to index your table, as mentioned earlier. In my experience, though, SQLite has trouble building an index on more than about 10 million records: it becomes extremely slow, and when I tried with 40 million records it effectively froze. I don't know how other databases perform on CREATE INDEX for tables this large.

I had the same problem, and I ended up sorting the table by tag ID and writing it out to a text file. I then wrote a binary search in C++ that looked up the tag IDs directly in that file. It sped up execution tremendously, since a binary search is O(log N) versus O(N) for a linear (grep-style) scan, with N in the tens of millions. I can share it if you need.
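
If it helps, here is a rough R sketch of the same idea (my actual implementation was in C++). It assumes the file has already been sorted on the tag-ID column as plain strings (in the same collation R uses for string comparison), that the tag ID is the first field on each line, and that there are no blank lines; the path, separator and tag ID below are placeholders.

# Binary-search a file that is sorted on its first column and return
# every line whose first field equals 'key'.
lookup_tag <- function(path, key, sep = "\t") {
  con <- file(path, open = "rb")
  on.exit(close(con))

  first_field <- function(line) strsplit(line, sep, fixed = TRUE)[[1]][1]

  # Seek to a byte offset, drop the (possibly partial) line we landed in,
  # and return the next complete line.
  line_after <- function(pos) {
    seek(con, where = pos, origin = "start")
    if (pos > 0) readLines(con, n = 1, warn = FALSE)
    readLines(con, n = 1, warn = FALSE)
  }

  lo <- 0
  hi <- file.info(path)$size
  # Narrow [lo, hi] so that complete lines after lo have keys < key
  # and complete lines after hi have keys >= key (or hi is past the end).
  while (hi - lo > 1) {
    mid  <- (lo + hi) %/% 2
    line <- line_after(mid)
    if (length(line) == 0 || first_field(line) >= key) hi <- mid else lo <- mid
  }

  # Scan forward from lo and collect the consecutive matching lines.
  seek(con, where = lo, origin = "start")
  if (lo > 0) readLines(con, n = 1, warn = FALSE)
  hits <- character(0)
  repeat {
    line <- readLines(con, n = 1, warn = FALSE)
    if (length(line) == 0) break
    k <- first_field(line)
    if (k < key) next      # the last few lines just before the first match
    if (k > key) break     # past the block of matching lines
    hits <- c(hits, line)
  }
  hits
}

# Example call (placeholder file name and tag ID):
# lookup_tag("bigfile_sorted_by_v2.tab", "TAG0001234")

Looping over the tag IDs and calling this once per ID gives the matching rows without ever loading the full file into memory.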

Arnaud A