
I have to process some data by combining two different files. Both of them have two columns that together form a primary key I can use to match them side by side. The files in question are huge (around 5 GB, with 20 million rows), so I need efficient code. How would I do this in Perl?

Here is an example:

If File A contains columns

id, name, lastname, dob, school

File B contains columns

address, id, postcode, dob, email

I would need to join these two files by matching id and dob in the two files to have an output file that would have the columns:

 id, name, lastname, dob, school, address, postcode, email
sfactor

6 Answers


Think I would just create a new MySQL/SQLite/whatever DB and insert the rows. Should be ~20 lines of Perl.

This, of course, requires easy access to a DB.

Guess you could also sort the files by the interesting fields and then, for each line in file1, find and print the matching lines in file2.
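A sketch of the DB route (my illustration, not the answerer's actual code), using DBI with DBD::SQLite. The file names, the `join.db` database, and the comma-separated layout are assumptions; the bulk loads are wrapped in one large transaction per file, which matters a lot for SQLite insert speed:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;   # with DBD::SQLite from CPAN

# usage: load_and_join.pl fileA.csv fileB.csv > joined.csv  (names hypothetical)
if (@ARGV == 2) {
    my ($file_a, $file_b) = @ARGV;
    my $dbh = DBI->connect('dbi:SQLite:dbname=join.db', '', '',
                           { RaiseError => 1, AutoCommit => 1 });

    $dbh->do('CREATE TABLE a (id TEXT, name TEXT, lastname TEXT, dob TEXT, school TEXT)');
    $dbh->do('CREATE TABLE b (address TEXT, id TEXT, postcode TEXT, dob TEXT, email TEXT)');

    # Bulk-load each file inside one large transaction.
    for ([$file_a, 'INSERT INTO a VALUES (?,?,?,?,?)'],
         [$file_b, 'INSERT INTO b VALUES (?,?,?,?,?)']) {
        my ($file, $sql) = @$_;
        my $ins = $dbh->prepare($sql);
        open my $fh, '<', $file or die "$file: $!";
        $dbh->begin_work;
        while (my $line = <$fh>) {
            chomp $line;
            $ins->execute(split /,/, $line, 5);
        }
        $dbh->commit;
        close $fh;
    }

    # Index the join key so the join doesn't scan 20M rows per lookup.
    $dbh->do('CREATE INDEX a_key ON a (id, dob)');
    $dbh->do('CREATE INDEX b_key ON b (id, dob)');

    # Stream the joined rows out rather than fetching them all at once.
    my $sth = $dbh->prepare(q{
        SELECT a.id, a.name, a.lastname, a.dob, a.school,
               b.address, b.postcode, b.email
        FROM a JOIN b ON a.id = b.id AND a.dob = b.dob
    });
    $sth->execute;
    while (my @row = $sth->fetchrow_array) {
        print join(',', @row), "\n";
    }
    $dbh->disconnect;
}
```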

Øyvind Skaar
    ... and you can build a copy of SQLite directly from CPAN (DBD::SQLite). Make it a point to use large transactions when inserting a lot of data into SQLite, by the way. – tsee Jan 04 '12 at 17:54

The old-fashioned way to do this is to use system utilities to sort both files in key sequence and then match them line by line. Read both files; if the keys match, output the data. If they don't match, read from the file with the lesser key until they do. Set the key infinitely high for a file once it hits EOF. When both keys are infinitely high, you're done.
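That merge loop might look something like the sketch below (my code, not the answerer's). It assumes comma-separated fields in the column orders from the question, inputs already sorted on the id+dob key, and a key that is unique within each file; the "infinitely high" key is a sentinel string that sorts after any real key:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $HIGH = "\x{10FFFF}";   # sentinel key that sorts after any real key

# Return (key, fields) for the next line of $fh, or the sentinel at EOF.
# @key_cols are the zero-based columns that form the id+dob key.
sub next_rec {
    my ($fh, @key_cols) = @_;
    my $line = <$fh>;
    return ($HIGH, undef) unless defined $line;
    chomp $line;
    my @f = split /,/, $line, -1;
    return (join("\0", @f[@key_cols]), \@f);
}

# Merge two pre-sorted filehandles; print joined rows to $out.
sub merge_join {
    my ($fa, $fb, $out) = @_;
    # File A: id is col 0, dob is col 3.  File B: id is col 1, dob is col 3.
    my ($ka, $ra) = next_rec($fa, 0, 3);
    my ($kb, $rb) = next_rec($fb, 1, 3);
    until ($ka eq $HIGH && $kb eq $HIGH) {
        if ($ka eq $kb) {
            # Keys match: emit id,name,lastname,dob,school,address,postcode,email
            print {$out} join(',', @$ra, @{$rb}[0, 2, 4]), "\n";
            ($ka, $ra) = next_rec($fa, 0, 3);
            ($kb, $rb) = next_rec($fb, 1, 3);
        }
        elsif ($ka lt $kb) { ($ka, $ra) = next_rec($fa, 0, 3) }
        else               { ($kb, $rb) = next_rec($fb, 1, 3) }
    }
}

if (@ARGV == 2) {   # usage: merge_join.pl sortedA.csv sortedB.csv > out.csv
    open my $fa, '<', $ARGV[0] or die "$ARGV[0]: $!";
    open my $fb, '<', $ARGV[1] or die "$ARGV[1]: $!";
    merge_join($fa, $fb, \*STDOUT);
}
```

Only one line from each file is in memory at a time, so this handles 5 GB inputs comfortably; the hard work is done by the external sort.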

Bill Ruppert

Or, peruse this nice TechRepublic article - you are still liable to need 5 GB of memory, though. I wonder where the unix/linux CLI sort/join utilities would take you, efficiency-wise. Just a thought.
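For what it's worth, the sort/join route might look like this (a sketch; the file names are hypothetical and it assumes the fields themselves contain no commas). Since join takes only a single join field, a composite id|dob key is prepended first:

```shell
# Prepend a composite "id|dob" key as field 1, sort on it, then join.
# Column orders follow the question: A = id,name,lastname,dob,school
#                                    B = address,id,postcode,dob,email
export LC_ALL=C   # consistent byte-order collation for sort and join

awk -F, -v OFS=, '{print $1 "|" $4, $0}' fileA.csv | sort -t, -k1,1 > a.sorted
awk -F, -v OFS=, '{print $2 "|" $4, $0}' fileB.csv | sort -t, -k1,1 > b.sorted

# Join on the synthetic key and emit the eight output columns.
join -t, -j 1 -o 1.2,1.3,1.4,1.5,1.6,2.2,2.4,2.6 a.sorted b.sorted > joined.csv
```

sort spills to temporary files on disk, so this stays within modest memory even for 5 GB inputs.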

Alien Life Form

I haven't actually tried this, but a more creative solution could be:

  1. Read each file once and build a map from each unique id+dob combination to that line's byte position in the file, using tell().
  2. Keep the map in a Perl hash.
  3. Read the actual data back from the files at the positions in the map, using seek() and sysread().
  4. Write the joined data to a new file.
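A rough sketch of those steps (my code, not the answerer's; it uses seek() plus readline rather than sysread() for simplicity, and note that two hashes of ~20 million offsets still have to fit in RAM):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Step 1-2: build key => byte-offset map for one filehandle.  @key_cols
# are the columns holding id and dob; tell() before each read gives the
# offset of the line about to be read.
sub index_file {
    my ($fh, @key_cols) = @_;
    my %pos;
    my $off = tell $fh;
    while (my $line = <$fh>) {
        my @f = split /,/, $line;
        $pos{ join "\0", @f[@key_cols] } = $off;
        $off = tell $fh;
    }
    return \%pos;
}

# Fetch the line that starts at byte $off.
sub line_at {
    my ($fh, $off) = @_;
    seek $fh, $off, 0 or die "seek: $!";
    my $line = <$fh>;
    chomp $line;
    return $line;
}

# Steps 3-4: for every key present in both maps, re-read both lines and
# emit the combined record.
sub join_indexed {
    my ($fa, $fb, $pa, $pb, $out) = @_;
    for my $key (sort keys %$pa) {
        next unless exists $pb->{$key};
        my @a = split /,/, line_at($fa, $pa->{$key});
        my @b = split /,/, line_at($fb, $pb->{$key});
        print {$out} join(',', @a, @b[0, 2, 4]), "\n";
    }
}

if (@ARGV == 2) {   # usage: index_join.pl fileA.csv fileB.csv > out.csv
    open my $fa, '<', $ARGV[0] or die "$ARGV[0]: $!";
    open my $fb, '<', $ARGV[1] or die "$ARGV[1]: $!";
    my $pa = index_file($fa, 0, 3);   # A: id is col 0, dob is col 3
    my $pb = index_file($fb, 1, 3);   # B: id is col 1, dob is col 3
    join_indexed($fa, $fb, $pa, $pb, \*STDOUT);
}
```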
Øyvind Skaar

You can also use my 3-year-old CPAN module Set::Relation, which is designed to do things like this: it lets you use all the SQL features, such as join, in Perl. Create a Set::Relation object for each file and then use the join() method. That said, the module as implemented keeps all of its operands and results in memory, so it is limited by your RAM. But you could still look at its source to see how join() works and then implement a more efficient version for your purposes based on it.


Also, you can try DBD::AnyData.

KneLL