
I am trying to import a recent Wikipedia dump into a MySQL database. The problem is that the dump loads a 50 GiB text table through INSERT INTO text statements, and I want to convert these statements into a plain text file.

My text.sql file has the following structure:

INSERT INTO text (old_id,old_text,old_flags) VALUES (id1,'text1','flags1'),(id2,'text2','flags2'),...,(idN,'textN','flagsN');

However, using mysql -u USERNAME -p DBNAME < text.sql is very slow. I have already disabled autocommit, unique_checks, and foreign_key_checks, and I am wrapping the import in a START TRANSACTION; ... COMMIT; block, but the import process is still very slow.

After researching, I read here that using LOAD DATA INFILE can be much faster than using INSERT commands. Therefore, I am looking to convert text.sql to text.txt as follows:

id1,'text1','flags1'
id2,'text2','flags2'
...
idN,'textN','flagsN'

I was thinking of using awk for this, but my experience with regular expressions is very limited. Furthermore, each INSERT command is given on a single line, as shown above, which makes it even harder for me to extract the values.

Given that the text.sql file is 50 GiB, would you recommend using awk, or writing a C/C++ program? If awk is a good approach, how could I achieve the conversion?

Input #1 example:

INSERT INTO text (old_id,old_text,old_flags) VALUES (id1,'text1','flags1'),(id2,'text2','flags2'),(id3,'text3','flags3');

Output #1 example:

id1,'text1','flags1'
id2,'text2','flags2'
id3,'text3','flags3'

Input #2 example: (with parentheses in the values)

INSERT INTO page (page_id,page_namespace,page_title,page_restrictions,page_is_redirect,page_is_new,page_random,page_touched,page_latest,page_len,page_content_model) VALUES (10,0,'AccessibleComputing','',1,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,631144794,69,'wikitext'),(12,0,'Anarchism','',0,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,703037144,180446,'wikitext');

Output #2 example:

10,0,'AccessibleComputing','',1,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,631144794,69,'wikitext'
12,0,'Anarchism','',0,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,703037144,180446,'wikitext'

Input #3 example: (with escaped ' or ")

INSERT INTO text (old_id,old_text,old_flags) VALUES (631144794,'#REDIRECT [[Computer accessibility]]\n\n{{Redr|move|from CamelCase|up}}','utf-8'),(703037144,'{{Redirect2|Anarchist|Anarchists|the fictional character|Anarchist (comics)|other uses|Anarchists (disambiguation)}}\n{{pp-move-indef}}\n{{Use British English|date=January 2014}}','utf-8');

Output #3 example:

631144794,'#REDIRECT [[Computer accessibility]]\n\n{{Redr|move|from CamelCase|up}}','utf-8'
703037144,'{{Redirect2|Anarchist|Anarchists|the fictional character|Anarchist (comics)|other uses|Anarchists (disambiguation)}}\n{{pp-move-indef}}\n{{Use British English|date=January 2014}}','utf-8'

Edit: after conducting some more research, it appears that examples #2 and #3 cannot be converted with regular expressions alone; sources: #1, #2.
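That said, a character-by-character scan that tracks quote state, escapes, and parenthesis depth can handle all three examples, since it is not a single regular expression. A sketch in awk (my assumptions: values are single-quoted with backslash escapes, and each statement has exactly one " VALUES " clause before the tuples):

```shell
# Split each "INSERT ... VALUES (...),(...);" line into one value row per
# line, tracking single-quote state ("\047"), backslash escapes, and
# parenthesis depth so that RAND(), DATE_ADD(...) and quoted text survive.
awk '{
  rest = substr($0, index($0, " VALUES ") + 8)
  depth = 0; inq = 0; buf = ""
  n = length(rest)
  for (i = 1; i <= n; i++) {
    c = substr(rest, i, 1)
    if (inq) {
      if (c == "\\") {                 # keep escape sequences verbatim
        buf = buf c; i++; buf = buf substr(rest, i, 1); continue
      }
      if (c == "\047") inq = 0         # closing single quote
    } else {
      if (c == "\047") inq = 1         # opening single quote
      else if (c == "(") { depth++; if (depth == 1) continue }
      else if (c == ")") {
        depth--
        if (depth == 0) { print buf; buf = ""; i++; continue }  # skip "," or ";"
      }
    }
    buf = buf c
  }
}' text.sql > text.txt
```

Parentheses inside quotes never touch the depth counter, and unquoted ones like RAND() only open nested levels, so tuples are split exactly where depth returns to zero.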

Adama
  • I'd split the file up into smaller chunks once converted, e.g. using `split --lines n` instead of one 50G file – Cez Mar 20 '16 at 15:16

2 Answers


If this isn't what you want:

$ awk -v FPAT='[(][^)]+[)]' '{for (i=2;i<=NF;i++) print substr($i,2,length($i)-2)}' file
id1,'text1','flags1'
id2,'text2','flags2'
idN,'textN','flagsN'

then edit your question to provide clearer, testable sample input and expected output.

The above used GNU awk for FPAT, with other awks you'd use a while(match()) loop.
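A POSIX-awk sketch of that while(match()) loop, with the same limitation as the FPAT one-liner (it assumes no ")" appears inside a value):

```shell
# POSIX-awk equivalent of the FPAT one-liner: find each "(...)" group,
# skip the first (the column list), and print the rest without the parens.
awk '{
  s = $0; first = 1
  while (match(s, /\([^)]+\)/)) {
    if (!first) print substr(s, RSTART + 1, RLENGTH - 2)
    first = 0
    s = substr(s, RSTART + RLENGTH)
  }
}' file
```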

Ed Morton
  • You are correct, Ed, I've edited my question as it wasn't very clear. All values in a single INSERT statement are in a single line; your code snippet would indeed work if they were in different lines (as I had originally written). I've now provided a clear input & output example. – Adama Mar 20 '16 at 03:04
  • I edited my answer, take a look now. We do frequently see people post sample input that doesn't look like their real input - I sometimes wonder if at a home they ask for help with how to fry an egg when they're really grilling a steak :-). – Ed Morton Mar 20 '16 at 13:12
  • Cool thanks. It works for example #1 and indeed it converted one table (redirects). However, other tables have parentheses or escaped `\'` or `\"` in their values. I've added examples #2 and #3 to demonstrate this. Btw, after looking up FPAT, it turned out that I needed to install `gawk`; regular `awk` won't work there. :-) – Adama Mar 20 '16 at 15:13
  • @Adam see the last sentence in my answer where I told you you needed GNU awk for FPAT and what to do with other awks. I'm sorry, I'm not willing to peel the onion with you as you post more and more examples of your real input one or 2 at a time, maybe someone else will. Good luck! – Ed Morton Mar 20 '16 at 16:11
  • 1
    Yes, I should had posted indeed all use cases in the beginning. So, after conducting some more research I found out that regular expressions should not be used for the latter examples. Since my question was about regular expressions and your solution worked for the only example where they could be applied to... I'm covered. :-) Thanks. – Adama Mar 21 '16 at 17:53

Use this:

sed -e 's/(//' -e 's/),//' test.csv

(appropriately piped) and all your lines will be clean.

Change first and last lines manually.
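Since the question's statements put every tuple on one line, a GNU sed variant of the same idea might look like this (a sketch; like the awk answer above, it assumes no value contains the sequence "),("):

```shell
# GNU sed (\n in the replacement is a GNU extension): strip the
# "INSERT INTO ... VALUES (" prefix, drop the trailing ");", then break
# the "),(" separators into newlines, one value row per line.
sed -e 's/^INSERT INTO [^(]*([^)]*) VALUES (//' \
    -e 's/);$//' \
    -e 's/),(/\n/g' text.sql > text.txt
```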

Regards

White Feather