13

I am interested in whether there is a simple way to import a mysqldump into Pandas.

I have a few small (~110MB) tables and I would like to have them as DataFrames.

I would like to avoid having to put the data back into a database, since that would require installing and connecting to such a database. I have the .sql files and want to import the contained tables into Pandas. Does any module exist to do this?

If versioning matters, the .sql files all list "MySQL dump 10.13 Distrib 5.6.13, for Win32 (x86)" as the system the dump was produced on.

Background in hindsight

I was working locally on a computer with no database connection. The normal flow for my work was to be given a .tsv, .csv or json from a third party and to do some analysis which would be given back. A new third party gave all their data in .sql format, and this broke my workflow since I would need a lot of overhead to get it into a format which my programs could take as input. We ended up asking them to send the data in a different format, but for business/reputation reasons we wanted to look for a workaround first.

Edit: Below is a sample mysqldump file with two tables.

/*
MySQL - 5.6.28 : Database - ztest
*********************************************************************
*/


/*!40101 SET NAMES utf8 */;

/*!40101 SET SQL_MODE=''*/;

/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/`ztest` /*!40100 DEFAULT CHARACTER SET latin1 */;

USE `ztest`;

/*Table structure for table `food_in` */

DROP TABLE IF EXISTS `food_in`;

CREATE TABLE `food_in` (
  `ID` int(11) NOT NULL AUTO_INCREMENT,
  `Cat` varchar(255) DEFAULT NULL,
  `Item` varchar(255) DEFAULT NULL,
  `price` decimal(10,4) DEFAULT NULL,
  `quantity` decimal(10,0) DEFAULT NULL,
  KEY `ID` (`ID`)
) ENGINE=InnoDB AUTO_INCREMENT=10 DEFAULT CHARSET=latin1;

/*Data for the table `food_in` */

insert  into `food_in`(`ID`,`Cat`,`Item`,`price`,`quantity`) values
(2,'Liq','Beer','2.5000','300'),
(7,'Liq','Water','3.5000','230'),
(9,'Liq','Soda','3.5000','399');

/*Table structure for table `food_min` */

DROP TABLE IF EXISTS `food_min`;

CREATE TABLE `food_min` (
  `Item` varchar(255) DEFAULT NULL,
  `quantity` decimal(10,0) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

/*Data for the table `food_min` */

insert  into `food_min`(`Item`,`quantity`) values
('Pizza','300'),
('Hotdogs','200'),
('Beer','300'),
('Water','230'),
('Soda','399'),
('Soup','100');

/*!40101 SET SQL_MODE=@OLD_SQL_MODE */;
/*!40014 SET FOREIGN_KEY_CHECKS=@OLD_FOREIGN_KEY_CHECKS */;
/*!40014 SET UNIQUE_CHECKS=@OLD_UNIQUE_CHECKS */;
/*!40111 SET SQL_NOTES=@OLD_SQL_NOTES */;
Keith
  • After some research it looks like there is no library/module to do this. I will leave the question in the hopes that eventually there is. – Keith Dec 28 '14 at 20:31
  • @Merlin: Someone has apparently created [a Python script](https://github.com/jamesmishra/mysqldump-to-csv) to convert mysqldump into CSV, if that helps. – BrenBarn May 17 '16 at 06:30

5 Answers

20

No

Pandas has no native way of reading a mysqldump without it passing through a database.

There is a possible workaround, but it is in my opinion a very bad idea.

Workaround (Not recommended for production use)

Of course you could parse the data from the mysqldump file using a preprocessor.

mysqldump files contain a lot of extra data we are not interested in when loading a pandas DataFrame, so we need to preprocess the file, remove the noise, and reformat the lines so that they conform to what pandas.read_csv expects.

Using StringIO we can read the file and process the data before it is fed to the pandas.read_csv function:

from io import StringIO
import re

def read_dump(dump_filename, target_table):
    sio = StringIO()

    fast_forward = True
    with open(dump_filename, 'r') as f:
        for line in f:
            line = line.strip()
            # Skip ahead until the INSERT statement for the target table
            if line.lower().startswith('insert') and target_table in line:
                fast_forward = False
            if fast_forward:
                continue
            # Grab the parenthesised value tuples and write them as CSV rows
            data = re.findall(r'\([^\)]*\)', line)
            try:
                newline = data[0]
                newline = newline.strip(' ()')
                newline = newline.replace('`', '')
                sio.write(newline)
                sio.write("\n")
            except IndexError:
                pass
            # Stop at the end of the INSERT statement
            if line.endswith(';'):
                break
    sio.seek(0)
    return sio

Now that we have a function that reads and formats the data so that it looks like a CSV file, we can read it with pandas.read_csv():

import pandas as pd

food_min_filedata = read_dump('mysqldumpexample', 'food_min')
food_in_filedata = read_dump('mysqldumpexample', 'food_in')

df_food_min = pd.read_csv(food_min_filedata)
df_food_in = pd.read_csv(food_in_filedata)

Results in:

        Item quantity
0    'Pizza'    '300'
1  'Hotdogs'    '200'
2     'Beer'    '300'
3    'Water'    '230'
4     'Soda'    '399'
5     'Soup'    '100'

and

   ID    Cat     Item     price quantity
0   2  'Liq'   'Beer'  '2.5000'    '300'
1   7  'Liq'  'Water'  '3.5000'    '230'
2   9  'Liq'   'Soda'  '3.5000'    '399'
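
As the output above shows, the string values keep the single quotes from the dump. If that is unwanted, read_csv's quotechar parameter can strip them; a small sketch building on the code above (not part of the original answer):

# Treat single quotes as the quoting character so 'Pizza' is read as Pizza
food_min_filedata = read_dump('mysqldumpexample', 'food_min')
df_food_min = pd.read_csv(food_min_filedata, quotechar="'")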

Note on Stream processing

This approach is called stream processing and it uses very little memory, since the file is never held in memory all at once. In general it is a good idea to use this approach to read CSV files into pandas more efficiently.
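
For ordinary CSV files, pandas offers the same low-memory idea out of the box through chunked reading; a minimal sketch (the file name and the filter on quantity are placeholders for illustration, not part of the original answer):

import pandas as pd

# Read the file in manageable chunks instead of loading it all at once;
# 'big_file.csv' is a placeholder name, and the filter on 'quantity'
# merely illustrates per-chunk processing.
chunks = pd.read_csv('big_file.csv', chunksize=100_000)
df = pd.concat(chunk[chunk['quantity'] > 0] for chunk in chunks)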

It is the parsing of a mysqldump file that I advise against.

firelynx
  • I don't understand the negativity towards parsing SQL dump files, especially given that he controls their generation as well and can be fairly certain that any future dumps will be in the same format if he uses same software version and same command line arguments. It should be an order of magnitude faster to read this data directly from disk instead of loading it to a transactional database and then reading it out again. – NikoNyrh May 17 '16 at 13:59
  • @NikoNyrh Changing mysql version may change the layout of the dump, which means you may have to rewrite your code. This is by definition tight coupling and is an anti-pattern. – firelynx May 17 '16 at 14:02
  • True that parsing unconstrained file isn't ideal but in this case it still seems the best solution. And if it doesn't work you'll get some error message and you can tweak the code to accommodate small deviations. But this is getting opinionated, glad you still provided sample code for him. – NikoNyrh May 17 '16 at 14:13
  • @NikoNyrh I just don't want people to consider this answer as a "great idea" or "something that should be fine to do without considering the consequences". I especially don't want someone to say "I shouldn't have copypasted firelynx's code, he sucks". Because a lot of people copypaste code from StackOverflow and few answers here provide warnings to why copypasting the code may be a bad idea. – firelynx May 17 '16 at 14:26
  • Hey guys, just a little context so you understand my original question. I was working locally on a computer with no data base connection. The normal flow for my work was to be given a .tsv, .csv or json from a third party and to do some analysis which would be given back. A new third party gave all their data in .sql format and this broke my workflow since I would need a lot of overhead to get it into a format which my programs could take as input. We ended up asking them to send the data in a different format but for business/reputation reasons wanted to look for a work around first. Thanks! – Keith May 18 '16 at 19:03
  • Hey @Keith. Thanks for the extra background, it's always good to have. I will add that to your original question if you don't mind. Also, the workaround I built in my answer is not recommended for production use. If you are hacking stuff locally, it's quite alright to use in my opinion. I will add this to my answer. – firelynx May 19 '16 at 06:56
  • Importing data is always hacky but this does seem a little more than normal. I still am unconvinced that a function could not be put into pandas which works 99% of the time. – Keith May 19 '16 at 17:21
  • @Keith: Any function that could be put into pandas would have to be doing basically the same thing as this, and would be subject to similar risks. The basic problem is that SQL itself is a standardized API and you can rely on reading from an SQL database in a consistent way, but MySQLDump is just a MySQL-internal format that could change between MySQL versions. – BrenBarn May 20 '16 at 19:01
  • @BrenBarn I would be totally OK with that risk and would use the function. The best way to get around it would be to have a MySQL Version flag that throws a warning unless it is set to a value which has been tested. – Keith May 20 '16 at 21:30
  • One more amendment: I received an sqldump where I didn't control its parameters. One particularly large table has multiple "insert" lines, so where you have "if line.endswith(';'): break" I modified to: "read_mode=0; continue" That causes the loop to process multiple inserts. – Frank M May 03 '18 at 00:17
  • I still do not understand where the problem is here and why it is coupled. It's only a string we are interested in. Using regular expressions should give the desired values (without the header), as the pattern is tied to the MySQL definition. We can assume that the values are always inserted with insert into statements. Removing the noise with a regex should give the results... – Peter Ebelsberger Mar 20 '20 at 23:08
  • In Uncle Bob's book "Clean Code" there is a chapter about Boundaries. There is a clear boundary between MySQL, its schema, mysqldump's representation of it -and- pandas parsing that output. – firelynx Mar 21 '20 at 12:25
4

One way is to export the mysqldump to SQLite (e.g. with a mysqldump-to-sqlite conversion shell script) and then read the SQLite file/database.

See the SQL section of the docs:

pd.read_sql_table(table_name, sqlite_file)
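
Note that read_sql_table expects a SQLAlchemy connectable rather than a bare file path; with a plain sqlite3 connection you can use read_sql_query instead. A minimal sketch, assuming the dump has already been converted into a SQLite file (ztest.sqlite is a placeholder name based on the sample database):

import sqlite3
import pandas as pd

# Placeholder path: the SQLite file produced by converting the mysqldump
con = sqlite3.connect('ztest.sqlite')
df_food_in = pd.read_sql_query('SELECT * FROM food_in', con)
df_food_min = pd.read_sql_query('SELECT * FROM food_min', con)
con.close()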

Another option is just to run read_sql on the mysql database directly...

Andy Hayden
  • I am trying to avoid putting the data back into a database and would like to read the dump files directly. I have updated the question. – Keith Dec 20 '14 at 21:56
  • @Keith pandas doesn't do updates efficiently (it's not a database!) so generally you want to construct in one go. – Andy Hayden Dec 20 '14 at 22:20
  • I am not sure I get your meaning. I want to analyze some data in Pandas given as a mysqldump. Normally I am used to getting .tsv files which are super easy to import. I was hoping that the change in format would not significantly alter my work flow. – Keith Dec 20 '14 at 23:58
  • @Keith Could you give a sample of the mysqldump output? I was under the impression it would be some crazy SQL query (with updates and values). *There are a couple of libs that allow sql queries on pandas objects (but like I say pandas objects do not update piecemeal efficiently) https://github.com/yhat/pandasql and (I'm sure there's [another](https://pypi.python.org/pypi?%3Aaction=search&term=pandas&submit=search) but can't recall it).* – Andy Hayden Dec 21 '14 at 02:01
  • mysqldump is a standardized file format to transfer tables. – Keith Dec 28 '14 at 20:32
  • Do you have an example? Will have a look later. – Andy Hayden Dec 28 '14 at 22:37
  • @Keith just to step back a little here, could you not export a csv for each table (and use `pandas.read_csv`)? – Andy Hayden Jan 08 '15 at 02:18
  • Yep, of course I could, but this is not the question. The use case is stripped down a little, but the basic point is that people generally use a mysqldump to pass tables between each other. I want to know whether I have to read that in with MySQL at all or whether I can just use Python. – Keith Jan 09 '15 at 06:22
  • Not working. Got error: Could not parse rfc1738 URL from string 'backup.sql' – Nurul Akter Towhid Oct 07 '16 at 03:45
3

I found myself in a similar situation to yours, and the answer from @firelynx was really helpful!

But since I had only limited knowledge of the tables included in the file, I extended the script by adding header generation (pandas picks it up automatically), as well as searching for all the tables within the dump file. As a result, I ended up with the following script, which indeed works extremely fast. I switched to io.StringIO, and I save the resulting tables as table_name.csv files.

P.S. I also support the advice against relying on this approach, and provide the code just for illustration purposes :)

So, first things first, we can augment the read_dump function like this:

from io import StringIO
import re, shutil

def read_dump(dump_filename, target_table):
    sio = StringIO()

    read_mode = 0 # 0 - skip, 1 - header, 2 - data
    with open(dump_filename, 'r') as f:
        for line in f:
            line = line.strip()
            if line.lower().startswith('insert') and target_table in line:
                read_mode = 2
            if line.lower().startswith('create table') and target_table in line:
                read_mode = 1
                continue

            if read_mode == 0:
                continue

            # Filling up the headers
            elif read_mode == 1:
                if line.lower().startswith('primary'):
                    # add more conditions here for different cases
                    # (e.g. when simply a key is defined, or no key is defined)
                    read_mode = 0
                    sio.seek(sio.tell()-1) # delete last comma
                    sio.write('\n')
                    continue
                colheader = re.findall(r'`([\w_]+)`', line)
                for col in colheader:
                    sio.write(col.strip())
                    sio.write(',')

            # Filling up the data - same as @firelynx's code
            elif read_mode == 2:
                data = re.findall(r'\([^\)]*\)', line)
                try:
                    newline = data[0]
                    newline = newline.strip(' ()')
                    newline = newline.replace('`', '')
                    sio.write(newline)
                    sio.write("\n")
                except IndexError:
                    pass
                if line.endswith(';'):
                    break
    sio.seek(0)
    with open(target_table + '.csv', 'w') as fd:
        shutil.copyfileobj(sio, fd, -1)
    return # or simply return sio itself

To find the list of tables we can use the following function:

def find_tables(dump_filename):
    table_list=[]

    with open(dump_filename, 'r') as f:
        for line in f:
            line = line.strip()
            if line.lower().startswith('create table'):
                table_name = re.findall(r'create table `([\w_]+)`', line.lower())
                table_list.extend(table_name)

    return table_list

Then just combine the two, for example in a .py script that you'll run like

python this_script.py mysqldump_name.sql [table_name]

import sys
import os.path

def main():
    try:
        if len(sys.argv) >= 2 and os.path.isfile(sys.argv[1]):
            if len(sys.argv) == 2:
                print('Table name not provided, looking for all tables...')
                table_list = find_tables(sys.argv[1])
                if len(table_list) > 0:
                    print('Found tables: ', str(table_list))
                    for table in table_list:
                        read_dump(sys.argv[1], table)
            elif len(sys.argv) == 3:
                read_dump(sys.argv[1], sys.argv[2])
    except KeyboardInterrupt:
        sys.exit(0)

if __name__ == '__main__':
    main()
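
Once the script has written the per-table .csv files, they can be pulled back into pandas as usual; a minimal sketch reusing find_tables from above (mysqldump_name.sql is the placeholder file name from the usage line):

import pandas as pd

# Load every table the script exported into a dict of DataFrames
tables = find_tables('mysqldump_name.sql')
dfs = {name: pd.read_csv(name + '.csv') for name in tables}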
Tony S.
  • Maybe breaking up your wall of code into segments and explaining what each segment does would help your answer be more easily consumed – firelynx Jul 31 '17 at 11:34
  • I guess you're right, showing the complete script here is more of a complete solution, which will just promote the blind copy-pasting of the code... – Tony S. Jul 31 '17 at 12:05
1

I would like to share my solution to this problem and ask for feedback:

import pandas as pd
import re
import os.path
import csv
import logging
import sys


def convert_dump_to_intermediate_csv(dump_filename, csv_header, csv_out_put_file, delete_csv_file_after_read=True):
    """
    :param dump_filename: path to a MySQL export dump (mysqldump... syntax)
    :param csv_header: the very first line which should appear in the csv file, given as a comma-separated string
    :param csv_out_put_file: the name of the intermediate csv file
    :param delete_csv_file_after_read: if set to False, the intermediate csv file is kept after reading.
    :return: returns a pandas dataframe for further analysis.
    """
    with open(dump_filename, 'r') as f:
        for line in f:
            pre_compiled_all_values_per_line = re.compile(r'(?:INSERT\sINTO\s\S[a-z\S]+\sVALUES\s+)(?P<values>.*)(?=\;)')
            result = pre_compiled_all_values_per_line.finditer(line)
            for element in result:
                values_only = element.groups('values')[0]
                value_compile = re.compile(r'\(.*?\)')
                all_identified = value_compile.finditer(values_only)
                for single_values in all_identified:
                    string_to_split = single_values.group(0)[1:-1]
                    string_array = string_to_split.split(",")

                    if not os.path.exists(csv_out_put_file):
                        with open(csv_out_put_file, 'w', newline='') as file:
                            writer = csv.writer(file)
                            writer.writerow(csv_header.split(","))
                            writer.writerow(string_array)
                    else:
                        with open(csv_out_put_file, 'a', newline='') as file:
                            writer = csv.writer(file)
                            writer.writerow(string_array)
    df = pd.read_csv(csv_out_put_file)
    if delete_csv_file_after_read:
        os.remove(csv_out_put_file)
    return df


if __name__ == "__main__":
    log_name = 'test.log'
    LOGGER = logging.getLogger(log_name)
    LOGGER.setLevel(logging.DEBUG)
    LOGGER.addHandler(logging.NullHandler())
    FORMATTER = logging.Formatter(
        fmt='%(asctime)s %(levelname)-8s %(message)s',
        datefmt='%Y-%m-%d %H:%M:%S')
    SCREEN_HANDLER = logging.StreamHandler(stream=sys.stdout)
    SCREEN_HANDLER.setFormatter(FORMATTER)
    LOGGER.addHandler(SCREEN_HANDLER)

    dump_filename = 'test_sql.sql'
    header_of_csv_file = "A,B,C,D,E,F,G,H,I" # i did not identify the columns in the table definition...
    csv_output_file = 'test.csv'
    pandas_df = convert_dump_to_intermediate_csv(dump_filename, header_of_csv_file, csv_output_file, delete_csv_file_after_read=False)
    LOGGER.debug(pandas_df)

Of course, the logger part can be removed. EDIT: I did not see that this topic is that old. I am sorry for this.

1

I was working locally on a computer with no database connection. The normal flow for my work was to be given a .tsv

Try the mysqltotsv PyPI module:

pip3 install --user mysqltotsv
python3 mysql-to-tsv.py --file dump.sql --outdir out1

This will produce multiple .tsv files in the out1 directory (one .tsv file for each table found in the MySQL dump). And from there on you can continue your normal workflow with Pandas by loading the TSV files.
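
Loading one of the generated files back into Pandas is then a single read_csv call with a tab separator; a minimal sketch (the path assumes a food_min table like the one in the sample dump and is only a placeholder):

import pandas as pd

# Placeholder path to one of the .tsv files produced in out1/
df_food_min = pd.read_csv('out1/food_min.tsv', sep='\t')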

wsdookadr