Create Pandas DataFrame from a string

Question

In order to test some functionality I would like to create a DataFrame from a string. Let's say my test data looks like:

TESTDATA="""col1;col2;col3
1;4.4;99
2;4.5;200
3;4.7;65
4;3.2;140
"""

What is the simplest way to read that data into a Pandas DataFrame?

score 601 · Accepted Answer · edited Jun 17 '19 at 14:43

601

A simple way to do this is to use StringIO.StringIO (python2) or io.StringIO (python3) and pass that to the pandas.read_csv function. E.g:

import sys
if sys.version_info[0] < 3: 
    from StringIO import StringIO
else:
    from io import StringIO

import pandas as pd

TESTDATA = StringIO("""col1;col2;col3
    1;4.4;99
    2;4.5;200
    3;4.7;65
    4;3.2;140
    """)

df = pd.read_csv(TESTDATA, sep=";")

edited Jun 17 '19 at 14:43

Gustavo Lopes

2,667
1
15
40

answered Mar 24 '14 at 09:21

Emil H

16,495
3
39
62

16

Iff you need code that is compatible with both Python 2 and 3, you can optionally also use `from pandas.compat import StringIO`, noting that it's the same class as the one that comes with Python. – Acumenus Sep 27 '17 at 17:32
4

FYI - `pd.read_table()` is an equivalent function, just slightly better nomenclature: `df = pd.read_table(TESTDATA, sep=";")`. – wkzhu Dec 06 '17 at 23:17
7

@AntonvBR Noted that one could use `pandas.compat.StringIO`. That way we don't have to import `StringIO` separately. However the `pandas.compat` package is considered private according to http://pandas.pydata.org/pandas-docs/stable/api.html?highlight=compat so leaving the answer as is for now. – Emil H Dec 12 '17 at 06:04
Time to sort out which import: [Should we use pandas.compat.StringIO or Python 2/3 StringIO?](https://stackoverflow.com/questions/50283292/should-we-use-pandas-compat-stringio-or-python-2-3-stringio) – smci May 11 '18 at 00:30
If you create TESTDATA with `df.to_csv(TESTDATA)`, use `TESTDATA.seek(0)` – user3226167 Oct 24 '19 at 08:30
I receive 'Error tokenizing data. C error: Expected 2 fields in line 26, saw 12\n',) – gdm Jul 31 '20 at 08:49
This is what I needed to do when uploading CSV files in colab. In order to make a dataframe out of the CSV, I used this method – Joshua Swain Jan 05 '21 at 13:12
@gdm The "Expected fields error" is usually due to inconsistent number of fields in each row, or wrong sep/delimiter. – skytaker Mar 26 '21 at 16:38

shaurya uppal · Answer 2 · 2019-04-14T02:19:25.633

25

Split Method

data = input_string
df = pd.DataFrame([x.split(';') for x in data.split('\n')])
print(df)

edited Apr 14 '19 at 02:19

answered Feb 21 '19 at 17:27

shaurya uppal

2,335
23
28

7

If you want the first line to be used for column names, change the 2nd line to this: `df = pd.DataFrame([x.split(';') for x in data.split('\n')[1:]], columns=[x for x in data.split('\n')[0].split(';')])` – Mabyn Oct 18 '19 at 01:34
1

This is wrong, since on CSV files the newline (\n) character can be part of a field. – Antonio Ercole De Luca Apr 03 '20 at 13:13
1

This is not very robust, and most people would be better with the accepted answer. There is a very partial list of things that can go wrong with this at http://thomasburette.com/blog/2014/05/25/so-you-want-to-write-your-own-CSV-code/ – DanB May 15 '20 at 17:21

score 17 · Answer 3 · answered Nov 30 '18 at 08:17

A quick and easy solution for interactive work is to copy-and-paste the text by loading the data from the clipboard.

Select the content of the string with your mouse:

In the Python shell use read_clipboard()

>>> pd.read_clipboard()
  col1;col2;col3
0       1;4.4;99
1      2;4.5;200
2       3;4.7;65
3      4;3.2;140

Use the appropriate separator:

>>> pd.read_clipboard(sep=';')
   col1  col2  col3
0     1   4.4    99
1     2   4.5   200
2     3   4.7    65
3     4   3.2   140

>>> df = pd.read_clipboard(sep=';') # save to dataframe

Not good for reproducibility, but otherwise a pretty neat solution! — Mabyn, Oct 18 '19 at 01:27

score 11 · Answer 4 · answered Nov 26 '20 at 09:10

11

In one line, but first import IO

import pandas as pd
import io   

TESTDATA="""col1;col2;col3
1;4.4;99
2;4.5;200
3;4.7;65
4;3.2;140
"""

df = pd.read_csv(  io.StringIO(TESTDATA)  , sep=";")
print ( df )

answered Nov 26 '20 at 09:10

user3810512

119
1
3

What is the difference between this and the accepted answer? Except you move io operation to read_csv, which makes no difference... Please always check if similar answer is not posted already, redundancy is unnecessary. – Ruli Nov 26 '20 at 09:35

Acumenus · Answer 5 · 2020-06-24T22:02:42.070

This answer applies when a string is manually entered, not when it's read from somewhere.

A traditional variable-width CSV is unreadable for storing data as a string variable. Especially for use inside a .py file, consider fixed-width pipe-separated data instead. Various IDEs and editors may have a plugin to format pipe-separated text into a neat table.

Using `read_csv`

Store the following in a utility module, e.g. util/pandas.py. An example is included in the function's docstring.

import io
import re

import pandas as pd


def read_psv(str_input: str, **kwargs) -> pd.DataFrame:
    """Read a Pandas object from a pipe-separated table contained within a string.

    Input example:
        | int_score | ext_score | eligible |
        |           | 701       | True     |
        | 221.3     | 0         | False    |
        |           | 576       | True     |
        | 300       | 600       | True     |

    The leading and trailing pipes are optional, but if one is present,
    so must be the other.

    `kwargs` are passed to `read_csv`. They must not include `sep`.

    In PyCharm, the "Pipe Table Formatter" plugin has a "Format" feature that can 
    be used to neatly format a table.

    Ref: https://stackoverflow.com/a/46471952/
    """

    substitutions = [
        ('^ *', ''),  # Remove leading spaces
        (' *$', ''),  # Remove trailing spaces
        (r' *\| *', '|'),  # Remove spaces between columns
    ]
    if all(line.lstrip().startswith('|') and line.rstrip().endswith('|') for line in str_input.strip().split('\n')):
        substitutions.extend([
            (r'^\|', ''),  # Remove redundant leading delimiter
            (r'\|$', ''),  # Remove redundant trailing delimiter
        ])
    for pattern, replacement in substitutions:
        str_input = re.sub(pattern, replacement, str_input, flags=re.MULTILINE)
    return pd.read_csv(io.StringIO(str_input), sep='|', **kwargs)

Non-working alternatives

The code below doesn't work properly because it adds an empty column on both the left and right sides.

df = pd.read_csv(io.StringIO(df_str), sep=r'\s*\|\s*', engine='python')

As for read_fwf, it doesn't actually use so many of the optional kwargs that read_csv accepts and uses. As such, it shouldn't be used at all for pipe-separated data.

I found (by trial&error) that `read_fwf` takes more of `read_csv`s arguments than is documented, but it's true that [some have no effect](https://stackoverflow.com/q/59757478/974555). — gerrit, Jan 20 '20 at 16:12

score -3 · Answer 6 · answered Jan 31 '20 at 12:42

-3

Simplest way is to save it to temp file and then read it:

import pandas as pd

CSV_FILE_NAME = 'temp_file.csv'  # Consider creating temp file, look URL below
with open(CSV_FILE_NAME, 'w') as outfile:
    outfile.write(TESTDATA)
df = pd.read_csv(CSV_FILE_NAME, sep=';')

Right way of creating temp file: How can I create a tmp file in Python?

answered Jan 31 '20 at 12:42

QtRoS

1,017
1
16
22

what if there's no permission to create file? – BingLi224 Mar 01 '20 at 04:41
In my opinion it is not simplest case anymore. Note that "simplest" is stated explicitly in the question. – QtRoS Mar 03 '20 at 09:04

Create Pandas DataFrame from a string

6 Answers6

Using `read_csv`

Non-working alternatives

Linked

Related

Create Pandas DataFrame from a string

6 Answers6

Using read_csv

Non-working alternatives

Linked

Related

Using `read_csv`