336

In order to test some functionality I would like to create a DataFrame from a string. Let's say my test data looks like:

TESTDATA="""col1;col2;col3
1;4.4;99
2;4.5;200
3;4.7;65
4;3.2;140
"""

What is the simplest way to read that data into a Pandas DataFrame?

Acumenus
  • 41,481
  • 14
  • 116
  • 107
Emil H
  • 16,495
  • 3
  • 39
  • 62

6 Answers6

601

A simple way to do this is to use StringIO.StringIO (python2) or io.StringIO (python3) and pass that to the pandas.read_csv function. E.g:

import sys
if sys.version_info[0] < 3: 
    from StringIO import StringIO
else:
    from io import StringIO

import pandas as pd

TESTDATA = StringIO("""col1;col2;col3
    1;4.4;99
    2;4.5;200
    3;4.7;65
    4;3.2;140
    """)

df = pd.read_csv(TESTDATA, sep=";")
Gustavo Lopes
  • 2,667
  • 1
  • 15
  • 40
Emil H
  • 16,495
  • 3
  • 39
  • 62
  • 16
    Iff you need code that is compatible with both Python 2 and 3, you can optionally also use `from pandas.compat import StringIO`, noting that it's the same class as the one that comes with Python. – Acumenus Sep 27 '17 at 17:32
  • 4
    FYI - `pd.read_table()` is an equivalent function, just slightly better nomenclature: `df = pd.read_table(TESTDATA, sep=";")`. – wkzhu Dec 06 '17 at 23:17
  • 7
    @AntonvBR Noted that one could use `pandas.compat.StringIO`. That way we don't have to import `StringIO` separately. However the `pandas.compat` package is considered private according to http://pandas.pydata.org/pandas-docs/stable/api.html?highlight=compat so leaving the answer as is for now. – Emil H Dec 12 '17 at 06:04
  • Time to sort out which import: [Should we use pandas.compat.StringIO or Python 2/3 StringIO?](https://stackoverflow.com/questions/50283292/should-we-use-pandas-compat-stringio-or-python-2-3-stringio) – smci May 11 '18 at 00:30
  • If you create TESTDATA with `df.to_csv(TESTDATA)`, use `TESTDATA.seek(0)` – user3226167 Oct 24 '19 at 08:30
  • I receive 'Error tokenizing data. C error: Expected 2 fields in line 26, saw 12\n',) – gdm Jul 31 '20 at 08:49
  • This is what I needed to do when uploading CSV files in colab. In order to make a dataframe out of the CSV, I used this method – Joshua Swain Jan 05 '21 at 13:12
  • @gdm The "Expected fields error" is usually due to inconsistent number of fields in each row, or wrong sep/delimiter. – skytaker Mar 26 '21 at 16:38
25

Split Method

data = input_string
df = pd.DataFrame([x.split(';') for x in data.split('\n')])
print(df)
shaurya uppal
  • 2,335
  • 23
  • 28
  • 7
    If you want the first line to be used for column names, change the 2nd line to this: `df = pd.DataFrame([x.split(';') for x in data.split('\n')[1:]], columns=[x for x in data.split('\n')[0].split(';')])` – Mabyn Oct 18 '19 at 01:34
  • 1
    This is wrong, since on CSV files the newline (\n) character can be part of a field. – Antonio Ercole De Luca Apr 03 '20 at 13:13
  • 1
    This is not very robust, and most people would be better with the accepted answer. There is a very partial list of things that can go wrong with this at http://thomasburette.com/blog/2014/05/25/so-you-want-to-write-your-own-CSV-code/ – DanB May 15 '20 at 17:21
17

A quick and easy solution for interactive work is to copy-and-paste the text by loading the data from the clipboard.

Select the content of the string with your mouse:

Copy data for pasting into a Pandas dataframe

In the Python shell use read_clipboard()

>>> pd.read_clipboard()
  col1;col2;col3
0       1;4.4;99
1      2;4.5;200
2       3;4.7;65
3      4;3.2;140

Use the appropriate separator:

>>> pd.read_clipboard(sep=';')
   col1  col2  col3
0     1   4.4    99
1     2   4.5   200
2     3   4.7    65
3     4   3.2   140

>>> df = pd.read_clipboard(sep=';') # save to dataframe
user2314737
  • 21,279
  • 16
  • 81
  • 95
11

In one line, but first import IO

import pandas as pd
import io   

TESTDATA="""col1;col2;col3
1;4.4;99
2;4.5;200
3;4.7;65
4;3.2;140
"""

df = pd.read_csv(  io.StringIO(TESTDATA)  , sep=";")
print ( df )
user3810512
  • 119
  • 1
  • 3
  • What is the difference between this and the accepted answer? Except you move io operation to read_csv, which makes no difference... Please always check if similar answer is not posted already, redundancy is unnecessary. – Ruli Nov 26 '20 at 09:35
8

This answer applies when a string is manually entered, not when it's read from somewhere.

A traditional variable-width CSV is unreadable for storing data as a string variable. Especially for use inside a .py file, consider fixed-width pipe-separated data instead. Various IDEs and editors may have a plugin to format pipe-separated text into a neat table.

Using read_csv

Store the following in a utility module, e.g. util/pandas.py. An example is included in the function's docstring.

import io
import re

import pandas as pd


def read_psv(str_input: str, **kwargs) -> pd.DataFrame:
    """Read a Pandas object from a pipe-separated table contained within a string.

    Input example:
        | int_score | ext_score | eligible |
        |           | 701       | True     |
        | 221.3     | 0         | False    |
        |           | 576       | True     |
        | 300       | 600       | True     |

    The leading and trailing pipes are optional, but if one is present,
    so must be the other.

    `kwargs` are passed to `read_csv`. They must not include `sep`.

    In PyCharm, the "Pipe Table Formatter" plugin has a "Format" feature that can 
    be used to neatly format a table.

    Ref: https://stackoverflow.com/a/46471952/
    """

    substitutions = [
        ('^ *', ''),  # Remove leading spaces
        (' *$', ''),  # Remove trailing spaces
        (r' *\| *', '|'),  # Remove spaces between columns
    ]
    if all(line.lstrip().startswith('|') and line.rstrip().endswith('|') for line in str_input.strip().split('\n')):
        substitutions.extend([
            (r'^\|', ''),  # Remove redundant leading delimiter
            (r'\|$', ''),  # Remove redundant trailing delimiter
        ])
    for pattern, replacement in substitutions:
        str_input = re.sub(pattern, replacement, str_input, flags=re.MULTILINE)
    return pd.read_csv(io.StringIO(str_input), sep='|', **kwargs)

Non-working alternatives

The code below doesn't work properly because it adds an empty column on both the left and right sides.

df = pd.read_csv(io.StringIO(df_str), sep=r'\s*\|\s*', engine='python')

As for read_fwf, it doesn't actually use so many of the optional kwargs that read_csv accepts and uses. As such, it shouldn't be used at all for pipe-separated data.

Acumenus
  • 41,481
  • 14
  • 116
  • 107
  • 1
    I found (by trial&error) that `read_fwf` takes more of `read_csv`s arguments than is documented, but it's true that [some have no effect](https://stackoverflow.com/q/59757478/974555). – gerrit Jan 20 '20 at 16:12
-3

Simplest way is to save it to temp file and then read it:

import pandas as pd

CSV_FILE_NAME = 'temp_file.csv'  # Consider creating temp file, look URL below
with open(CSV_FILE_NAME, 'w') as outfile:
    outfile.write(TESTDATA)
df = pd.read_csv(CSV_FILE_NAME, sep=';')

Right way of creating temp file: How can I create a tmp file in Python?

QtRoS
  • 1,017
  • 1
  • 16
  • 22