Pandas read_csv from url

Question

I'm trying to read a CSV file from a given URL, using Python 3.x:

import pandas as pd
import requests

url = "https://github.com/cs109/2014_data/blob/master/countries.csv"
s = requests.get(url).content
c = pd.read_csv(s)

I have the following error

"Expected file path name or file-like object, got <class 'bytes'> type"

How can I fix this? I'm using Python 3.4

You would need something like `c=pd.read_csv(io.StringIO(s.decode("utf-8")))` but you are getting html back not a csv file so it is not going to work — Padraic Cunningham, Sep 04 '15 at 14:49
I'm fairly certain the URL you want is `"https://raw.github.com/cs109/2014_data/blob/master/countries.csv"`. — kylieCatt, Sep 04 '15 at 14:52
Since the issue was with `pandas.read_csv()`, not Python, you should have stated the pandas version too. Given that [Python 3.4 was released in 2014](https://www.python.org/downloads/release/python-340/), you were likely running [pandas 0.12 .. 0.15](https://github.com/pandas-dev/pandas/releases?after=v0.15.2) — smci, Jan 31 '21 at 04:01

score 264 · Answer 1 · answered Jan 26 '17 at 18:34

In the latest version of pandas (0.19.2) you can directly pass the url

import pandas as pd

url="https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
c=pd.read_csv(url)

inodb

It seems that passing the URL directly, instead of going through requests, does not use [requests-cache](https://pypi.python.org/pypi/requests-cache) even when it is enabled – Shadi Sep 11 '17 at 10:23
That code returns `urllib.error.URLError: ` because of the https protocol which urllib cannot handle. – multigoodverse Feb 13 '18 at 16:00
For those using Python 2, you will have to use Python 2.7.10+. – avelis Oct 30 '18 at 03:54
There seems to be some issue reading csv from a URL. I read the file once from a local storage and once from URL, I kept getting errors from URL. I then enabled error_bad_lines=False and more than 99% of data was ignored. The URL is [link](https://www.kaggle.com/c/digit-recognizer/download/train.csv). Once I read the file, the shape of the dataset was found to be (88,1), which is completely wrong – Rishik Mani Nov 12 '18 at 19:09
It doesn't seem to work well; I got a urlopen error: `` – ShinNShirley Aug 19 '20 at 07:15
I installed certificate following https://stackoverflow.com/questions/52805115/certificate-verify-failed-unable-to-get-local-issuer-certificate, then pd.read_csv(url) works for me. – Emily Nov 11 '20 at 20:00
It doesn't work for me either – MSB Dec 06 '20 at 22:59

score 206 · Accepted Answer · edited Jan 31 '21 at 03:54

UPDATE: From pandas 0.19.2 you can now just pass read_csv() the url directly, although that will fail if it requires authentication.

For older pandas versions, or if you need authentication or more control over HTTP error handling:

Use pandas.read_csv with a file-like object as the first argument.

If you want to read the csv from a string, you can use io.StringIO.
For the URL https://github.com/cs109/2014_data/blob/master/countries.csv, you get an HTML response, not raw CSV. Use the URL given by the Raw link on the GitHub page instead, which is https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv

Example:

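import io

import pandas as pd
import requests

url = "https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
s = requests.get(url).content  # raw response body as bytes
# Decode the bytes to str and wrap them in a file-like object for read_csv
c = pd.read_csv(io.StringIO(s.decode("utf-8")))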

Notes:

In Python 2.x, the string-buffer object was StringIO.StringIO.
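
The fault-tolerance case mentioned above can be sketched roughly as follows. This is a minimal illustration, assuming the server returns UTF-8 CSV; the names `parse_csv_bytes` and `fetch_csv` and the 10-second default timeout are hypothetical choices, not part of the original answer.

```python
import io

import pandas as pd
import requests


def parse_csv_bytes(payload: bytes, encoding: str = "utf-8") -> pd.DataFrame:
    """Parse raw CSV bytes; BytesIO avoids building a separate decoded str."""
    return pd.read_csv(io.BytesIO(payload), encoding=encoding)


def fetch_csv(url: str, timeout: float = 10.0) -> pd.DataFrame:
    """Download a CSV over HTTP and parse it (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    return parse_csv_bytes(response.content)
```

Calling `fetch_csv` with the raw.githubusercontent.com URL from the example should then give the same frame as the code above. If the endpoint needs credentials, `requests.get` also accepts an `auth=` argument.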

smci

answered Sep 04 '15 at 14:50

Anand S Kumar

What if the response is large and I want to stream it instead of consuming memory for the encoded content, decoded content and the StringIO object? – akaihola Oct 04 '16 at 06:00
In the latest version of pandas you can give the url directly i.e. `c=pd.read_csv(url)` – inodb Jan 26 '17 at 18:29
Curiously I have a newer version of `pandas` (0.23.4), but I could not give url directly. This answer helped me get that working. – Antti Jan 11 '19 at 14:55
"Update From pandas 0.19.2 you can now just pass the url directly." Unless you can't because you need to pass authentication arguments, in which case the original example is much needed. – Aaron Hall Jul 12 '19 at 17:49
This solution is still valuable if you need better error handling based on the HTTP status codes returned by the requests response (e.g. 500 -> a retry may be needed, 404 -> no retry) – JulienV Feb 18 '20 at 11:00
This seems to put all columns in one column for this url: https://www.ebi.ac.uk/Tools/services/rest/clustalo/result/clustalo-I20201101-053806-0987-43608676-p1m/pim – mLstudent33 Nov 01 '20 at 05:55
This allows you to specify a timeout in requests.get, which one should always set in production code – fmalina May 11 '21 at 15:18

Padraic Cunningham · Answer 3 · 2015-09-04T15:20:06.073

As I commented, you need to use a StringIO object and decode, i.e. `c = pd.read_csv(io.StringIO(s.decode("utf-8")))`. If you use requests, you need to decode because `.content` returns bytes. If you used `.text` instead, you would only need to wrap `s` in StringIO: `s = requests.get(url).text` followed by `c = pd.read_csv(StringIO(s))`.
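
The difference between `.content` (bytes) and `.text` (str) can be tried without a network call. In this sketch the `payload` bytes are made-up sample data standing in for a response body; both routes should produce the same frame.

```python
import io

import pandas as pd

# Stand-in for requests.get(url).content, which is bytes (sample data, not a real response).
payload = b"Country,Region\nAlgeria,AFRICA\nAngola,AFRICA\n"

# Route 1: decode the bytes yourself, then wrap the str in StringIO.
from_content = pd.read_csv(io.StringIO(payload.decode("utf-8")))

# Route 2: .text is already a str, so only the StringIO wrapper is needed.
text = payload.decode("utf-8")  # stand-in for requests.get(url).text
from_text = pd.read_csv(io.StringIO(text))

print(from_content.equals(from_text))
```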

A simpler approach is to pass the correct URL of the raw data directly to read_csv. You don't have to pass a file-like object; you can pass a URL, so you don't need requests at all:

c = pd.read_csv("https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv")

print(c)

Output:

                              Country         Region
0                             Algeria         AFRICA
1                              Angola         AFRICA
2                               Benin         AFRICA
3                            Botswana         AFRICA
4                             Burkina         AFRICA
5                             Burundi         AFRICA
6                            Cameroon         AFRICA
..................................

From the docs:

filepath_or_buffer : string or file handle / StringIO

The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/table.csv
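
The `file` scheme from the quoted docs can be tried locally. The following is a small sketch, assuming a writable temporary directory and a pandas version that accepts file URLs; the file name `table.csv` and its contents are made up for illustration, and the URL here has an empty host rather than `localhost`.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "table.csv"
    csv_path.write_text("Country,Region\nAlgeria,AFRICA\n", encoding="utf-8")

    # Path.as_uri() builds a file:// URL from the absolute path.
    file_url = csv_path.as_uri()
    c = pd.read_csv(file_url)

print(c)
```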

You can feed the url directly to pandas read_csv! of course! that's a much simpler solution than the one I found! :D — PabTorre, Sep 04 '15 at 15:19
@pabtorre, yep , an example of why reading the docs is a good idea. — Padraic Cunningham, Sep 04 '15 at 15:21
That works, in my case though ,I need to set the param `sep` of function `pd.read_csv`, such as : `pd.read_csv(StringIO(s), sep='\t')` . If I use the default setting `sep=None` , it'll raise an error`Error tokenizing data. C error: Expected 1 fields in line 6, saw 5` — ShinNShirley, Aug 19 '20 at 07:21
Why do I still get just one column for this url? https://www.ebi.ac.uk/Tools/services/rest/clustalo/result/clustalo-I20201101-053806-0987-43608676-p1m/pim — mLstudent33, Nov 01 '20 at 05:57
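
The `sep` and single-column comments above describe the same symptom: when the data is not comma-separated, the default separator either leaves everything in one column or fails to tokenize. A small sketch with made-up tab-separated data (not the URLs from the comments) shows two possible ways around it.

```python
import io

import pandas as pd

# Made-up tab-separated sample standing in for a downloaded response body.
s = "Country\tRegion\nAlgeria\tAFRICA\nAngola\tAFRICA\n"

# With the default comma separator each line becomes a single field.
one_column = pd.read_csv(io.StringIO(s))

# Option 1: name the separator explicitly.
explicit = pd.read_csv(io.StringIO(s), sep="\t")

# Option 2: let the Python engine sniff the separator from the first row.
sniffed = pd.read_csv(io.StringIO(s), sep=None, engine="python")

print(one_column.shape, explicit.shape, sniffed.shape)
```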

score 10 · Answer 4 · answered Sep 04 '15 at 15:18

The problem you're having is that the output you get in the variable 's' is not a CSV, but an HTML file. In order to get the raw CSV, you have to modify the URL to:

'https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv'
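
That URL change can be written as a small function. This is only a sketch: `github_blob_to_raw` is a hypothetical helper, and it assumes the simple `https://github.com/<user>/<repo>/blob/<branch>/<path>` layout of the URL in the question.

```python
def github_blob_to_raw(url: str) -> str:
    """Turn a github.com '/blob/' page URL into its raw.githubusercontent.com form.

    Hypothetical helper; it only handles the simple URL layout shown above.
    """
    prefix = "https://github.com/"
    if not url.startswith(prefix) or "/blob/" not in url:
        raise ValueError(f"not a GitHub blob URL: {url}")
    # Drop the first '/blob/' segment and swap the host.
    path = url[len(prefix):].replace("/blob/", "/", 1)
    return "https://raw.githubusercontent.com/" + path


print(github_blob_to_raw("https://github.com/cs109/2014_data/blob/master/countries.csv"))
```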

Your second problem is that read_csv expects a file path or a file-like object; we can solve this by using StringIO from the io module. The third problem is that requests.get(url).content delivers a byte stream; we can solve this by using requests.get(url).text instead.

End result is this code:

from io import StringIO

import pandas as pd
import requests
url='https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv'
s=requests.get(url).text

c=pd.read_csv(StringIO(s))

output:

>>> c.head()
    Country  Region
0   Algeria  AFRICA
1    Angola  AFRICA
2     Benin  AFRICA
3  Botswana  AFRICA
4   Burkina  AFRICA

score 2 · Answer 5 · answered Jan 21 '20 at 08:35

url = "https://github.com/cs109/2014_data/blob/master/countries.csv"
c = pd.read_csv(url, sep = "\t")

Gursimran Singh

Please provide explanation how your solution works. – Selim Yildiz Jan 21 '20 at 08:42
This may raise a URL error: `urlopen error [Errno 11004] getaddrinfo failed` – ShinNShirley Aug 19 '20 at 07:11

score 0 · Answer 6 · answered Nov 25 '19 at 03:25

To import data from a URL in pandas, the following simple code works well:

import pandas as pd
# read_table defaults to a tab separator, so pass sep="," for a CSV file
train = pd.read_table("https://urlandfile.com/dataset.csv", sep=",")
train.head()

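If the URL string contains backslashes that Python might interpret as escape sequences, put 'r' before it to make it a raw string:

import pandas as pd
# The r prefix only changes how Python reads the string literal, not the data
train = pd.read_table(r"https://urlandfile.com/dataset.csv", sep=",")
train.head()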
