171

I'm trying to read a CSV file from a given URL, using Python 3.x:

import pandas as pd
import requests

url = "https://github.com/cs109/2014_data/blob/master/countries.csv"
s = requests.get(url).content
c = pd.read_csv(s)

I get the following error:

"Expected file path name or file-like object, got <class 'bytes'> type"

How can I fix this? I'm using Python 3.4

smci
venom
  • You would need something like `c=pd.read_csv(io.StringIO(s.decode("utf-8")))` but you are getting html back not a csv file so it is not going to work – Padraic Cunningham Sep 04 '15 at 14:49
  • 4
    I'm fairly certain the URL you want is `"https://raw.github.com/cs109/2014_data/blob/master/countries.csv"`. – kylieCatt Sep 04 '15 at 14:52
  • @venom, chose more popular answer as the right one – ibodi Oct 11 '19 at 15:58
  • Since the issue was with `pandas.read_csv()` rather than Python, you should have stated the pandas version too. Given that [Python 3.4 was released in 2014](https://www.python.org/downloads/release/python-340/), you were likely running [pandas 0.12 .. 0.15](https://github.com/pandas-dev/pandas/releases?after=v0.15.2) – smci Jan 31 '21 at 04:01

6 Answers

264

In the latest version of pandas (0.19.2) you can pass the URL directly:

import pandas as pd

url="https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
c=pd.read_csv(url)
inodb
  • it seems that using this directly instead of requests directly does not use [requests-cache](https://pypi.python.org/pypi/requests-cache) even if used – Shadi Sep 11 '17 at 10:23
  • 5
    That code returns `urllib.error.URLError: ` because of the https protocol which urllib cannot handle. – multigoodverse Feb 13 '18 at 16:00
  • For those using Python 2, you will have to use Python 2.7.10+. – avelis Oct 30 '18 at 03:54
  • There seems to be some issue reading csv from a URL. I read the file once from a local storage and once from URL, I kept getting errors from URL. I then enabled error_bad_lines=False and more than 99% of data was ignored. The URL is [link](https://www.kaggle.com/c/digit-recognizer/download/train.csv). Once I read the file, the shape of the dataset was found to be (88,1), which is completely wrong – Rishik Mani Nov 12 '18 at 19:09
  • It does not seem to work well; I got a urlopen error:`` – ShinNShirley Aug 19 '20 at 07:15
  • I installed certificate following https://stackoverflow.com/questions/52805115/certificate-verify-failed-unable-to-get-local-issuer-certificate, then pd.read_csv(url) works for me. – Emily Nov 11 '20 at 20:00
  • It doesn't work for me either – MSB Dec 06 '20 at 22:59
206

UPDATE: From pandas 0.19.2 you can now just pass read_csv() the url directly, although that will fail if it requires authentication.


For older pandas versions, or if you need authentication, or for any other reason that calls for HTTP fault tolerance:

Use pandas.read_csv with a file-like object as the first argument.

  • If you want to read the csv from a string, you can use io.StringIO.

  • For the URL https://github.com/cs109/2014_data/blob/master/countries.csv, you get an HTML response, not raw CSV. To get the raw CSV, use the URL given by the Raw link on the GitHub page, which is https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv

Example:

import pandas as pd
import io
import requests
url="https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
s=requests.get(url).content
c=pd.read_csv(io.StringIO(s.decode('utf-8')))

Notes:

In Python 2.x, the string-buffer object was StringIO.StringIO.
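The approach above can be made a little more defensive. The following is a minimal sketch, not a definitive implementation: the helper names `csv_bytes_to_frame` and `fetch_csv`, the 10-second timeout, and the sample bytes are assumptions chosen for illustration. It separates downloading from parsing so that the parsing step can be checked without network access.

```python
import io

import pandas as pd
import requests


def csv_bytes_to_frame(content: bytes, encoding: str = "utf-8") -> pd.DataFrame:
    """Decode raw response bytes and parse the resulting text as CSV."""
    return pd.read_csv(io.StringIO(content.decode(encoding)))


def fetch_csv(url: str, timeout: float = 10.0) -> pd.DataFrame:
    """Download a CSV over HTTP and parse it (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    return csv_bytes_to_frame(response.content)


# Offline check of the parsing step, using made-up sample bytes.
sample = b"Country,Region\nAlgeria,AFRICA\nAngola,AFRICA\n"
frame = csv_bytes_to_frame(sample)
print(frame.shape)
```

With network access, `fetch_csv("https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv")` would be the equivalent of the example above, with a timeout and a status check added.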

smci
Anand S Kumar
  • What if the response is large and I want to stream it instead of consuming memory for the encoded content, decoded content and the StringIO object? – akaihola Oct 04 '16 at 06:00
  • 13
    In the latest version of pandas you can give the url directly i.e. `c=pd.read_csv(url)` – inodb Jan 26 '17 at 18:29
  • Curiously I have a newer version of `pandas` (0.23.4), but I could not give url directly. This answer helped me get that working. – Antti Jan 11 '19 at 14:55
  • 3
    "Update From pandas 0.19.2 you can now just pass the url directly." Unless you can't because you need to pass authentication arguments, in which case the original example is much needed. – Aaron Hall Jul 12 '19 at 17:49
  • This solution is still valuable if you need better error handling based on the HTTP status codes returned by the requests object (e.g. 500 -> a retry may be needed, 404 -> no retry) – JulienV Feb 18 '20 at 11:00
  • This seems to put all columns in one column for this url: https://www.ebi.ac.uk/Tools/services/rest/clustalo/result/clustalo-I20201101-053806-0987-43608676-p1m/pim – mLstudent33 Nov 01 '20 at 05:55
  • This allows you to specify a timeout in requests.get, which one should always set in production code – fmalina May 11 '21 at 15:18
15

As I commented, you need to use a StringIO object and decode, i.e. `c = pd.read_csv(io.StringIO(s.decode("utf-8")))`. With requests you need to decode because `.content` returns bytes. If you used `.text` instead, you could pass `s` as is: `s = requests.get(url).text`, then `c = pd.read_csv(io.StringIO(s))`.

A simpler approach is to pass the correct URL of the raw data directly to read_csv. You don't have to pass a file-like object; you can pass a URL, so you don't need requests at all:

c = pd.read_csv("https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv")

print(c)

Output:

                              Country         Region
0                             Algeria         AFRICA
1                              Angola         AFRICA
2                               Benin         AFRICA
3                            Botswana         AFRICA
4                             Burkina         AFRICA
5                             Burundi         AFRICA
6                            Cameroon         AFRICA
..................................

From the docs:

filepath_or_buffer :

string or file handle / StringIO. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/table.csv
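As an illustration of the `file` scheme mentioned in the docs excerpt, here is a small sketch that writes a temporary CSV and reads it back through a file URL. The file name `table.csv` and its contents are made up for the example, and `Path.as_uri()` is just one convenient way to build such a URL; it produces a URL with an empty host rather than `localhost`.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "table.csv"
    csv_path.write_text("Country,Region\nAlgeria,AFRICA\n", encoding="utf-8")

    # Path.as_uri() requires an absolute path and yields a file:// URL.
    file_url = csv_path.resolve().as_uri()
    # Read while the temporary file still exists.
    local = pd.read_csv(file_url)

print(local)
```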

Padraic Cunningham
  • 1
    You can feed the url directly to pandas read_csv! of course! that's a much simpler solution than the one I found! :D – PabTorre Sep 04 '15 at 15:19
  • 1
    @pabtorre, yep , an example of why reading the docs is a good idea. – Padraic Cunningham Sep 04 '15 at 15:21
  • That works, in my case though ,I need to set the param `sep` of function `pd.read_csv`, such as : `pd.read_csv(StringIO(s), sep='\t')` . If I use the default setting `sep=None` , it'll raise an error`Error tokenizing data. C error: Expected 1 fields in line 6, saw 5` – ShinNShirley Aug 19 '20 at 07:21
  • Why do I still get just one column for this url? https://www.ebi.ac.uk/Tools/services/rest/clustalo/result/clustalo-I20201101-053806-0987-43608676-p1m/pim – mLstudent33 Nov 01 '20 at 05:57
10

The problem you're having is that the output you get into the variable 's' is not a CSV file but an HTML page. To get the raw CSV, you have to change the URL to:

'https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv'
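The URL change above can be sketched as a small helper. This is an illustration only: `to_raw_github_url` is a hypothetical name, and the function assumes the page URL has the shape `github.com/<user>/<repo>/blob/<ref>/<path>`; anything else is returned unchanged.

```python
from urllib.parse import urlsplit, urlunsplit


def to_raw_github_url(url: str) -> str:
    """Rewrite a github.com 'blob' page URL into its raw-content URL.

    Hypothetical helper; URLs that do not match the expected shape
    are returned unchanged.
    """
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    # Expected shape: <user>/<repo>/blob/<ref>/<path...>
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        return url
    user, repo, _blob, *rest = segments
    raw_path = "/" + "/".join([user, repo, *rest])
    return urlunsplit((parts.scheme, "raw.githubusercontent.com", raw_path, "", ""))


page_url = "https://github.com/cs109/2014_data/blob/master/countries.csv"
raw_url = to_raw_github_url(page_url)
print(raw_url)
```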

Your second problem is that read_csv expects a file path or a file-like object, not the file contents; we can solve this by wrapping the contents in StringIO from the io module. The third problem is that requests.get(url).content returns bytes; we can solve this by using requests.get(url).text instead, which returns a decoded string.

End result is this code:

from io import StringIO

import pandas as pd
import requests
url='https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv'
s=requests.get(url).text

c=pd.read_csv(StringIO(s))

output:

>>> c.head()
    Country  Region
0   Algeria  AFRICA
1    Angola  AFRICA
2     Benin  AFRICA
3  Botswana  AFRICA
4   Burkina  AFRICA
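Since the original failure came from receiving an HTML page instead of CSV, a quick sanity check before parsing can help. The following is a rough heuristic sketch, not a robust detector: `looks_like_html` is a hypothetical helper and the two sample bodies are made up. Inspecting the response's Content-Type header would be another option.

```python
def looks_like_html(body: str) -> bool:
    """Heuristic: does the text start like an HTML document?"""
    head = body.lstrip()[:200].lower()
    return head.startswith("<!doctype html") or head.startswith("<html")


# Made-up sample bodies standing in for real responses.
html_body = "<!DOCTYPE html>\n<html><head><title>countries.csv</title></head></html>"
csv_body = "Country,Region\nAlgeria,AFRICA\n"

print(looks_like_html(html_body), looks_like_html(csv_body))
```

A caller could run this check on `requests.get(url).text` and raise an error before handing the text to `pd.read_csv`.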
PabTorre
2
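import pandas as pd

# Note: this blob URL returns GitHub's HTML page, not the raw CSV (see the answers above).
url = "https://github.com/cs109/2014_data/blob/master/countries.csv"
c = pd.read_csv(url, sep="\t")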
0

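To import data from a URL in pandas, you can also use read_table. It defaults to a tab separator, so pass sep="," for a CSV file:

import pandas as pd
# read_table splits on tabs by default; sep="," is needed for comma-separated data
train = pd.read_table("https://urlandfile.com/dataset.csv", sep=",")
train.head()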

Prefixing the URL with r makes it a raw string literal. This only changes how backslashes in the literal are interpreted, so it has no effect on a typical URL:

import pandas as pd
train = pd.read_table(r"https://urlandfile.com/dataset.csv", sep=",")
train.head()
jain