
I am trying to read a gzipped CSV file from a URL. It is a large file with more than 50,000 lines. When I run the code below, I get an error: _csv.Error: line contains NULL byte

import csv
import urllib2

url = '[my-url-to-csv-file].gz'
response = urllib2.urlopen(url)
cr = csv.reader(response)

for row in cr:
    if len(row) <= 1: continue
    print row

If I print the content of the response before parsing it, I get something like this:

?M}?7?M==??7M???z?YJ?????5{Ci?jK??3b??p?

?[?=?j&=????=?0u'???}mwBt??-E?m??Ծ??????WM??wj??Z??ėe?D?VF????4=Y?Y?tA???

How can I read the gzipped csv file from this URL properly?

Daniel
biancamihai
  • I don't think you need `csv.reader` here...have you tried `response = urllib2.urlopen(url)` `data = response.read()` `response.close()` `for line in data: print line`? – Daniel Aug 24 '15 at 02:50
  • If I try this method I get content, but I think it is badly encoded. I get something like: ```% s Z ? o ? J 1 v ? } ? ? D ? ? ? ? ? ? ? ? ? ?``` – biancamihai Aug 24 '15 at 06:28
  • try `for line in data: line = line.decode('utf-8') print line` https://docs.python.org/dev/tutorial/stdlib.html#internet-access – Daniel Aug 24 '15 at 06:33
  • Yes, I tried that, but I get errors. Does it matter that the CSV is gzipped? – biancamihai Aug 24 '15 at 10:14
  • It does matter; see my answer below. – Daniel Aug 24 '15 at 18:58

2 Answers


How to Open a .gz (gzip) csv File from a URL with urllib2.urlopen

  1. Read the response body into an in-memory file object. For this, you can use StringIO.StringIO().
  2. Decompress the .gz data with gzip.GzipFile().
  3. Read the decompressed data from the new file object.

To use your example:

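from StringIO import StringIO
import gzip
import urllib2

url = '[my-url-to-csv-file].gz'
# Buffer the compressed bytes in an in-memory file object for GzipFile
mem = StringIO(urllib2.urlopen(url).read())
f = gzip.GzipFile(fileobj=mem, mode='rb')
data = f.read()

# data is a single string, so split it into lines before iterating
for line in data.splitlines():
    print line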
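
For readers on Python 3, the same three steps might look roughly like the sketch below. It assumes Python 3, where urllib2 became urllib.request and an in-memory byte buffer is io.BytesIO. It also assumes the CSV is UTF-8 encoded. The helper names rows_from_gzipped_csv and fetch_rows are hypothetical and chosen only for illustration.

```python
import csv
import gzip
import io
import urllib.request


def rows_from_gzipped_csv(compressed, encoding="utf-8"):
    """Decompress gzip bytes held in memory and return the parsed CSV rows."""
    mem = io.BytesIO(compressed)                # step 1: in-memory file object
    gz = gzip.GzipFile(fileobj=mem, mode="rb")  # step 2: decompress
    # csv.reader needs text in Python 3, so wrap the byte stream in a decoder.
    # The encoding is an assumption; adjust it to match the actual file.
    with io.TextIOWrapper(gz, encoding=encoding, newline="") as text:
        return list(csv.reader(text))           # step 3: read the data


def fetch_rows(url):
    """Download a .gz CSV from url and parse it (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        return rows_from_gzipped_csv(response.read())
```

Calling fetch_rows('[my-url-to-csv-file].gz') would then return a list of rows, which can be filtered with the same len(row) <= 1 check used in the question. Keeping the decompression separate from the download makes it possible to try the parsing step on locally built gzip data without a network connection.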
Daniel

Wrap the row handling in a try/except block. If you don't care what happens when a row contains a NULL byte, you can use pass in the except clause:

for row in cr:
    try:
        if len(row) <= 1: continue
        print row
    except Exception, e:
        print e
        # or, if you're not worried about errors, you can use pass
rofls
  • Egad! This is definitely not ideal - http://stackoverflow.com/questions/21553327/why-is-except-pass-a-bad-programming-practice – Daniel Aug 24 '15 at 03:01
  • Sure, that's true. They could also do: `except Exception, e: print e` and then they can read their data without the program being interrupted by `NULL` bytes. – rofls Aug 24 '15 at 03:06
  • Odd, I still get the same error, but the error is on the ```for row in cr:``` line – biancamihai Aug 24 '15 at 06:26