Your problem is that urlopen returns a bytes-oriented file-like object, while io.open expects true text inputs (where "text" means "unicode on Python 2, str on Python 3").
The only thing you need to change is to decode the result of calling read; it returns bytes, and you need true text. To decode it correctly, you need to know the encoding, either by hard-coding it or by inspecting the response headers (it's most likely UTF-8 or, much less likely, cp1252, but it could be something weird).
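To see the mismatch without touching the network, here is a minimal sketch in which a hard-coded byte string stands in for the result of archivo.read(). The sample text and the temporary file path are assumptions made for illustration only:

```python
import io
import os
import tempfile

# Stand-in for archivo.read(): urlopen hands back raw bytes, not text.
raw = u"Alicia en el pa\u00eds de las maravillas\n".encode("utf-8")

path = os.path.join(tempfile.mkdtemp(), "alice.txt")
with io.open(path, "w", encoding="utf-8") as libro:
    try:
        libro.write(raw)  # bytes into a text-mode file
        rejected = False
    except TypeError:
        rejected = True  # text-mode files refuse bytes on Python 2 and 3
    libro.write(raw.decode("utf-8"))  # decoded text is accepted

# Read the file back as text to confirm what was written.
with io.open(path, encoding="utf-8") as libro:
    contenido = libro.read()
```

Only the decoded write reaches the file; the attempt to write raw bytes fails before anything is written.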
In any event, knowing that, the only change you'd need to make is to change:
    libro.write(archivo.read())
to:
    libro.write(archivo.read().decode(knownencoding))
If you're pretty sure the server is always providing UTF-8 output, then:
    libro.write(archivo.read().decode('utf-8'))
is sufficient. Yes, it's mildly wasteful (you decode it only to write it to a stream that immediately reencodes it), but importantly, this gives you a guarantee that the bytes you received were interpretable as valid UTF-8, which dumping the raw bytes to disk won't guarantee.
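The validation point can be illustrated with a small sketch. The sample string is an assumption chosen because its cp1252 encoding happens not to be valid UTF-8:

```python
texto = u"caf\u00e9"

como_utf8 = texto.encode("utf-8")     # the e-acute becomes two bytes
como_cp1252 = texto.encode("cp1252")  # the e-acute becomes one byte, 0xE9

# Valid UTF-8 round-trips cleanly.
ida_y_vuelta = como_utf8.decode("utf-8") == texto

# Bytes in another encoding are rejected instead of being silently
# written to disk as garbage.
try:
    como_cp1252.decode("utf-8")
    detectado = False
except UnicodeDecodeError:
    detectado = True
```

Note that this only catches input that is not valid UTF-8. Some byte sequences in other encodings also happen to be valid UTF-8, so a successful decode is a sanity check, not proof that the server really sent UTF-8.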
A more elaborate solution inspects the headers:
    import urllib2
    import io
    import string
    def n_palabras(x):
        archivo = urllib2.urlopen(x)
        # Find charset in headers, if it exists
        for p in archivo.headers.plist:
            key, sep, value = p.partition('=')
            if sep and key.strip().lower() == 'charset':
                encoding = value.strip()
                break
        else:
            encoding = 'utf-8'
        data = archivo.read()
        try:
            # Try to use parsed charset
            data = data.decode(encoding)
        except UnicodeDecodeError:
            # If that fails, try UTF-8 as fallback; let exception bubble
            # if this fails too
            data = data.decode('utf-8')
        with io.open("alice.txt", "w", encoding="utf-8") as libro:
            libro.write(data)
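The header-parsing step can also be pulled out into a small helper that is easy to test without a network connection. The following is a sketch, not part of the code above: the name charset_from_content_type and its default parameter are hypothetical, and it assumes you pass in the raw Content-Type header value as a string. It additionally strips quotes around the charset and falls back to the default when Python does not recognize the codec name, two cases the loop above does not handle:

```python
import codecs


def charset_from_content_type(value, default="utf-8"):
    """Return the charset named in a Content-Type header value.

    Hypothetical helper: falls back to `default` when no charset
    parameter is present or when Python has no codec by that name.
    """
    # Skip the media type itself; the rest are "key=value" parameters.
    for part in value.split(";")[1:]:
        key, sep, raw_name = part.partition("=")
        if sep and key.strip().lower() == "charset":
            name = raw_name.strip().strip("\"'")
            if not name:
                return default
            try:
                codecs.lookup(name)  # raises LookupError if unknown
            except LookupError:
                return default
            return name
    return default
```

With something like this in place, the loop in n_palabras could be replaced by a single call on the response's Content-Type header value, keeping the UnicodeDecodeError fallback as it is.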