
I'm currently going through the Python Challenge, and I'm up to level 4 (see here). I have only been learning Python for a few months, and I'm trying to learn Python 3 rather than 2.x. So far so good, except when I use this bit of code. Here's the Python 2.x version:

import urllib, re
prefix = "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing="
findnothing = re.compile(r"nothing is (\d+)").search
nothing = '12345'
while True:
    text = urllib.urlopen(prefix + nothing).read()
    print text
    match = findnothing(text)
    if match:
        nothing = match.group(1)
        print "   going to", nothing
    else:
        break

To convert this to Python 3, I changed it to this:

import urllib.request, urllib.parse, urllib.error, re
prefix = "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing="
findnothing = re.compile(r"nothing is (\d+)").search
nothing = '12345'
while True:
    text = urllib.request.urlopen(prefix + nothing).read()
    print(text)
    match = findnothing(text)
    if match:
        nothing = match.group(1)
        print("   going to", nothing)
    else:
        break

If I run the 2.x version it works fine: it goes through the loop, scraping each URL until it reaches the end. I get the following output:

and the next nothing is 72198
   going to 72198
and the next nothing is 80992
   going to 80992
and the next nothing is 8880
   going to 8880 etc

If I run the 3.x version, I get the following output:

b'and the next nothing is 44827'
Traceback (most recent call last):
  File "C:\Python32\lvl4.py", line 26, in <module>
    match = findnothing(b"text")
TypeError: can't use a string pattern on a bytes-like object

So if I change the r to a b in this line:

findnothing = re.compile(b"nothing is (\d+)").search

I get:

b'and the next nothing is 44827'
   going to b'44827'
Traceback (most recent call last):
  File "C:\Python32\lvl4.py", line 24, in <module>
    text = urllib.request.urlopen(prefix + nothing).read()
TypeError: Can't convert 'bytes' object to str implicitly

Any ideas?

I'm pretty new to programming, so please don't bite my head off.

_bk201

Sven Marnach
bk201

3 Answers


You can't mix bytes and str objects implicitly.

The simplest thing would be to decode bytes returned by urlopen().read() and use str objects everywhere:

text = urllib.request.urlopen(prefix + nothing).read().decode() #note: utf-8

The page doesn't specify a preferred character encoding via the Content-Type header or a <meta> element. I don't know what the default encoding should be for text/html, but RFC 2068 says:

When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP.
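
As a rough sketch of how the decoding step might look in the full loop: the code below reads the charset from the Content-Type header when there is one and otherwise falls back to ISO-8859-1, the RFC 2068 default quoted above. The names fetch_text, follow_chain and max_steps are made up for this illustration, and the fallback choice is an assumption rather than something the page requires.

```python
import re
import urllib.request

PREFIX = "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing="
find_nothing = re.compile(r"nothing is (\d+)").search


def fetch_text(url):
    """Hypothetical helper: fetch a URL and return its body as str."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()  # bytes in Python 3
        # Charset from the Content-Type header, or None if it is absent.
        charset = response.headers.get_content_charset()
    # Assumption: fall back to ISO-8859-1 when no charset is declared.
    return raw.decode(charset or "iso-8859-1")


def follow_chain(start, fetch=fetch_text, max_steps=500):
    """Follow the 'nothing' values; return the text of the last page fetched.

    max_steps is an arbitrary safety limit added for this sketch.
    """
    nothing = start
    text = ""
    for _ in range(max_steps):
        text = fetch(PREFIX + nothing)
        match = find_nothing(text)
        if not match:
            break
        # match.group(1) is a str, so it can be appended to PREFIX directly.
        nothing = match.group(1)
    return text
```

Because everything after fetch_text is str, the original r"..." pattern and the prefix + nothing concatenation both work unchanged. Passing a different fetch callable makes it possible to try the loop without touching the network.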

jfs

Regular expressions make sense only on text, not on binary data. So keep findnothing = re.compile(r"nothing is (\d+)").search, and decode text to a str instead.
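
A small offline sketch of that advice, using a hard-coded bytes value in place of the real urlopen(...).read() result. The UTF-8 choice is an assumption; the sample body is plain ASCII, so most encodings would give the same result here.

```python
import re

find_nothing = re.compile(r"nothing is (\d+)").search

# Stand-in for what urlopen(...).read() returns in Python 3.
raw = b"and the next nothing is 44827"

# A str pattern applied to bytes raises TypeError, as in the question.
try:
    find_nothing(raw)
    raised_type_error = False
except TypeError:
    raised_type_error = True

# Decode once, then keep working with str everywhere.
text = raw.decode("utf-8")  # assumption: UTF-8 (the sample is pure ASCII)
match = find_nothing(text)
next_nothing = match.group(1)  # a str, safe to append to the URL prefix
print(next_nothing)
```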

Valentin Lorentz

Instead of urllib, we use requests, which gives you two ways to read the response body (urllib may have similar options you can look for):

Response object

>>> import requests
>>> response = requests.get('https://api.github.com')

Using response.content - you get the raw body, as bytes

>>> response.content
b'{"current_user_url":"https://api.github.com/user","current_us...."}'

Using response.text - you get the decoded body, as str

>>> response.text
'{"current_user_url":"https://api.github.com/user","current_us...."}'

requests picks the encoding from the response headers (falling back to a guess based on the content), but you can override it right after the request like so:

>>> import requests
>>> response = requests.get('https://api.github.com')
>>> response.encoding = 'SOME_ENCODING'

After that, response.text will decode the content using the encoding you set.
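
To see why the chosen encoding matters, here is a standard-library sketch (not requests itself) that decodes the same bytes two ways. The variable content only plays the role of response.content, and the sample bytes are invented for the illustration.

```python
# Stand-in for response.content: the UTF-8 bytes of "café".
content = b"caf\xc3\xa9"

# Decoding with the encoding the bytes were written in gives the right text.
as_utf8 = content.decode("utf-8")

# Decoding with a different encoding still succeeds, but yields mojibake:
# each of the two bytes of "é" becomes its own character.
as_latin1 = content.decode("iso-8859-1")

print(as_utf8)
print(as_latin1)
```

This is roughly the difference you would expect from response.text before and after changing response.encoding on a body that contains non-ASCII characters.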

Ricky Levi