Python TypeError on regex

Question

So, I have this code:

import re
import urllib.request

url = 'http://google.com'
linkregex = re.compile('<a\s*href=[\'|"](.*?)[\'"].*?>')
m = urllib.request.urlopen(url)
msg = m.read()
links = linkregex.findall(msg)

But then Python returns this error:

links = linkregex.findall(msg)
TypeError: can't use a string pattern on a bytes-like object

What did I do wrong?

Which version of Python are you running? – Morten Kristensen Mar 03 '11 at 17:52

score 70 · Accepted Answer · edited Jun 20 '20 at 09:12

TypeError: can't use a string pattern on a bytes-like object

what did i do wrong??

You used a string pattern on a bytes object. Use a bytes pattern instead:

linkregex = re.compile(b'<a\s*href=[\'|"](.*?)[\'"].*?>')
                       ^
            Add the b there, it makes it into a bytes object
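
A minimal offline sketch of that fix, assuming a hard-coded sample page: the sample_html bytes below are made up for illustration and stand in for what m.read() returns. With a bytes pattern, findall() returns bytes objects, which can be decoded afterwards if text is needed.

```python
import re

# Hypothetical stand-in for the bytes returned by m.read().
sample_html = b'<html><a href="http://example.com/a">A</a> <a href=\'/b\'>B</a></html>'

# The b prefix makes this a bytes pattern; the r prefix keeps backslashes
# such as \s intact for the regex engine. The | from the original character
# class is dropped here, since inside [...] it would match a literal pipe.
linkregex = re.compile(rb'<a\s*href=[\'"](.*?)[\'"].*?>')

links = linkregex.findall(sample_html)                 # list of bytes objects
text_links = [link.decode('ascii') for link in links]  # decode if str is needed

print(links)
print(text_links)
```

The matches stay as bytes until they are decoded explicitly, so the choice of encoding is made in one visible place.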

(ps:

 >>> from disclaimer include dont_use_regexp_on_html
 "Use BeautifulSoup or lxml instead."

)
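
As a rough illustration of the parser route, here is a sketch that uses the standard library's html.parser as a stand-in for BeautifulSoup or lxml, which are third-party packages. The LinkCollector class name and the sample_html bytes are hypothetical, chosen only for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) pairs.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.links.append(value)


# Hypothetical page bytes; note the attribute before href and the
# uppercase tag, both of which the regex in the question would miss.
sample_html = b'<p><a class="x" href="http://example.com/a">A</a> <A HREF=\'/b\'>B</A></p>'

collector = LinkCollector()
collector.feed(sample_html.decode('utf-8'))  # the parser expects str, so decode first
collector.close()
print(collector.links)
```

The parser works on str, so the bytes still have to be decoded once, but attribute order and tag case no longer matter.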

answered Mar 03 '11 at 19:23 by Lennart Regebro

Will it break with python2? – Dilawar Oct 16 '16 at 07:36

Morten Kristensen · Answer 2 · 2011-03-03T18:00:37.160

3

If you are running Python 2.6 then there isn't any "request" in "urllib". So the third line becomes:

m = urllib.urlopen(url)

And in version 3 you should use this:

links = linkregex.findall(str(msg))

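This is because 'msg' is a bytes object and not a string, which is what findall() expects for a string pattern. Note that str(msg) returns the printable representation of the bytes (text beginning with b'), so decoding with the correct encoding is usually the better option. For instance, if "latin1" is the encoding then: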

links = linkregex.findall(msg.decode("latin1"))
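
To make the difference between the two calls concrete, here is a small offline sketch. It assumes a UTF-8 encoded body; the msg bytes are invented for the example and stand in for m.read().

```python
import re

# Hypothetical UTF-8 response body standing in for m.read().
msg = 'caf\u00e9 <a href="http://example.com/\u00e9">x</a>'.encode('utf-8')

as_repr = str(msg)             # printable repr of the bytes, starts with b'
as_text = msg.decode('utf-8')  # the actual text

linkregex = re.compile(r'<a\s*href=[\'"](.*?)[\'"].*?>')

repr_links = linkregex.findall(as_repr)  # non-ASCII bytes show up as \x.. escapes
text_links = linkregex.findall(as_text)  # non-ASCII characters are preserved

print(repr_links)
print(text_links)
```

Both calls avoid the TypeError, but only decode() gives back the real characters. When the encoding is not known in advance, the response headers often name it; m.headers.get_content_charset() is one place to look, though it may return None if the server does not declare a charset.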

answered Mar 03 '11 at 17:55 by Morten Kristensen (edited Mar 03 '11 at 18:00)

He says in the comments that he's running 3.1.3, so there *is* a `request`. – John Mar 03 '11 at 18:07
Indeed, saw that afterwards. So I added the solution for version 3 as well. – Morten Kristensen Mar 03 '11 at 18:08

score 1 · Answer 3 · answered May 07 '13 at 14:54

The regular expression pattern and string have to be of the same type. If you're matching a regular string, you need a string pattern. If you're matching a byte string, you need a bytes pattern.

In this case m.read() returns a byte string, so you need a bytes pattern. In Python 3, regular strings are unicode strings, and you need the b modifier to specify a byte string literal:

linkregex = re.compile(b'<a\s*href=[\'|"](.*?)[\'"].*?>')
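
The same-type rule can be checked without any network access. In this sketch the text and data values are made-up samples, and mismatch_raised is just a flag used to record the outcome.

```python
import re

text = 'href="x"'    # str
data = b'href="x"'   # bytes

str_pattern = re.compile(r'href="(.*?)"')
bytes_pattern = re.compile(rb'href="(.*?)"')

print(str_pattern.findall(text))    # str pattern on str: works
print(bytes_pattern.findall(data))  # bytes pattern on bytes: works

try:
    str_pattern.findall(data)       # str pattern on bytes: type mismatch
    mismatch_raised = False
except TypeError as exc:
    mismatch_raised = True
    print('TypeError:', exc)
```

The mismatch also fails in the other direction, with a bytes pattern applied to a str.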

score 1 · Answer 4 · answered Mar 03 '11 at 17:54

Well, my version of Python doesn't have a urllib with a request attribute but if I use "urllib.urlopen(url)" I don't get back a string, I get an object. This is the type error.

answered by Jeremy Whitlock

Here is the link to docs backing this up: http://docs.python.org/library/urllib.html#urllib.urlopen – Jeremy Whitlock Mar 03 '11 at 17:55
Those are docs for 2.7. The OP says in the comments that he's using 3.1.3. – John Mar 03 '11 at 18:14
John, read the docs. The API is still the same. – Jeremy Whitlock Mar 03 '11 at 18:15
My point is, *your* version doesn't have the request attribute, but the OP's version *does*. You are correct on the cause of the type error. – John Mar 03 '11 at 18:18
Yeah, the version was mentioned after I put my answer up. ;) – Jeremy Whitlock Mar 03 '11 at 18:20
I misread the question I guess. Thanks for the down vote. – Jeremy Whitlock Mar 03 '11 at 19:33
Yes, you get an object. But if you then do a read() on that object you get a string. But however, under Python 3, you get a bytes object. This is a Python 3 issue, and has to do with the separation of binary and text data under Python 3. This answer is incorrect and not useful. Sorry. – Lennart Regebro Mar 03 '11 at 20:51
@Jeremy: I didn't downvote you. I don't know who did, but I was downvoted too. Why would someone downvote us all like this? – John Mar 03 '11 at 22:49

John · Answer 5 · 2011-03-03T18:17:13.420

1

The url you have for Google didn't work for me, so I substituted http://www.google.com/ig?hl=en for it which works for me.

Try this:

import re
import urllib.request

url="http://www.google.com/ig?hl=en"
linkregex = re.compile('<a\s*href=[\'|"](.*?)[\'"].*?>')
m = urllib.request.urlopen(url)
msg = m.read()
links = linkregex.findall(str(msg))
print(links)

Hope this helps.

answered Mar 03 '11 at 18:04 by John (edited Mar 03 '11 at 18:17)

This only works if your system Python default encoding is the same as the web pages encoding. – Lennart Regebro Mar 03 '11 at 19:25

score 0 · Answer 6 · answered Jul 16 '16 at 18:15

This worked for me in Python 3. Hope this helps:

import urllib.request
import re
urls = ["https://google.com","https://nytimes.com","http://CNN.com"]
regex = '<title>(.+?)</title>'
pattern = re.compile(regex)

# Iterate over the URLs directly instead of keeping a manual index.
for url in urls:
    htmlfile = urllib.request.urlopen(url)
    htmltext = htmlfile.read()
    titles = re.search(pattern, str(htmltext))
    print(titles)

This version also works; here I added a b prefix to the regex to make it a bytes pattern, so it can be matched against the bytes from read() directly:

import urllib.request
import re
urls = ["https://google.com","https://nytimes.com","http://CNN.com"]
regex = b'<title>(.+?)</title>'
pattern = re.compile(regex)

# Iterate over the URLs directly instead of keeping a manual index.
for url in urls:
    htmlfile = urllib.request.urlopen(url)
    htmltext = htmlfile.read()
    titles = re.search(pattern, htmltext)
    print(titles)
