
I'm writing a web crawler and need to save the HTML of each page I crawl into my MongoDB database. This is what I'm trying to do (I'm using pymongo):

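        import urllib2

        # Fetch the page; read() returns the raw bytes exactly as the server sent them.
        c = urllib2.urlopen(myUrl)
        html = c.read()
        # Store the URL and the page body in the "urls" collection.
        db.urls.insert(
            {
                "url": myUrl,
                "HTML": html
            }
        )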

When I run my script, I get the following error:

InvalidStringData: strings in documents must be valid UTF-8

I looked up the error and gathered that I need to process the HTML somehow before saving it so that it is valid UTF-8, but I couldn't find out how.

I don't think my question is a duplicate of [python encoding utf-8](http://stackoverflow.com/questions/15092437/python-encoding-utf-8), since I don't see how that question relates to HTML. If I'm wrong, or my problem has nothing to do with HTML, please point me in the right direction.

user2980055
  • possible duplicate of [python encoding utf-8](http://stackoverflow.com/questions/15092437/python-encoding-utf-8) –  Jun 15 '15 at 14:20

1 Answer


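To convert the byte string to unicode:

html.decode('utf8')

This decodes the bytes as UTF-8 and returns a `unicode` object, which pymongo accepts. It raises `UnicodeDecodeError` if the page is not actually UTF-8-encoded.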
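
Since crawled pages are often not UTF-8, a plain `decode('utf8')` can fail, which matches the `UnicodeDecodeError` reported in the comments. Below is a minimal sketch of a more tolerant approach: try the charset the server declares, then UTF-8, then fall back to Latin-1. The helper names `charset_from_content_type` and `to_unicode` are hypothetical, not part of pymongo or urllib2, and the fallback order is an assumption rather than a rule.

```python
import re


def charset_from_content_type(content_type):
    """Return the charset named in a Content-Type header value, or None."""
    match = re.search(r"charset=[\"']?([\w.:-]+)", content_type or "", re.I)
    return match.group(1) if match else None


def to_unicode(raw, declared_charset=None):
    """Decode raw page bytes into text, trying several encodings in order."""
    for name in (declared_charset, "utf-8"):
        if not name:
            continue
        try:
            return raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: try the next candidate.
            continue
    # Latin-1 maps every byte to a code point, so this never raises,
    # but the text may be wrong if the page used a different encoding.
    return raw.decode("latin-1")
```

In the crawler from the question this might be used as `html = to_unicode(c.read(), charset_from_content_type(c.info().getheader("Content-Type")))`, assuming Python 2's urllib2 response object. The decoded value can then be passed to the insert call unchanged.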

silviud
  • Thanks for copying from the [duplicate](http://stackoverflow.com/questions/15092437/python-encoding-utf-8) and by the way it's "encode" and not "decode" that is required here. –  Jun 15 '15 at 15:03
  • The question says it is not a duplicate. Anyway, if you have a string `s = 'asdasdc'`, then `type(s)` is `str`, `s.encode('utf8')` gives `'asdasdc'`, and `s.decode('utf8')` gives `u'asdasdc'`, so the last one is unicode. – silviud Jun 15 '15 at 15:55
  • @user3561036 this is not a duplicate, dude; neither encode nor decode works for me. With both of them I get `UnicodeDecodeError`. – user2980055 Jun 16 '15 at 10:03
  • What Python version are you using? I tested on 2.6, dude. – silviud Jun 16 '15 at 16:40
  • By the way, you can try converting the data to binary. It will insert into the database, but I'm not sure it will allow text searches. See http://api.mongodb.org/python/current/api/bson/binary.html – silviud Jun 16 '15 at 16:47
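
The binary approach from the last comment could look roughly like the sketch below. It wraps the raw bytes in `bson.binary.Binary` so that no decoding is needed at all. The function name `build_page_document` is hypothetical, and the `ImportError` fallback is only there so the sketch runs where pymongo is not installed.

```python
try:
    from bson.binary import Binary  # the bson package is installed with pymongo
except ImportError:
    Binary = bytes  # stand-in so the sketch still runs where pymongo is absent


def build_page_document(url, raw_html):
    """Build a document that stores the page body as BSON binary data."""
    # The bytes are kept exactly as received; no charset guess is made.
    return {"url": url, "HTML": Binary(raw_html)}
```

The resulting dictionary can be passed to the same `db.urls.insert(...)` call used in the question. The trade-off is the one the comment raises: the page is stored as opaque bytes, so it has to be decoded after reading it back before it can be treated as text.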