I am trying to scrape the paragraphs from a Wikipedia page.

I am getting this error:

UnicodeEncodeError: 'charmap' codec can't encode character u'\u2013' in position 530: character maps to <undefined>

For example, I used this Wikipedia page and wrote the following Python script with BeautifulSoup and requests:

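from bs4 import BeautifulSoup
import requests

# r is the requests response for the page (the fetch itself is not shown in the post)
soup = BeautifulSoup(r.content, "html.parser")
# Print the text of every paragraph, separated by blank lines (Python 2 syntax)
for i in soup.find_all("p"):
    print i.text
    print "\n"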
Chris Martin
MrV
  • Have you tried `set PYTHONIOENCODING=UTF-8`? – Sangbok Lee Mar 11 '17 at 08:12
  • Check [this](http://stackoverflow.com/questions/41729822/problems-writing-scraped-data-to-csv-with-slavic-characters-unicodeencodeerror) for detail. – Sangbok Lee Mar 11 '17 at 08:12
  • You should also edit your example to make it more complete. On which line does the error occur? The problem could very well be in the request object encoding, but it would be easier to test for that if you included the part where you fetch the request. Edit: never mind, the problem is most likely this: http://stackoverflow.com/questions/5419/python-unicode-and-the-windows-console – Teemu Risikko Mar 11 '17 at 08:56
  • You could use the [Wikipedia API](https://www.mediawiki.org/wiki/API:Main_page) instead of crawling the pages yourself. – Christos Papoulas Mar 13 '17 at 09:39
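
The comments above point at the console encoding rather than at BeautifulSoup or requests. Assuming that is indeed the cause, here is a minimal sketch of a workaround that replaces any character the console codec cannot represent before printing. `console_safe` is a hypothetical helper (not part of any library), and `cp437` is only an assumed example of a Windows console code page that lacks U+2013. The `__future__` import is there so the sketch behaves the same on Python 2 and Python 3.

```python
from __future__ import print_function, unicode_literals

import sys


def console_safe(text, encoding=None):
    """Replace characters that `encoding` cannot represent with '?'.

    Hypothetical helper for illustration. If no encoding is given, it falls
    back to the encoding reported by sys.stdout, then to UTF-8.
    """
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    # Encode with the "replace" error handler, then decode back to text.
    return text.encode(encoding, "replace").decode(encoding)


# U+2013 (EN DASH) is the character named in the error message.
sample = "1990\u20131995"

# cp437 (assumed console code page) has no en dash, so it becomes '?'.
print(console_safe(sample, "cp437"))  # prints 1990?1995
```

In the script from the question, this would mean printing `console_safe(i.text)` instead of `i.text`. Note that this hides the problem characters rather than displaying them; setting `PYTHONIOENCODING=UTF-8` as suggested in the first comment is the alternative if the console can actually show UTF-8 output.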

0 Answers