
I have a simple project that scrapes reviews from a tourist site and stores them in an Excel file. Reviews can be in Spanish, Japanese, or any other language, and they sometimes contain special symbols like "❤❤".

I need to store all the data (special symbols can be excluded if they can't be written).

I am able to scrape the data I want and print it to the console as-is (including the Japanese text), but the problem is with storing it in the CSV file; it shows the error message below.

I tried opening the file with UTF-8 encoding (as mentioned in the comment below), but then it saves the data as weird symbols that make no sense... and I couldn't find an answer to the problem. Any suggestions?

I am using Python 3.5.3

My Python code:

from selenium import webdriver
from bs4 import BeautifulSoup
import time
import re

file = "TajMahalSpanish.csv"
f = open(file, "w")
headers = "rating, title, review\n"
f.write(headers)

pages = 119
pageNumber = 2
option = webdriver.ChromeOptions()
option.add_argument("--incognito")

# raw string so the backslashes in the Windows path are not treated as escape sequences
browser = webdriver.Chrome(executable_path=r'C:\Program Files\JetBrains\PyCharm Community Edition 2017.1.5\chrome webdriver\chromedriver', chrome_options=option)

browser.get("https://www.tripadvisor.in/Attraction_Review-g297683-d317329-Reviews-Taj_Mahal-Agra_Agra_District_Uttar_Pradesh.html")
time.sleep(10)
browser.find_element_by_xpath('//*[@id="taplc_location_review_filter_controls_0_form"]/div[4]/ul/li[5]/a').click()
time.sleep(5)
browser.find_element_by_xpath('//*[@id="BODY_BLOCK_JQUERY_REFLOW"]/span/div[1]/div/form/ul/li[2]/label').click()
time.sleep(5)

while pages:
    html = browser.page_source
    soup = BeautifulSoup(html, "html.parser")
    containers = soup.find_all("div",{"class":"innerBubble"})

    showMore = soup.find("span", {"onclick": "widgetEvCall('handlers.clickExpand',event,this);"})
    if showMore:
        browser.find_element_by_xpath("//span[@onclick=\"widgetEvCall('handlers.clickExpand',event,this);\"]").click()
        time.sleep(3)
        html = browser.page_source
        soup = BeautifulSoup(html, "html.parser")
        containers = soup.find_all("div", {"class": "innerBubble"})
        showMore = False

    for container in containers:
        bubble = container.div.div.span["class"][1]
        title = container.div.find("div", {"class": "quote"}).a.span.text
        review = container.find("p", {"class": "partial_entry"}).text
        f.write(bubble + "," + title.replace(",", "|").replace("\n", "...") + "," + review.replace(",", "|").replace("\n", "...") + "\n")
        print(bubble)
        print(title)
        print(review)
    browser.find_element_by_xpath("//div[@class='ppr_rup ppr_priv_location_reviews_list']//div[@class='pageNumbers']/span[@data-page-number='" + str(pageNumber) + "']").click()
    time.sleep(5)
    pages -= 1
    pageNumber += 1

f.close()

I am getting the following error:

Traceback (most recent call last):
  File "C:/Users/Akshit/Documents/pycharmProjects/spanish.py", line 45, in <module>
    f.write(bubble + "," + title.replace(",", "|").replace("\n", "...") + "," + review.replace(",", "|").replace("\n", "...") + "\n")
  File "C:\Users\Akshit\AppData\Local\Programs\Python\Python35\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 10-18: character maps to <undefined>

Process finished with exit code 1
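As noted in the comments below, the error can be reproduced without any scraping, simply by writing a Japanese string to a file directly. A minimal sketch (the `cp1252` encoding is made explicit here to match the traceback, since on Windows `open()` picks it up from the locale by default; the file name `demo.csv` is just for illustration):

```python
text = "美しい幾何学模様! ❤❤"

# Explicitly using cp1252 mirrors what open(file, "w") does implicitly on a
# Windows machine with a Western-European locale code page.
try:
    with open("demo.csv", "w", encoding="cp1252") as f:
        f.write(text)
except UnicodeEncodeError as e:
    print(e)  # 'charmap' codec can't encode characters ...

# Passing an explicit UTF-8 encoding makes the same write succeed:
with open("demo.csv", "w", encoding="utf-8") as f:
    f.write(text)
```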

UPDATE

I am trying a workaround for this problem. In the end I need to translate the Japanese reviews to English for the research anyway, so maybe I can use one of the Google APIs to translate the string in the code itself before writing it, and then write it into the CSV file...

  • Try `f = open(file, "w", encoding='utf-8')`. On a side note, opening files with a context manager is more robust, and I would separate the different parts of the program into different functions (getting the content, scraping the content, writing out the results) – Maarten Fabré Aug 03 '17 at 13:58
  • Possible duplicate of [Handling non-standard American English Characters and Symbols in a CSV, using Python](https://stackoverflow.com/questions/12357261/handling-non-standard-american-english-characters-and-symbols-in-a-csv-using-py) – JeffC Aug 03 '17 at 14:43
  • @MaartenFabré that removes the error, but it is not actually printing the same thing into the file... e.g. it should print **"美しい幾何学模様!"** but instead it prints **"美ã—ã„幾何学模様ï¼"** – Akshit Agarwal Aug 03 '17 at 23:43
  • @JeffC I don't think so; I can scrape the data I need and print it to the console as-is... it's just that I am not able to write it as-is into the CSV file. Please check the question again; it has been edited now. – Akshit Agarwal Aug 04 '17 at 00:20
  • Is the data printed correctly to `stdout`? Perhaps BeautifulSoup parses it with the incorrect encoding https://stackoverflow.com/questions/20205455/how-to-correctly-parse-utf-8-encoded-html-to-unicode-strings-with-beautifulsoup#20215100 – Maarten Fabré Aug 04 '17 at 08:04
  • @MaartenFabré yes, I am able to print it correctly (please check the question again, I have edited it). The data is stored in my variable exactly as I need it, but I could not write it into the CSV file... I even tried putting a Japanese string into a variable directly (without scraping) and storing it in the file, and it doesn't work, so the problem is with f.write() into the CSV file – Akshit Agarwal Aug 04 '17 at 10:58
  • What program do you use to open the CSV, and does this program parse it as Unicode? – Maarten Fabré Aug 04 '17 at 10:59
  • @MaartenFabré I am using MS Excel 2015 – Akshit Agarwal Aug 04 '17 at 11:01
  • So you've found the source of your problem https://stackoverflow.com/a/6488070/1562285 – Maarten Fabré Aug 04 '17 at 11:19
  • @MaartenFabré found it, please check the edit (in the question), thanks a lot – Akshit Agarwal Aug 04 '17 at 12:41

1 Answer

UPDATE

Found the solution in

Is it possible to force Excel recognize UTF-8 CSV files automatically?

as suggested by @MaartenFabré in the comments.

Basically, from what I understood, the problem is that Excel has trouble reading a CSV file saved with UTF-8 encoding, so when I open the CSV file (made via Python) directly in Excel, all the data appears corrupted.

The solution:

  1. Save the data in a text file in Python, instead of a CSV
  2. Open Excel
  3. Go to import external data and import using the txt file
  4. Select file type "Delimited" and file origin "65001 : Unicode (UTF-8)"
  5. Select "," as the delimiter (your choice) and import
  6. The data is shown correctly in Excel, in proper rows and columns, for every language... Japanese, Spanish, French, etc.
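The manual import can also be avoided entirely (a sketch based on the linked answer, not part of the steps above): writing the file with the `utf-8-sig` encoding prepends a byte-order mark, which Excel uses to detect UTF-8 when the file is double-clicked. The `csv` module also takes care of commas and newlines inside fields, so the manual `.replace()` calls become unnecessary. The sample row below is a made-up stand-in for the scraped data:

```python
import csv

# Sample rows standing in for the scraped (rating, title, review) tuples.
rows = [
    ["bubble_50", "美しい幾何学模様!", "Hermoso monumento ❤❤"],
]

# "utf-8-sig" writes a UTF-8 byte-order mark first, which Excel uses to pick
# the right encoding; newline="" is the csv-module convention on Windows.
with open("TajMahalSpanish.csv", "w", encoding="utf-8-sig", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["rating", "title", "review"])
    writer.writerows(rows)
```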

Again, thanks to @MaartenFabré for the help!

marc_s