I'm attempting to extract article information using the Python newspaper3k package and then write it to a CSV file. While the info is downloaded correctly, I'm having issues with the output to CSV. I don't think I fully understand Unicode, despite my efforts to read about it.

from newspaper import Article, Source
import csv

first_article = Article(url="http://www.bloomberg.com/news/articles/2016-09-07/asian-stock-futures-deviate-as-s-p-500-ends-flat-crude-tops-46")

first_article.download()
if first_article.is_downloaded:
    first_article.parse()
    first_article.nlp()  # must be called (with parentheses) to populate keywords and summary

article_array = []
collate = {}

collate['title'] = first_article.title
collate['content'] = first_article.text
collate['keywords'] = first_article.keywords
collate['url'] = first_article.url
collate['summary'] = first_article.summary
print(collate['content'])
article_array.append(collate)

keys = article_array[0].keys()
with open('bloombergtest.csv', 'w') as output_file:
    csv_writer = csv.DictWriter(output_file, keys)
    csv_writer.writeheader()
    csv_writer.writerows(article_array)

When I print collate['content'], which is first_article.text, the console outputs the article's content just fine. Everything shows up correctly, apostrophes and all. When I write to the CSV, the content cell text has odd characters in it. For example:

“At the end of the day, Europe’s economy isn’t in great shape, inflation doesn’t look exciting and there are a bunch of political risks to reckon with.

So far I have tried:

with open('bloombergtest.csv', 'w', encoding='utf-8') as output_file:

to no avail. I also tried utf-16 instead of utf-8, but then the cells were written in an odd order: the characters looked correct, but the cells weren't created correctly in the CSV. I've also tried calling .encode('utf-8') on various variables, but nothing has worked.

What's going on? Why would the console print the text correctly, while the CSV file has odd characters? How can I fix this?

sirryankennedy

3 Answers

Add encoding='utf-8-sig' to open(). Excel requires the UTF-8-encoded BOM code point (Byte Order Mark, U+FEFF) signature to interpret a file as UTF-8; otherwise, it assumes the default localized encoding.
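
As a minimal sketch of this fix, assuming Python 3: the `rows` list below is a made-up stand-in for the question's `article_array`, and the file name `bom_demo.csv` is hypothetical. The snippet writes a CSV with `utf-8-sig` and then reads the first three bytes back to confirm that the BOM is there. Passing `newline=''` follows the `csv` module documentation's recommendation for files handed to a writer.

```python
import csv
import os
import tempfile

# Hypothetical stand-in for the question's article_array.
rows = [
    {"title": "Europe’s economy",
     "content": "“At the end of the day, inflation doesn’t look exciting."},
]

# Write into a temporary directory so the sketch does not overwrite real files.
path = os.path.join(tempfile.mkdtemp(), "bom_demo.csv")

# 'utf-8-sig' writes the UTF-8 encoded BOM before the first character.
with open(path, "w", encoding="utf-8-sig", newline="") as output_file:
    csv_writer = csv.DictWriter(output_file, fieldnames=list(rows[0].keys()))
    csv_writer.writeheader()
    csv_writer.writerows(rows)

# Read the file back as raw bytes to inspect the signature.
with open(path, "rb") as raw_file:
    first_bytes = raw_file.read(3)

print(first_bytes)  # → b'\xef\xbb\xbf'
```

When reading such a file back in Python, opening it with `encoding='utf-8-sig'` strips the BOM again, so it does not end up in the first header name.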

Mark Tolonen

Changing with open('bloombergtest.csv', 'w', encoding='utf-8') as output_file: to with open('bloombergtest.csv', 'w', encoding='utf-8-sig') as output_file: worked, as recommended by Leon and Mark Tolonen.

sirryankennedy

That's most probably a problem with the software that you use to open or print the CSV file: it doesn't "understand" that the CSV is encoded in UTF-8 and assumes ASCII or a single-byte encoding such as Latin-1 (ISO-8859-1) or Windows-1252 for it.

You can aid that software in recognizing the CSV file's encoding by placing a BOM sequence in the beginning of your file (which, in general, is not recommended for UTF-8).
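
To illustrate the mismatch described above, here is a small sketch that takes text containing curly quotes, encodes it as UTF-8 (what the file contains), and then decodes the same bytes as Windows-1252. Windows-1252 is only an assumption about what the viewer falls back to; the actual default depends on the system locale.

```python
# Text with a curly opening quote and a curly apostrophe, as in the article.
original = "“Europe’s economy"

# These are the bytes that end up in a UTF-8 encoded file.
utf8_bytes = original.encode("utf-8")

# A viewer that wrongly assumes Windows-1252 maps every byte to one character,
# so each three-byte curly quote turns into three unrelated characters.
misread = utf8_bytes.decode("cp1252")
print(misread)  # → â€œEuropeâ€™s economy

# Decoding with the encoding that was actually used recovers the text.
recovered = utf8_bytes.decode("utf-8")
```

The bytes on disk are the same in both cases; only the interpretation differs, which is why the console (which decodes correctly) and the spreadsheet can disagree.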

Leon
  • I'm opening it in Excel. Is there no way to write universal characters? – sirryankennedy Sep 10 '16 at 12:01
  • @sirryankennedy Have you tried writing UTF-8 with BOM (as shown in the linked answer)? – Leon Sep 10 '16 at 13:13
  • @sirryankennedy: there is no "universal" encoding. Even plain ASCII is not "universal". If you want to use a one-byte encoding, convert to one that contains your curly quotes, such as Windows-1252. – Jongware Sep 10 '16 at 15:27
  • Yep, adding sig to the encoding works. Thank you all! – sirryankennedy Sep 10 '16 at 18:38
  • @sirryankennedy: if this (or another) answer worked for you, you should consider [marking it as Accepted](http://stackoverflow.com/help/someone-answers). You may not be aware of this as you chose to skip the introductory [tour] when signing up. – Jongware Sep 10 '16 at 20:46
  • Got it! Thanks. – sirryankennedy Sep 11 '16 at 23:26
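
Jongware's suggestion of a one-byte encoding can be sketched as follows, assuming the target program defaults to Windows-1252 (this depends on the system locale). The sample row and the file name `cp1252_demo.csv` are hypothetical. `errors='replace'` is a choice made for this sketch so that characters outside Windows-1252 become `?` instead of raising `UnicodeEncodeError`.

```python
import csv
import os
import tempfile

# Hypothetical sample row containing curly apostrophes.
sample = {"content": "Europe’s economy isn’t in great shape"}

path = os.path.join(tempfile.mkdtemp(), "cp1252_demo.csv")

# Windows-1252 has a slot for the curly apostrophe (byte 0x92),
# so the text survives without a BOM.
with open(path, "w", encoding="cp1252", errors="replace", newline="") as output_file:
    csv_writer = csv.DictWriter(output_file, fieldnames=["content"])
    csv_writer.writeheader()
    csv_writer.writerow(sample)

with open(path, "rb") as raw_file:
    encoded = raw_file.read()

print(b"\x92" in encoded)  # → True
```

The trade-off is that any character without a Windows-1252 equivalent is lost, whereas `utf-8-sig` keeps the full Unicode range.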