I need to normalize text from the Italian Wikipedia using Python 3 and NLTK, and I've run into one problem. Most words come out fine, but some are mapped incorrectly; more precisely, some characters in them are.

For example:

'fruibilit\xe3', 'n\xe2\xba', 'citt\xe3'

I'm fairly sure the problem lies with accented characters such as à and è.

Code:

# coding: utf8
import os

from nltk import corpus, word_tokenize, ConditionalFreqDist


it_sw_plus = corpus.stopwords.words('italian') + ['doc', 'https']
#it_folder_names = ['AA', 'AB', 'AC', 'AD', 'AE', 'AF']
it_path = os.listdir('C:\\Users\\1\\projects\\i')
it_corpora = []

def normalize(raw_text):
    tokens = word_tokenize(raw_text)
    norm_tokens = []
    for token in tokens:
        if token not in it_sw_plus and token.isalpha():
            token = token.lower().encode('utf8')
            norm_tokens.append(token)
    return norm_tokens

for folder_name in it_path:
    path_to_files = 'C:\\Users\\1\\projects\\i\\%s' % (folder_name)
    files_list = os.listdir(path_to_files)
    for file_name in files_list:
        file_path = path_to_files + '\\' + file_name
        text_file = open(file_path)
        raw_text = text_file.read().decode('utf8')
        norm_tokens = normalize(raw_text)
        it_corpora.append(norm_tokens)
    print(it_corpora)

How can I resolve this problem? I'm running Windows 7 (Russian locale).

When I try this code:

import io

with open('C:\\Users\\1\\projects\\i\\AA\\wiki_00', 'r', encoding='utf8') as fin:
    for line in fin:
        print (line) 

In PowerShell:

    <doc id="2" url="https://it.wikipedia.org/wiki?curid=2" title="Armonium">

Armonium



Traceback (most recent call last):
  File "i.py", line 5, in <module>
    print (line)
  File "C:\Python35-32\lib\encodings\cp866.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in position 3: character maps to <undefined>

In Python command line:

<doc id="2" url="https://it.wikipedia.org/wiki?curid=2" title="Armonium">

Armonium



Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\1\projects\i.py", line 5, in <module>
    print (line)
  File "C:\Python35-32\lib\encodings\cp866.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in position
3: character maps to <undefined>

When I try the `urllib.request` code from the answer:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python35-32\lib\encodings\cp866.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in position
90: character maps to <undefined>
GiveItAwayNow
  • Could you post the original file online? Otherwise it's hard to know what is the true encoding of the file. – alvas Jan 04 '16 at 16:39
  • https://github.com/GiteItAwayNow/TrueTry/blob/master/it – GiveItAwayNow Jan 04 '16 at 16:46
  • `open()` with `encoding=` parameter can only be used in `python3`! In python2 use `import io; io.open(filename, 'r', encoding='utf8')` – alvas Jan 04 '16 at 17:22
  • Try installing this: https://pypi.python.org/packages/source/w/win_unicode_console/win_unicode_console-0.4.zip and then correct your code to use `io.open` for python2 or stick with only python3 and see whether you read and print the text properly. – alvas Jan 04 '16 at 17:24
  • alvas, sorry, my mistake. But when I use python3 in PowerShell, it raises the same error as in the Python command line. I've tried the code in Python IDLE and it works correctly. – GiveItAwayNow Jan 04 '16 at 17:29
  • Powershell has strange default encoding for stdout. In powershell, what is your output for `import sys; print (sys.stdout.encoding)`? Did you see something like `cp850`? – alvas Jan 04 '16 at 17:33
  • If you see `cp850`, then please install the https://pypi.python.org/packages/source/w/win_unicode_console/win_unicode_console-0.4.zip , it should work better than hacking the environment variables, see http://stackoverflow.com/questions/14630288/unicodeencodeerror-charmap-codec-cant-encode-character-maps-to-undefined – alvas Jan 04 '16 at 17:35
  • Ok. Thank you, alvas. – GiveItAwayNow Jan 04 '16 at 17:37
  • Did the `win-unicode-console` work? – alvas Jan 04 '16 at 21:00
  • alvas, thank you! It works. – GiveItAwayNow Jan 05 '16 at 14:15

1 Answer

If you know the file's encoding, specify it when reading the file. In Python 2:

import io
with io.open(filename, 'r', encoding='latin-1') as fin:
    for line in fin:
        print line # line is a unicode string decoded from latin-1

In your case, however, the file you've posted isn't a latin-1 file but a UTF-8 file. In Python 3:

>>> import urllib.request
>>> url = 'https://raw.githubusercontent.com/GiteItAwayNow/TrueTry/master/it'
>>> response = urllib.request.urlopen(url)
>>> data = response.read()
>>> text = data.decode('utf8')
>>> print (text) # this prints the file perfectly.
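
One way to check the encoding yourself, as a minimal sketch: strict UTF-8 decoding raises on most non-UTF-8 input, whereas latin-1 decoding never raises and silently produces the wrong characters. The helper name `looks_like_utf8` is hypothetical (it is not part of any library), and a clean decode is only a strong hint, not proof.

```python
def looks_like_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode('utf8')  # strict by default: raises on invalid sequences
    except UnicodeDecodeError:
        return False
    return True

utf8_bytes = 'citt\u00e0'.encode('utf8')       # b'citt\xc3\xa0'
latin1_bytes = 'citt\u00e0'.encode('latin-1')  # b'citt\xe0'

print(looks_like_utf8(utf8_bytes))
print(looks_like_utf8(latin1_bytes))

# Decoding UTF-8 bytes as latin-1 never raises; it just yields mojibake.
mojibake = utf8_bytes.decode('latin-1')
print(ascii(mojibake))
```

The last line prints the escaped form so that the demonstration itself does not depend on what the console can display.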

To read a 'utf8' file in python2:

import io
with io.open(filename, 'r', encoding='utf8') as fin:
    for line in fin:
        print (line) # line is a unicode string decoded from utf8

To read a 'utf8' file, in python3:

with open(filename, 'r', encoding='utf8') as fin:
    for line in fin:
        print (line) # line is a str decoded from utf8
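
Applied to the code in the question, a minimal Python 3 sketch might look like the following, assuming the goal is a list of lowercase alphabetic tokens with stopwords removed. The tokens stay as `str` throughout; calling `.encode('utf8')` on them in Python 3 would turn them into `bytes`. The `tokenize` parameter, the `read_and_normalize` helper, and the sample stopword set are assumptions for illustration; in the question, `word_tokenize` and `corpus.stopwords.words('italian')` would take their place. Lowercasing happens before the stopword check here so that capitalized stopwords are caught too, which differs from the order in the question.

```python
import io

def normalize(raw_text, stopwords, tokenize=str.split):
    """Lowercase tokens, dropping stopwords and non-alphabetic tokens."""
    norm_tokens = []
    for token in tokenize(raw_text):
        token = token.lower()  # stays a str; no .encode() needed in Python 3
        if token.isalpha() and token not in stopwords:
            norm_tokens.append(token)
    return norm_tokens

def read_and_normalize(file_path, stopwords, tokenize=str.split):
    # Decode while reading instead of relying on the platform default encoding.
    with io.open(file_path, 'r', encoding='utf8') as fin:
        return normalize(fin.read(), stopwords, tokenize)

sample_stopwords = {'la', 'di', 'doc', 'https'}  # illustrative, not NLTK's list
tokens = normalize('La citt\u00e0 di Roma 2016', sample_stopwords)
print([ascii(t) for t in tokens])
```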

As good practice, when dealing with text data, use unicode strings and Python 3 whenever possible.

Additionally, if you haven't installed this module for printing UTF-8 on the Windows console, you should try it:

pip install win-unicode-console

Or download this: https://pypi.python.org/packages/source/w/win_unicode_console/win_unicode_console-0.4.zip and then run `python setup.py install`.
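
If installing a package is not an option, a fallback is to replace the characters the console cannot show instead of letting `print` raise `UnicodeEncodeError`. This is only a minimal sketch, not what `win-unicode-console` does internally; `console_safe` is a hypothetical helper name, and any character the encoding lacks is lost (it comes out as `?`).

```python
import sys

def console_safe(text, encoding=None):
    """Replace characters that the target encoding cannot represent with '?'."""
    # sys.stdout.encoding can be None when output is not a real console.
    encoding = encoding or sys.stdout.encoding or 'ascii'
    return text.encode(encoding, errors='replace').decode(encoding)

# cp866 is the console code page from the traceback; it has no en dash (U+2013).
line = 'Armonium \u2013 citt\u00e0'
print(ascii(console_safe(line, 'cp866')))
```

In the reading loop, this would be used as `print(console_safe(line))`, which picks up the console's own encoding.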

alvas
  • Out of curiosity, why do you think that the encoding is latin1? Especially when it looks like a wikipedia file, utf8 would have been the first guess. – alvas Jan 04 '16 at 16:55
  • Or I get the error: Traceback (most recent call last): File "<stdin>", line 1, in <module> File "C:\Python35-32\lib\encodings\cp866.py", line 19, in encode return codecs.charmap_encode(input,self.errors,encoding_map)[0] UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in position 90: character maps to <undefined> – GiveItAwayNow Jan 04 '16 at 17:01
  • Can you post your full code in the question? I'm not exactly sure what caused the unicode mapping to fail. Did you try with the `request` url too? Did that work fine? – alvas Jan 04 '16 at 17:04
  • Did the error happen in the printing or at the for loop? If it's at the printing, then it's most probably because you can't print utf8 properly on your machine. – alvas Jan 04 '16 at 17:08
  • Yes, I tried with the request and I got the UnicodeEncodeError. alvas, what do you mean by full code? The full module is in the question. – GiveItAwayNow Jan 04 '16 at 17:10
  • Please post the edited one with `utf8` and how you're reading the file. – alvas Jan 04 '16 at 17:11