I have a text file full of non-ASCII characters. I cannot detect its encoding with either file or enca.

file non_ascii.txt
non_ascii.txt: Non-ISO extended-ASCII text

enca non_ascii.txt
Unrecognized encoding

But I can open it normally in Notepad++ on Windows.

Edit: The statement above was misleading; sorry about that. In fact, I picked some parts of the original file, put them into a new text file, and then opened that file in Notepad++.

The two parts are shown below. Notepad++ decodes them in two different ways.

[image]

[image]

Question:

  1. How can I detect the file's encoding under Linux?
  2. How do I recover the characters represented by <F1><EE><E9><E4><FF>? I get no result from "grep 'сойдя' win.txt", even though "сойдя" is encoded as <F1><EE><E9><E4><FF>.

A slice of the file content follows:

less non_ascii.txt
"non_ascii.txt" may be a binary file.  See it anyway?
<F1><EE><E9><E4><FF>
<F2><F0><E0><EA><F2><EE><E2><E0><F2><FC><F1><FF>
<D0><F2><E9><E4><D7><E9><E7><E1><EC><E1><F3><F8>
<D1><E5><EA><F3><ED><E4>
<F0><E0><E7><E3><F0><F3><E7><EA><E8>
<EF><EE><E4><F1><F2><E0><E2><EB><FF><F2><FC>
<F0><E0><E7><E3><F0><F3><E7><EA><E5>
<F1><EE><E9><E4><F3>
<F0><E0><E7><E3><F0><F3><E7><EA><E0>
<F1><EE><E2><EB><E0><E4><E0><EB><E8>
<C1><D7><E9><E1><F0><EF><FE><F4><E1>
<CB><C1><D3><D3><C9><D4><C5><D2><C9><D4>
<F1><EE><E2><EB><E0><E4><E0><EB><EE>
<F1><EE><E9><E4><E8>
<F1><EE><E2><EB><E0><E4><E0><EB><E0>
user1744585
  • What does `notepad++` think the encoding is? It should say that somewhere in the status bar. – nneonneo Nov 06 '15 at 01:48
  • I took 2 snippets from the file. They show "Windows-1251" and "ANSI". There may be other encodings in other parts of the file. So is there a way to convert the mixed-encoding content into UTF-8? – user1744585 Nov 06 '15 at 04:51
  • Your file contains parts encoded in different ways? – nneonneo Nov 06 '15 at 05:30
  • I got this file's content from a variety of sources: a Python script reads lines from multiple files and finally writes them into one file. – user1744585 Nov 06 '15 at 07:23
  • The two samples don't match (perhaps they are from different parts of the same file). If you want people to guess which Cyrillic encoding was used, you will have to post side-by-side examples of the same text. – Thomas Dickey Nov 06 '15 at 09:26
  • You cannot concatenate files in different encodings and then mechanically transform the resulting mess into something that makes sense. – n. 'pronouns' m. Nov 06 '15 at 10:18
  • As per my answer, there doesn't actually seem to be multiple encodings in the file. Notepad++ would display bogus data if it thought it was CP1251 and some parts were in some other encoding. – tripleee Nov 06 '15 at 10:21
  • @tripleee I agree with you. I understand now that the best practice is to convert to UTF-8 while importing the original source files. – user1744585 Nov 10 '15 at 13:11
  • If you really have mixed encodings, perhaps see also https://stackoverflow.com/questions/48257946/read-files-with-different-encoding-format-using-sys-stdin-in-python3 – tripleee Jan 26 '21 at 14:18

1 Answer

Your question really has two parts: (1) how do I identify an unknown encoding and (2) how do I convert that to something useful?

The first part is the real challenge, and really cannot be answered in universal terms -- in the general case, there is no reliable way to identify an unknown 8-bit encoding. Some encodings give you good hints (UTF-8 is an excellent example) and in many cases, if you have a good idea what the text is supposed to represent, the problem can be solved.
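To illustrate why UTF-8 gives such good hints: its multi-byte sequences follow a strict pattern, so text in a legacy 8-bit encoding almost never decodes cleanly as UTF-8. The following is a minimal sketch of that check, not a full detector. The helper name `looks_like_utf8` is made up for this example, and the sample bytes are the first line of the hex dump in the question.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as strict UTF-8."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


# First line of the hex dump from the question: <F1><EE><E9><E4><FF>
legacy_sample = bytes.fromhex("f1eee9e4ff")
# The same word, deliberately encoded as UTF-8 for comparison
utf8_sample = "сойдя".encode("utf-8")

legacy_ok = looks_like_utf8(legacy_sample)
utf8_ok = looks_like_utf8(utf8_sample)
print(legacy_ok, utf8_ok)
```

A clean decode does not prove that the data is UTF-8 (pure ASCII passes too), but a failed decode is a strong sign that it is something else.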

A mapping of 8-bit character meanings can be helpful (cough, the link is to mine) and in this case quickly hints at Windows code page 1251. Kudos for the hex dumps and the picture with the representation you expect!
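The same idea as such a mapping can be sketched in a few lines of Python: decode a sample under several candidate encodings and let a human pick the reading that makes sense. The candidate list and the helper name `decode_candidates` are assumptions chosen for this example (a handful of common Cyrillic code pages), and the two samples are copied from the hex dump in the question.

```python
# Candidate Cyrillic encodings to try; an assumption, extend as needed.
CANDIDATES = ["cp1251", "koi8_r", "cp866", "iso8859_5"]


def decode_candidates(data: bytes, encodings=CANDIDATES) -> dict:
    """Decode the same bytes under each candidate encoding."""
    results = {}
    for enc in encodings:
        # errors="replace" keeps going if a byte is undefined in a code page
        results[enc] = data.decode(enc, errors="replace")
    return results


# <F1><EE><E9><E4><FF> from the question
first = decode_candidates(bytes.fromhex("f1eee9e4ff"))
# <CB><C1><D3><D3><C9><D4><C5><D2><C9><D4> from the question
other = decode_candidates(bytes.fromhex("cbc1d3d3c9d4c5d2c9d4"))

for name, table in (("first", first), ("other", other)):
    for enc, text in table.items():
        print(name, enc, text)
```

The first sample reads as "сойдя" under `cp1251`, which matches what the question expects. In this sketch the second sample reads as a Russian word under `koi8_r` rather than `cp1251`, so with a file merged from several sources it may be worth checking individual lines rather than only the file as a whole.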

With that out of the way, converting is easy.

iconv -f cp1251 -t utf-8 non_ascii.txt >utf8.txt

Provided your Linux system is set up to use UTF-8 at the terminal, your grep command should work on utf8.txt now.
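If you would rather do the conversion from Python, since that is where the file was assembled, a rough equivalent of the iconv command might look like the sketch below. The function name `convert_to_utf8` is hypothetical, and the demo writes a tiny stand-in file into a temporary directory instead of touching your real non_ascii.txt.

```python
import tempfile
from pathlib import Path


def convert_to_utf8(src, dst, src_encoding="cp1251"):
    """Read src in src_encoding and write the same text to dst as UTF-8."""
    text = Path(src).read_bytes().decode(src_encoding)
    # Write bytes directly so line endings are not translated
    Path(dst).write_bytes(text.encode("utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "non_ascii.txt"
    dst = Path(tmp) / "utf8.txt"
    # Stand-in content: the first line of the hex dump plus a newline
    src.write_bytes(bytes.fromhex("f1eee9e4ff0a"))
    convert_to_utf8(src, dst)
    # Equivalent of grepping the converted file for the word
    found = "сойдя" in dst.read_text(encoding="utf-8")
    print(found)
```

Decoding is strict here on purpose: if a byte is not valid in the assumed source encoding, you get an exception rather than silently damaged output.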

The indication that some of the text is "ANSI" (which is a bogus term anyway) is probably just a red herring -- as far as I can tell, everything in your excerpt looks like well-formed CP1251.

Some tools like chardet do a reasonable job of at least steering you in the right direction, though you have to understand that, like a human expert, they have to guess what the text is supposed to represent. There are corner cases where they just don't have enough information to guess correctly, either because there are several candidate encodings with very few differences (for example, Latin-1 vs Latin-9 vs Windows-1252, all of which also overlap with plain 7-bit US-ASCII in the first 128 positions) or because the input doesn't contain enough information to establish any common patterns.
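To make the "very few differences" point concrete, here is a small sketch that lists the byte values at which two single-byte encodings disagree. It uses only Python's built-in codecs; the helper name `differing_bytes` is made up for this example.

```python
def differing_bytes(enc_a: str, enc_b: str) -> list:
    """Return the byte values that decode differently in the two encodings."""
    diffs = []
    for value in range(256):
        raw = bytes([value])
        # errors="replace" turns undefined positions into U+FFFD
        a = raw.decode(enc_a, errors="replace")
        b = raw.decode(enc_b, errors="replace")
        if a != b:
            diffs.append(value)
    return diffs


latin1_vs_latin9 = differing_bytes("latin_1", "iso8859_15")
latin1_vs_cp1252 = differing_bytes("latin_1", "cp1252")

print([hex(v) for v in latin1_vs_latin9])
print([hex(v) for v in latin1_vs_cp1252])
```

Latin-1 and Latin-9 disagree at only a handful of positions, and Latin-1 and Windows-1252 disagree only in the 0x80 to 0x9F range, so a text that does not happen to use those bytes gives a detector nothing to go on.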

tripleee
  • See also http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html – tripleee Nov 06 '15 at 22:06
  • More on how ANSI as a character-set identifier is not well-defined: https://stackoverflow.com/questions/701882/what-is-ansi-format – tripleee Jun 01 '18 at 06:20
  • The link to your (cough) mapping of 8-bit character meanings page is broken ;) – PavoDive Jan 28 '20 at 11:43
  • @PavoDive Thanks for the ping; updated. I thought I had hunted down the expired links to rawgit but apparently I missed some. – tripleee Jan 28 '20 at 12:34
  • @Azeem Thanks, fixed! Incredible that this typo went unnoticed for almost 5 years. – tripleee Sep 30 '20 at 06:47
  • @tripleee: You're welcome! :) Yes, it's strange that even after 10k+ views nobody noticed that. It's fixed now. Cheers! – Azeem Sep 30 '20 at 06:54