
I was writing a program for static cryptanalysis when I ran into some unusual behavior.

First, I wrote a character counter, and that's where the problem appeared.

I've got the file:

//alphabet.txt
abcdefghijklmnopqrstuvwxyz

When I tried to count letter frequencies, I got an interesting result. (Ignore the first line of the output; it just shows that the ifstream is open.)

D:\Workspaces\EclipseWS\StaticAnalysis\src>g++ main.cpp

D:\Workspaces\EclipseWS\StaticAnalysis\src>a.exe
1!
a       1
b       1
c       1
d       1
e       1
f       1
g       1
h       1
i       1
j       1
k       1
l       1
m       1
n       1
o       1
p       1
q       1
r       1
s       1
t       1
u       1
v       1
w       1
x       1
y       1
z       2

As you can see, the program says that 'z' appears 2 times in the text, but it doesn't.

Now, the technical details.

Operating System: Windows 10 Enterprise LTSC
System Type: 64bit
C++ compiler: MinGW
alphabet.txt File encoding: ANSI

The code I parsed the file with:

#include <fstream>
#include <iostream>
#include <map>
#include <string>

using namespace std;

#define encrypted_fname "D:\\Workspaces\\EclipseWS\\StaticAnalysis\\src\\alphabet.txt"

void printMap(const map<char,int>& m){
    for (const auto& p : m){
        cout<<p.first<<'\t'<<p.second<<endl;
    }
}

int main(){
    ifstream ifs(encrypted_fname);
    cout << ifs.is_open() << "!\n";
    map<char,int> letterCount;

    char buffer;

    while(ifs){
        ifs >> buffer;
        letterCount[buffer]+=1;
    }

    printMap(letterCount);

}

I tried changing the file encoding to UTF-8:

1!
»   1
¿   1
ï   1
a   1
b   1
c   1
d   1
e   1
f   1
g   1
h   1
i   1
j   1
k   1
l   1
m   1
n   1
o   1
p   1
q   1
r   1
s   1
t   1
u   1
v   1
w   1
x   1
y   1
z   2

...and to Unicode big endian / Unicode:

1!
þ   1
ÿ   1
    28
a   1
b   1
c   1
d   1
e   1
f   1
g   1
h   1
i   1
j   1
k   1
l   1
m   1
n   1
o   1
p   1
q   1
r   1
s   1
t   1
u   1
v   1
w   1
x   1
y   1
z   1

But as you can see, the output is a bit off in every case!

I can provide more information if you need it; just tell me how to get it.

My main questions are: why does this happen? Do I have a mistake in the code? How do I fix it?

gdl68
  • Side note: `char buffer` is signed (negative for any character above 127). Watch out when you use it as index to an array. – goodvibration Oct 27 '19 at 05:42
  • And I believe that `ifstream` will open your file for read as ascii (1 byte per character), no matter what encoding your file is set to. It's just that if you change your file encoding, then some additional information will be added (implicitly) into that file, which will then appear to you as "weird characters" when you read it with `ifstream`. – goodvibration Oct 27 '19 at 05:44
  • Your problem is related to the fact, I think, that `while(ifs)` is true when `EOF` is reached, but then `ifs >> buffer` doesn't actually load anything into `buffer`, which remains holding its last value (which happens to be `z`). See [this answer](https://stackoverflow.com/a/5605159/7400903) for more details. – goodvibration Oct 27 '19 at 05:48
  • ^^^ iow, change your while-condition to be `while (ifs >> buffer)` and remove the `ifs >> buffer` from within the loop itself. – WhozCraig Oct 27 '19 at 06:15
  • @goodvibration "*`char buffer` is signed*" - that is dependent on compiler implementation. `char` MAY be signed OR unsigned. – Remy Lebeau Oct 30 '19 at 20:40
  • @RemyLebeau: I'm pretty sure that `char` is synonymous to `signed char` by the C-language standard (while `unsigned char` must be stated explicitly). Though I admit I haven't looked into the standard in order to verify this. – goodvibration Oct 30 '19 at 20:44
  • @goodvibration [Is char signed or unsigned by default?](https://stackoverflow.com/questions/2054939/) Neither C nor C++ standards define whether plain `char` is signed or unsigned, that is left up to the compiler implementation to decide. – Remy Lebeau Oct 30 '19 at 20:53

1 Answer

    while (true) {
        ifs >> buffer;
        if (!ifs) break;
        letterCount[buffer] += 1;
    }

should do it. `ifs` won't be false if the last read was successful; it only evaluates to false after an attempt to read beyond the end of the file. At that point the failed read leaves `buffer` unchanged, so "z", being the last letter, is not overwritten by anything and is counted twice.
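The same fix can also be written with the extraction as the loop condition, as suggested in the comments. This is only a minimal sketch: the function name `countChars` is made up for illustration, and it takes any `std::istream` so it can be tried on a string as well as on a file.

```cpp
#include <istream>
#include <map>
#include <sstream>

// Hypothetical helper: count every non-whitespace character in a stream.
// The extraction is the loop condition, so the body runs only after a
// successful read and the last character is never counted twice.
std::map<char, int> countChars(std::istream& in) {
    std::map<char, int> counts;
    char c;
    while (in >> c) {
        ++counts[c];
    }
    return counts;
}
```

Passing the question's `ifs` to `countChars` should give a count of 1 for 'z' on the ANSI file, assuming the file holds just the alphabet line.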

This syntax for reading files was an attempt, at the time it was created, to provide a "nice interface" for text reading. I'd say it can't beat a plain byte read such as `ifs.read(&buffer, 1)` on a file.
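For comparison, here is a rough sketch of a byte-at-a-time read using the unformatted `get` member instead of `>>`. The name `countBytes` is an assumption for illustration. Unlike `>>`, `get` does not skip whitespace, so spaces and newlines are counted too.

```cpp
#include <istream>
#include <map>
#include <sstream>

// Hypothetical helper: count every byte in a stream, whitespace included.
// get() returns the stream, which converts to false once a read fails.
std::map<char, int> countBytes(std::istream& in) {
    std::map<char, int> counts;
    char c;
    while (in.get(c)) {
        ++counts[c];
    }
    return counts;
}
```

On Windows, a file stream opened in text mode translates line endings, so opening with `std::ios::binary` is probably what you want if the goal is to see the raw bytes.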

Your other strange results are due to the fact that you are always reading raw bytes from the file and never decoding those bytes into text. As a result, the bytes of the BOM, and the extra byte per character in the 2-byte encoding you are calling "unicode big endian" (the correct name would be "utf-16 big endian"), are all counted as if they were characters.
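For the UTF-8 case, one possible workaround is to skip the three BOM bytes (EF BB BF) before counting. The sketch below assumes the text after the BOM is plain ASCII; `skipUtf8Bom` and `countAfterBom` are hypothetical names, and this does not decode UTF-16 at all.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: consume a UTF-8 byte-order mark if the stream
// starts with one. Returns true if a BOM was skipped; otherwise the
// stream is rewound to where it started.
bool skipUtf8Bom(std::istream& in) {
    const std::streampos start = in.tellg();
    char bom[3] = {};
    in.read(bom, 3);
    if (in.gcount() == 3 &&
        static_cast<unsigned char>(bom[0]) == 0xEF &&
        static_cast<unsigned char>(bom[1]) == 0xBB &&
        static_cast<unsigned char>(bom[2]) == 0xBF) {
        return true;
    }
    in.clear();       // a short read sets failbit/eofbit
    in.seekg(start);  // no BOM: go back to the beginning
    return false;
}

// Count non-whitespace characters, ignoring a leading UTF-8 BOM.
std::map<char, int> countAfterBom(std::istream& in) {
    skipUtf8Bom(in);
    std::map<char, int> counts;
    char c;
    while (in >> c) {
        ++counts[c];
    }
    return counts;
}
```

For a real file, opening the `ifstream` with `std::ios::binary` keeps the seek back to the start predictable. UTF-16 input would still need actual decoding (two bytes per code unit) before counting.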

jsbueno