Alphabet occurrence in a text file

Question

I am a beginner programmer, as the code probably suggests, and I am currently writing a program that counts how often each letter of the alphabet occurs in a text file. So far the code only counts the number of A's in the text file, and I still need to count the frequency of the other 25 letters. Without using any fancy methods, is there an easy way to automate this instead of repeating the block of code for each letter?

#include <fstream> 
#include <iostream> 
#include <string>
using namespace std; 


    while (!file.eof()) 
    {
        
    cout << " letter     Frequency"<< endl;
    for (char c = 'A'; c <= 'Z'; ++c)
    {
        cout << "    " << c << "    :      " << Counts[c - 'A'] << endl;
    }
    
    // Return code

    return 0;
}

Using a hash map, you could solve this in linear time: loop through the array once, storing a number -> number of occurrences entry for each number in the array. – Andri Nic, Mar 31 '21 at 12:16
Well, the code that you showed demonstrates that you are already familiar with basic concepts like arrays and loops. Shouldn't it be obvious that instead of one counter you simply have an array of 26 counters? You initialize them all to 0, using a simple loop, then read the file one character at a time and after checking that each character is a letter, you simply increment the corresponding counter? — Sam Varshavchik, Mar 31 '21 at 12:30
You seem to have most of the correct bits and pieces, but put together strangely - it looks like you added the array as an afterthought, as if someone said "store the occurrences of each character in an array and then print them" but you heard "store the characters in an array and then print them". — molbdnilo, Mar 31 '21 at 12:32
On an unrelated bug: [Why is iostream::eof inside a loop condition (i.e. `while (!stream.eof())`) considered wrong?](https://stackoverflow.com/questions/5605125/why-is-iostreameof-inside-a-loop-condition-i-e-while-stream-eof-cons). — molbdnilo, Mar 31 '21 at 12:34
Use an array of 256 ints: `int counters[256] = {0};` and in the loop where you read the file just do `counters[ch]++;` where `ch` is the char you read from the file and you're done. — Jabberwocky, Mar 31 '21 at 12:36
I understand the concepts and what I'm supposed to do in theory; however, I'm struggling to turn it into a working piece of code. – Riko Tri, Mar 31 '21 at 12:45

score 1 · Answer 1 · answered Mar 31 '21 at 12:30

You can use std::map: the keys will be chars and the values will be their counts.
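A minimal sketch of the std::map idea, assuming the characters come from any `std::istream` (for example an already opened `std::ifstream`). The names `count_letters` and `count_letters_in_text` are hypothetical helpers introduced for illustration; they are not part of the question's code.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper (not from the question): counts every alphabetic
// character read from `in`, keyed by the character itself.
std::map<char, int> count_letters(std::istream& in)
{
    std::map<char, int> counts;
    char ch;
    while (in.get(ch))  // the loop ends as soon as a read fails
    {
        if ((ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z'))
        {
            ++counts[ch];  // operator[] inserts a zero count on first use
        }
    }
    return counts;
}

// Convenience wrapper for trying the helper on an in-memory string.
std::map<char, int> count_letters_in_text(const std::string& text)
{
    std::istringstream in(text);
    return count_letters(in);
}
```

Letters that never occur simply have no entry in the map, so printing only needs a loop over the map's key/value pairs. With only 26 possible keys, a plain array would do the same job; the map mainly saves the index arithmetic.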

Tuna

Joseph Willcoxson · Answer 2 · 2021-03-31T21:13:00.587

1

Well, if you look at an ASCII table, you will see that 'A' to 'Z' and 'a' to 'z' are each in sequence: 'A' = 65, 'Z' = 90, 'a' = 97, 'z' = 122. The tricky part is that the two ranges are not adjacent: 'a' sits 7 positions after 'Z', with six other characters in between.

You can create an array for lower and upper case like this:

int lowerCounts[26] = {0}; // zero-initializes all 26 elements, so no memset is needed
int upperCounts[26] = {0};

Then scan the file and, for each character 'ch' you read:

if (ch >= 'A' && ch <= 'Z')
   ++upperCounts[ch - 'A'];
else if (ch >= 'a' && ch <= 'z')
   ++lowerCounts[ch - 'a'];

If the count should be case insensitive, meaning 'a' and 'A' are treated as the same letter, then use upperCounts for both.

Change the above if statement to:

if (ch >= 'A' && ch <= 'Z')
   ++upperCounts[ch - 'A'];
else if (ch >= 'a' && ch <= 'z')
   ++upperCounts[ch - 'a']; // using upperCounts array instead of lowerCounts

And of course, you could then get rid of all references to lowerCounts.

To print the counts, do something like:

for (char c = 'A'; c <= 'Z'; ++c)
{
   cout << c << " count = " << upperCounts[c - 'A'] << endl;
}

You could use a vector or a map instead, but I think this type of solution is more appropriate to your current understanding and skill level, since you're just learning.

edited Mar 31 '21 at 21:13

answered Mar 31 '21 at 14:26

Hello, I've tried to implement this: `int lowerCounts[26] = { 0 }; int upperCounts[26] = { 0 }; int addCounts[26] = { 0 }; while (!file.eof()) { file.get(ch); if (ch >= 'A' && ch <= 'Z') ++upperCounts[ch - 'A']; else if (ch >= 'a' && ch <= 'z') ++lowerCounts[ch - 'a']; } for (char c = 'A'; c <= 'Z'; ++c) { cout << c << " count = " << upperCounts[c - 'A'] << endl; } for (char c = 'a'; c <= 'z'; ++c) { cout << c << " count = " << lowerCounts[c - 'a'] << endl; }` However, I'm struggling to add the two counts into one output. What do you suggest? – Riko Tri Mar 31 '21 at 20:41
Do you want 'A' and 'a' to be counted together, i.e. case insensitive? – Joseph Willcoxson Mar 31 '21 at 21:08
See my edit. If it's case insensitive, you'd make the change I showed... and instead of calling it upperCounts, you'd probably go with letterCounts as the variable name or something like that. – Joseph Willcoxson Mar 31 '21 at 21:14
Thank you very much, that's very helpful! I'm running into one more problem: whichever character comes last in the text file gets one extra count. Do you know what could be causing that? Again, I appreciate your help. EDIT: I have updated my code to what I have currently. – Riko Tri Mar 31 '21 at 21:37
You need to step through with a debugger and see whether you are reading the last character twice or something similar. Maybe set ch to zero before each read, just in case. But you do need to learn how to step through code in the debugger. The standard hotkey in Visual C++ for stepping line by line is F10. Look at https://docs.microsoft.com/en-us/visualstudio/debugger/getting-started-with-the-debugger-cpp?view=vs-2019 . Search YouTube for "how to use visual c++ debugger". – Joseph Willcoxson Mar 31 '21 at 22:01
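A likely cause of the extra count is the `while (!file.eof())` loop from the comment above: the end-of-file flag is only set after a read has failed, so the final `file.get(ch)` fails, leaves `ch` holding the previous character, and that character is counted a second time. The sketch below tests the read itself instead. The function names are hypothetical, and it assumes the case-insensitive counting discussed in this thread.

```cpp
#include <array>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: case-insensitive letter counts, index 0 = 'A'/'a'.
std::array<int, 26> count_letters_ignoring_case(std::istream& in)
{
    std::array<int, 26> letterCounts{};  // value-initialized: all zeros
    char ch;
    // The read is the loop condition, so the body only runs for
    // characters that were actually extracted from the stream.
    while (in.get(ch))
    {
        if (ch >= 'A' && ch <= 'Z')
            ++letterCounts[ch - 'A'];
        else if (ch >= 'a' && ch <= 'z')
            ++letterCounts[ch - 'a'];
    }
    return letterCounts;
}

// Convenience wrapper for trying the helper on an in-memory string.
std::array<int, 26> count_text_ignoring_case(const std::string& text)
{
    std::istringstream in(text);
    return count_letters_ignoring_case(in);
}
```

Passing an opened `std::ifstream` to `count_letters_ignoring_case` works the same way, since `std::ifstream` is a `std::istream`.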

Thomas Matthews · Answer 3 · 2021-03-31T17:55:27.070

This assignment is a good exercise in optimizing I/O.
The file will be read into a block of memory, a.k.a. a buffer.

Let's use an array for the frequency counting, since indexing an array by character is a constant-time lookup.

#include <fstream>
#include <iostream>

// Declare the size of the buffer.
static const unsigned int BUFFER_SIZE = 1024*1024;

int main()
{
    // Declare the buffer as "static" so it lives in static storage, not on the stack.
    static char buffer[BUFFER_SIZE];

    // One counter per letter, all initialized to zero.
    unsigned int upperCounts[26] = {0};
    unsigned int lowerCounts[26] = {0};

    /* Use the same file opening as in your original code. */

    // read() reports failure on the final, partially filled block, so
    // gcount() is checked as well to process the characters it did deliver.
    while (file.read(buffer, BUFFER_SIZE) || file.gcount() > 0)
    {
        const std::streamsize characters_read = file.gcount();
        for (std::streamsize i = 0; i < characters_read; ++i)
        {
            const char ch = buffer[i];
            if (ch >= 'A' && ch <= 'Z')
            {
                ++upperCounts[ch - 'A'];
            }
            else if (ch >= 'a' && ch <= 'z')
            {
                ++lowerCounts[ch - 'a'];
            }
        }
    }
    /* Insert code to print frequencies */
    return 0;  // Indicate success to the operating system.
}

In the above code, a block of characters is read into memory using the read() method. Reading in blocks is generally faster than reading one character at a time. Although the C++ streaming facilities may buffer the input already, we're taking control so we can set the buffer size.

The buffer is then scanned for alphabetic characters and the frequency counts are updated. Scanning memory is generally much faster than going back to the file for every character.
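One detail of block reading is easy to get wrong: `read()` reports failure when it hits end of file before filling the buffer, even though it did deliver some characters. The sketch below, using hypothetical function names and a caller-chosen buffer size, checks `gcount()` as well so the final partial block is still counted. It counts upper and lower case together to keep the example short.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: counts 'A'-'Z' and 'a'-'z' together, reading `in`
// in blocks of `bufferSize` bytes (assumed to be greater than zero).
std::vector<unsigned int> count_letters_blockwise(std::istream& in, std::size_t bufferSize)
{
    std::vector<unsigned int> letterCounts(26, 0);
    std::vector<char> buffer(bufferSize);
    const std::streamsize blockSize = static_cast<std::streamsize>(buffer.size());

    // read() fails on the last, partially filled block; gcount() > 0
    // keeps the loop alive for one more pass over those characters.
    while (in.read(buffer.data(), blockSize) || in.gcount() > 0)
    {
        const std::size_t charactersRead = static_cast<std::size_t>(in.gcount());
        for (std::size_t i = 0; i < charactersRead; ++i)
        {
            const char ch = buffer[i];
            if (ch >= 'A' && ch <= 'Z')
                ++letterCounts[ch - 'A'];
            else if (ch >= 'a' && ch <= 'z')
                ++letterCounts[ch - 'a'];
        }
    }
    return letterCounts;
}

// Convenience wrapper for trying the helper on an in-memory string.
std::vector<unsigned int> count_text_blockwise(const std::string& text, std::size_t bufferSize)
{
    std::istringstream in(text);
    return count_letters_blockwise(in, bufferSize);
}
```

The result should not depend on the buffer size, which makes a small buffer (a few bytes) a convenient way to exercise both the full-block and the partial-block paths.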

Edit 1: Optimizing the calculation
In the code above and in the OP's code, much of the per-character work goes into the range comparisons that decide which counter to update.

We can save more time by moving the specialization to after the input and counting the frequency of all characters.

unsigned int frequencies[256] = {0}; // One counter per possible byte value.

// Also process the final, partially filled block (read() reports failure on it).
while (file.read(buffer, BUFFER_SIZE) || file.gcount() > 0)
{
    const std::streamsize characters_read = file.gcount();
    for (std::streamsize i = 0; i < characters_read; ++i)
    {
        // Index by the character's value, not by its position in the buffer.
        ++frequencies[static_cast<unsigned char>(buffer[i])];
    }
}

// Now print out the frequencies:
for (char ch = 'A'; ch <= 'Z'; ++ch)
{
    std::cout << ch << ": " << frequencies[ch] << "\n";
}
for (char ch = 'a'; ch <= 'z'; ++ch)
{
    std::cout << ch << ": " << frequencies[ch] << "\n";
}

In the above code, the input loop has been simplified to one purpose: calculating frequencies. No need to check for ranges; range checking is performed after the input.

After input, all the frequencies are output for the alphabetic characters, and only the alphabetic characters.

This example shows that the program can run faster by keeping the most frequently executed section general. The specialization or details are handled after, or outside, the high-performance section.
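As a sketch of that split, the two hypothetical functions below separate the general pass (tally every byte value) from the specialization (fold the table into 26 case-insensitive letter counts). It assumes an ASCII-compatible character set, where 'A' to 'Z' and 'a' to 'z' are contiguous, and takes the data as a `std::string` for simplicity.

```cpp
#include <array>
#include <string>

// Hypothetical helper: the general pass, tallying every byte value in `data`.
std::array<unsigned int, 256> count_all_bytes(const std::string& data)
{
    std::array<unsigned int, 256> frequencies{};  // all zeros
    for (const char ch : data)
    {
        // char may be signed, so convert to unsigned char before indexing.
        ++frequencies[static_cast<unsigned char>(ch)];
    }
    return frequencies;
}

// Hypothetical helper: the specialization, done once after the input pass.
// Index 0 holds the combined count for 'A' and 'a', index 25 for 'Z' and 'z'.
std::array<unsigned int, 26> fold_to_letters(const std::array<unsigned int, 256>& frequencies)
{
    std::array<unsigned int, 26> letterCounts{};
    for (int i = 0; i < 26; ++i)
    {
        letterCounts[i] = frequencies['A' + i] + frequencies['a' + i];
    }
    return letterCounts;
}
```

The fold touches only 52 table entries no matter how large the input was, so all case handling stays outside the per-character loop.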
