
My program counts the frequency of words in a text file. The file is partitioned into n parts, and n threads each count the word frequencies in one part. Assume the text file contains only letters and whitespace, and that uppercase letters are treated the same as lowercase letters.
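A minimal sketch of one way the part boundaries could be computed, assuming the whole text is already in memory. The names `split_on_whitespace` and `bounds` are hypothetical and not taken from the program described here. Each cut point starts at an even split and moves forward to just past the next whitespace, so no word is shared between two parts.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: fills bounds[0..n] so that part i covers
 * text[bounds[i]] .. text[bounds[i + 1] - 1].
 * Each interior boundary starts at i * len / n and moves forward until the
 * preceding character is whitespace, so no word is split between parts. */
static void split_on_whitespace(const char *text, size_t len, int n,
                                size_t *bounds)
{
    assert(n > 0);
    bounds[0] = 0;
    for (int i = 1; i < n; i++) {
        size_t pos = (size_t)i * len / (size_t)n;
        if (pos < bounds[i - 1])
            pos = bounds[i - 1];          /* keep boundaries non-decreasing */
        while (pos > 0 && pos < len &&
               !isspace((unsigned char)text[pos - 1]))
            pos++;                        /* skip to the end of the current word */
        bounds[i] = pos;
    }
    bounds[n] = len;
}
```

The caller supplies an array of `n + 1` entries; a part may come out empty if a single word spans more than one even slice.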

An example of the text file is

This is a text file which contains some numbers from one to ten some numbers are lower than the other numbers such as one is lower than two

My problem is that reading a part of the text file with fseek and fread doesn't work properly. I use start to indicate the position to start reading from and end to indicate the position to stop at. I have checked that start and end are right, but the strings I get back aren't correct.

For example,

From 0 to 48, the string is 'This is a text file which contains some numbers '

From 48 to 92, the string is 'from one to ten some numbers are lower than '

From 92 to 139, the string is 'This is a text file which contains some numbers'

Also, the variables that are returned from thread_exit() don't seem to be correct either. I checked, and the words and frequencies that I got from wordFreq were not the same as what I got from myfreqword, the variable returned from thread_exit().

For example, this is what I got from wordFreq in each thread:

a 1

contains 1

are 1

file 1

as 1

from 1

is 1

is 1

lower 1

numbers 1

lower 1

numbers 1

some 1

numbers 1

one 1

text 1

one 1

some 1

this 1

other 1

ten 1

which 1

such 1

than 1

than 1

to 1

And this is what I got after returning from thread_exit(). There are some weird strings here:

the 1

a 1

two 1

contains 1

numbers 1

some 1

which 1

are 1

from 1

lower 1

numbers 1

one 1

some 1

ten 1

than 1

to 1

is! 1

? 1

??? 1

tw 1

I have no idea what went wrong here.
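The code behind myfreqword is not shown, so the cause of the odd strings cannot be pinned down from the post alone. One frequent cause of this symptom is handing `pthread_exit()` a pointer to a local (stack) variable of the thread, which is no longer valid once the thread ends. The sketch below assumes that is what happened and returns heap memory instead. `WordCount`, `count_worker`, and `run_worker` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result type; the real struct behind myfreqword is not shown. */
typedef struct {
    char word[32];
    int  count;
} WordCount;

/* Thread body: copies its argument into heap memory and returns that.
 * Returning the address of a local WordCount instead would leave the
 * joining thread with a pointer into a stack that no longer exists. */
static void *count_worker(void *arg)
{
    const char *word = arg;
    WordCount *result = malloc(sizeof *result);
    if (result == NULL)
        pthread_exit(NULL);
    strncpy(result->word, word, sizeof result->word - 1);
    result->word[sizeof result->word - 1] = '\0';
    result->count = 1;
    pthread_exit(result);   /* heap pointer stays valid after the thread ends */
}

/* Runs one worker and hands its heap-allocated result to the caller,
 * who must free() it. Returns NULL on failure. */
static WordCount *run_worker(const char *word)
{
    pthread_t tid;
    void *ret = NULL;
    assert(word != NULL);
    if (pthread_create(&tid, NULL, count_worker, (void *)word) != 0)
        return NULL;
    if (pthread_join(tid, &ret) != 0)
        return NULL;
    return ret;
}
```

The joining thread receives the pointer through the second argument of `pthread_join()` and owns the memory from then on.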

user3398928
  • You really should have created an [MCVE](http://stackoverflow.com/help/mcve) -- I don't think all of the sorting and counting is relevant here. Does it work with a single thread? File reading may be buffered, and several threads may attempt to read the file "at once". Try locking the file before `fseek` and releasing it after your `fread` (a quick fix is using a single global variable). – Jongware Oct 12 '14 at 00:14
  • Your results look entirely consistent with what I would expect to happen if you had 3 threads all trying to seek on the same open file. – Andrew C Oct 12 '14 at 00:17
  • It works with a single thread. So it seems like I need to use pthread_mutex to lock and unlock. – user3398928 Oct 12 '14 at 00:55
  • The problem is probably *IO-bound*, i.e., unless the file is in cache, you won't see any performance improvement from using multiple CPUs to count words in it. For example, [the solution for a computationally similar problem (sum all integers in a file) **does not** benefit from multiple threads if the input file is not in OS cache](http://stackoverflow.com/questions/25606833/fastest-way-to-sum-integers-in-text-file#comment40064167_25606833) – jfs Oct 12 '14 at 02:58
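The locking suggested in the comments could look roughly like the sketch below. It assumes all threads share one `FILE *` and that the seek and the read must happen as a single step. `read_range` and `file_lock` are hypothetical names, and the original reading code is not shown, so this is an illustration and not a drop-in fix.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* One lock guarding the position of the shared stream. */
static pthread_mutex_t file_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical helper: reads up to `count` bytes starting at byte `start`
 * of the shared stream `fp` into `buf`, NUL-terminates the result, and
 * returns the number of bytes read. The seek and the read are done under
 * one lock so another thread cannot move the file position between them. */
static size_t read_range(FILE *fp, long start, size_t count,
                         char *buf, size_t bufsize)
{
    size_t got = 0;
    assert(bufsize > 0);
    if (count > bufsize - 1)
        count = bufsize - 1;              /* leave room for the terminator */
    pthread_mutex_lock(&file_lock);
    if (fseek(fp, start, SEEK_SET) == 0)
        got = fread(buf, 1, count, fp);
    pthread_mutex_unlock(&file_lock);
    buf[got] = '\0';
    return got;
}
```

An alternative that avoids the lock is to have each thread call `fopen()` on the file itself, since every stream opened that way keeps its own file position.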
