My input file is 2 GB, with one word per line. I need to write a program that counts how often each word occurs. I implemented the same task in Java and in C++, and the result is surprising: the C++ version is much slower! My code is as follows:

C++:

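#include <cstdio>
#include <ctime>
#include <fstream>
#include <map>
#include <string>

using namespace std;

// Assumed definition: the original post uses NANO without showing it.
const double NANO = 1e9;  // nanoseconds per second

int main() {

    struct timespec ts, te;
    double cost;
    clock_gettime(CLOCK_REALTIME, &ts);

    // Renamed from "map" so the variable no longer hides std::map.
    map<string, int> counts;
    ifstream fin("inputfile.txt");
    string word;
    while(getline(fin, word)) {
        ++counts[word];  // one ordered-tree lookup per input line
    }

    clock_gettime(CLOCK_REALTIME, &te);
    cost = te.tv_sec - ts.tv_sec + (double)(te.tv_nsec-ts.tv_nsec)/NANO;
    printf("cost: %-15.10f s\n", cost);

    return 0;
}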

Output: cost: 257.62 s

Java:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;

// Class name assumed; the original post shows only the main method.
public class WordCount {

    public static void main(String[] args) throws Exception {

        long startTime = System.currentTimeMillis();
        Map<String, Integer> map = new HashMap<String, Integer>();
        FileReader reader = new FileReader("inputfile.txt");
        BufferedReader br = new BufferedReader(reader);

        String str = null;
        while((str = br.readLine()) != null) {
            Integer count = map.get(str);
            map.put(str, count == null ? 1 : count + 1);
        }
        br.close();

        long endTime = System.currentTimeMillis();
        System.out.println("cost : " + (endTime - startTime)/1000 + "s");
    }
}

Output: cost: 124 s

If I delete the code inside the while loop, so that each program only reads the file and does nothing else, the times are similar: Java takes 32 s and C++ takes 38 s. I can accept that gap. My environment is Ubuntu Linux 13.04, and the C++ code is compiled with -O2 optimization. Why does the STL perform so poorly?

Jason Aller
dodolong
  • This might help you, your c++ implementation is bad, it is a broken pattern `while(!EOF){...}` http://stackoverflow.com/questions/8736862/c-whats-the-most-efficient-way-to-read-a-file-into-a-stdstring – Bogdan M. Apr 09 '14 at 09:12
  • Note that you are comparing mushrooms and apples again. Java is doing many things completely differently here. E.g. in Java strings are immutable and just references, in C++ they are copied into the tree. – PlasmaHH Apr 09 '14 at 09:13
  • While Björn's mentioned some important things already, there are many other possible factors - e.g. if you've just read the file off magnetic disk with C++, the content may still be in memory cache for Java, and `FileReader` and/or `BufferedReader` might use a background thread, larger buffers, memory mapped or other higher-efficiency I/O mechanisms to reduce the number of `read` calls to the OS. All these techniques are possible in C++ - you just have to choose the right mix of performance, simplicity, portability etc. – Tony Delroy Apr 09 '14 at 09:36

1 Answer

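The C++ std::map is an ordered data structure, usually implemented as a balanced tree, so each lookup or insertion costs O(log n) string comparisons. java.util.HashMap is a hash table with O(1) average-case lookups and insertions. A fairer comparison would be between java.util.HashMap and std::unordered_map, or between java.util.TreeMap and std::map.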

Björn Pollex