I'm not sure why string searching is so much faster in a program I wrote in Python than in one I wrote in C++. Is there a trick I'm missing?

Generating the test file

This example has a single line of interest; in the real use case I care about multiple lines.

#include <fstream>   // std::ofstream

using namespace std;
int main(){
   ofstream testfile;
   unsigned int line_idx = 0;
   testfile.open("testfile.txt");
   for(line_idx = 0; line_idx < 50000u; line_idx++)
   {
      if(line_idx != 43268u )
      {
        testfile << line_idx << " dontcare" << std::endl;
      }
      else
      {
        testfile << line_idx << " care" << std::endl;
      }
   }
   testfile.close();
}

The regular expression used is `^(\d*)\s(care)$`
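To make the pattern concrete, here is a minimal Python sketch of what it captures. The three-line `sample` string is made up for illustration and only mimics the format of the test file; `re.MULTILINE` is used so that `^` and `$` anchor at each line rather than only at the ends of the whole string.

```python
import re

# Same pattern as in the question: digits, one whitespace character, then "care".
ptrn = re.compile(r'^(\d*)\s(care)$', re.MULTILINE)

# Hypothetical sample in the same format as testfile.txt (not the real file).
sample = "43267 dontcare\n43268 care\n43269 dontcare\n"

# finditer yields one match object per matching line.
hits = [(m.group(1), m.group(2)) for m in ptrn.finditer(sample)]
print(hits)  # [('43268', 'care')]
```

The "dontcare" lines are rejected because the whitespace character must be followed directly by `care`. Since `finditer` returns every match, the same approach should carry over to the case where several lines are of interest.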

The C++ Program takes 13.954 seconds

#include <string>
#include <iostream>
#include <fstream>
#include <regex>     // std::regex, std::smatch, std::regex_search
#include <ctime>
using namespace std;

int main(){
   double duration;
   std::clock_t start;
   ifstream testfile("testfile.txt", ios_base::in);
   unsigned int line_idx = 0;
   bool found = false;
   string line;
   regex ptrn("^(\\d*)\\s(care)$");

   start = std::clock();   /* Debug time */
   while (getline(testfile, line)) 
   {
      std::smatch matches;
      if(regex_search(line, matches, ptrn))
      {
         found = true;
      }
   }
   testfile.close();
   duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
   std::cout << "Found? " << (found ? "yes" : "no") << std::endl;
   std::cout << " Total time: " <<  duration << std::endl;
}

Python Program takes 0.02200 seconds

import sys, os       # to navigate and open files
import re            # to search file
import time          # to benchmark

ptrn  = re.compile(r'^(\d*)\s(care)$', re.MULTILINE)

start = time.time()
with open('testfile.txt','r') as testfile:
   filetext = testfile.read()
   matches = re.findall(ptrn, filetext)
   print("Found? " + ("Yes" if len(matches) == 1 else "No"))

end = time.time()
print("Total time", end - start)
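
As a rough sanity check that the two programs are doing comparable work, the sketch below runs the same pattern both ways in Python: once per line (as the C++ loop does) and once over the whole text (as the Python program above does). It builds the text in memory instead of reading testfile.txt, and the helper names `build_text`, `search_per_line`, and `search_whole_text` are made up for this illustration. Timings will vary by machine, so none are quoted here.

```python
import re
import time

def build_text(n_lines=50000, care_idx=43268):
    """Build the same content as testfile.txt, in memory."""
    return "".join(
        f"{i} {'care' if i == care_idx else 'dontcare'}\n" for i in range(n_lines)
    )

def search_per_line(text):
    """One regex call per line, mirroring the C++ getline loop."""
    ptrn = re.compile(r'^(\d*)\s(care)$')
    return [m.groups() for m in map(ptrn.search, text.splitlines()) if m]

def search_whole_text(text):
    """A single regex call over the whole text."""
    ptrn = re.compile(r'^(\d*)\s(care)$', re.MULTILINE)
    return ptrn.findall(text)

text = build_text()
for fn in (search_per_line, search_whole_text):
    start = time.perf_counter()
    result = fn(text)
    elapsed = time.perf_counter() - start
    print(fn.__name__, result, f"{elapsed:.4f} s")
```

Both functions should report the same single match; any difference in elapsed time between them gives a feel for how much of the gap comes from calling the regex once per line rather than once overall.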
Iancovici
  • try to `break` when you found a match in your C++ loop... – Jean-François Fabre Apr 04 '18 at 12:25
  • also declare `std::smatch matches;` outside the loop to avoid it being constructed over and over – Jean-François Fabre Apr 04 '18 at 12:26
  • @Jean-FrançoisFabre, it's applicable in this use case, but it's not the same for the real use case because there are multiple lines I care about. I'll update the question – Iancovici Apr 04 '18 at 12:26
  • other difference: you're performing only 1 call to regex in python because you're reading all the file at once. With C++ code you're calling it once per line. – Jean-François Fabre Apr 04 '18 at 12:26
  • 13 seconds, there's clearly a problem! – Jean-François Fabre Apr 04 '18 at 12:27
  • what flags did you use to compile? Somehow everyone always thinks that debug builds should be fast in c++.... – UKMonkey Apr 04 '18 at 12:29
  • the `getline` also performs a lot of string allocation/copies. Try to make your C++ code as close as possible to the python code first. – Jean-François Fabre Apr 04 '18 at 12:35
  • Can you just try something like this for reading the file: https://stackoverflow.com/questions/2602013/read-whole-ascii-file-into-c-stdstring – Ratah Apr 04 '18 at 12:38
  • Which compiler do you use? I tried g++ and it gave me 0.042s on your code with no modifications except g++-specific fixes like `int main` – Denis Sheremet Apr 04 '18 at 12:41
  • @Ratah, great! brought it down to 8.923 seconds – Iancovici Apr 04 '18 at 12:42
  • @DenisSheremet Visual Studio – Iancovici Apr 04 '18 at 12:42
  • @UKMonkey I think you're right, I'm in debug mode, no optimization. Need to resolve some release mode errors before testing it – Iancovici Apr 04 '18 at 12:43
  • I've closed the question, as it seems a related problem. Don't hesitate to edit it if you still have unsolved issues and tell me so I can reopen (or not ;)) – Jean-François Fabre Apr 04 '18 at 12:58
  • @Jean-FrançoisFabre, Thanks, this was resolved thanks to UKMonkey, and props to Ratah for the improvement – Iancovici Apr 04 '18 at 13:02
  • One of the biggest differences is due to the less efficient regex implementation on VS. By using [`boost::regex`](https://stackoverflow.com/q/14205096/1460794) the time went from 0.052 down to 0.005 for C++ vs 0.024 for Python. – wally Apr 04 '18 at 14:39
  • There are many different regex libraries for C and C++, with different feature sets and different strategies. Performance varies wildly depending on which library, what regex you have, what the input looks like, and if you apply the regex line by line or just have it scan the whole input in one go. Also, be aware that in most cases, the regex will be compiled before you use it to match some input, and that the compilation time itself can be substantial, which matters if you only do a few searches with a given regex. – G. Sliepen Apr 04 '18 at 18:01
  • Please post the solution as an **answer**, not as a question edit. – Wiktor Stribiżew Apr 04 '18 at 21:34

1 Answer

Implementing Ratah's recommendation brought the time down to 8.923 seconds.

That is an improvement of about 5 seconds, achieved by reading the whole file into a single string:

   double duration;
   std::clock_t start;
   ifstream testfile("testfile.txt", ios_base::in);
   unsigned int line_idx = 0;
   bool found = false;
   string line;
   regex ptrn("^(\\d*)\\s(care)$");
   std::smatch matches;

   start = std::clock();   /* Debug time */
   std::string test_str((std::istreambuf_iterator<char>(testfile)),
                 std::istreambuf_iterator<char>());

   if(regex_search(test_str, matches, ptrn))
   {
      found = true;
   }
   testfile.close();
   duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
   std::cout << "Found? " << (found ? "yes" : "no") << std::endl;
   std::cout << " Total time: " <<  duration << std::endl;

After UKMonkey's note, I reconfigured the project to a release build, which also enables /O2, and that brought it down to 0.086 seconds.

Thanks to Jean-François Fabre, Ratah, and UKMonkey.

Iancovici