1

I am writing a program that frequently reads chunks of text received from the web, looks for specific characters, and parses the data accordingly. I am becoming fairly skilled with C++ and have made it work well. However, is assembly going to be faster than a loop like this:

for(size_t len = 0;len != tstring.length();len++) {
    if(tstring[len] == ',')
        stuff();
}

Would an inline-assembly routine using cmp and jz/jnz be faster? I don't want to spend time on asm just to be able to say I used it, only for a real speed gain.

Thank you,

  • 9
    [`std::string::find()`](http://en.cppreference.com/w/cpp/string/basic_string/find) is a bit more appropriate than your loop. – chris Jul 20 '12 at 00:01
  • 7
    Don't even **think** about assembler until you've established that this section is a hotspot, and have exhausted all possible alternatives. – Oliver Charlesworth Jul 20 '12 at 00:03
  • 2
    In general, "premature optimization is the root of all evil" (Donald Knuth). In general, *don't* "optimize" unless you have solid evidence you need to (from profiling and benchmarking). But ... in this case... you actually *might* be able to do better with a C routine (or yes, maybe even some in-line assembler) than with the standard C++ string type. IMHO... – paulsm4 Jul 20 '12 at 00:07
  • Frankly, you can split the input buffer and operate on it with multiple threads in C if you're smart about it, so I don't know why you'd want to go assembly unless you're highly skilled in it or desperate :) – John Humphreys - w00te Jul 20 '12 at 00:12
  • I figured as much, so I'm glad I looked for advice first. I can imagine that it would complicate the process, and it is not necessarily a large performance issue. I'm no assembly expert, but the loop is partially necessary, because each character is sent to a stringstream if it's not a break character. – Collin Biedenkapp Jul 20 '12 at 00:12
  • possible duplicate of [Using Assembly Language in C/C++](http://stackoverflow.com/questions/4202687/using-assembly-language-in-c-c) – Tony Delroy Jul 20 '12 at 00:33
  • 2
    @w00te - Are you SERIOUS? Split the input buffer & operate on it with multiple threads? REALLY? I am SO hoping you're just jesting. If your system does very little besides this text search operation AND you have multiple cores AND you know how many cores you have AND you create only as many threads as you have cores, THEN this would speed things up. If you create threads that operate on the same core, this solution will slow your search operation down, not speed it up. – phonetagger Jul 20 '12 at 00:42
  • 2
    @phonetagger (1) If you're doing massive data processing this is common. (2) Even my 3-4 year old lap top is multi core. If this is in use in industry this would be very common and if it's university it's a good practice exercise. (3) Why would you multi-thread and operate on the same core for data processing? That's idiotic. – John Humphreys - w00te Jul 20 '12 at 00:56
  • 1
    Collin: If you're downloading a couple "chunks" from the web, processing those, downloading a few more - it's overwhelmingly likely that the waits for new web content take millions of times longer than the processing: check your CPU utilisation and it will probably be <1%. But, maybe you mean you have a giga- or terabytes of data already downloaded that happens to have originally come from the web, and are ready to do some serious data crunching on it? – Tony Delroy Jul 20 '12 at 02:48
  • @Collin - If you are processing comma separated values, finding the commas is unlikely to be the most expensive part. – Bo Persson Jul 20 '12 at 12:15

4 Answers

3

No way. Your loop is so simple, the cost of the optimizer losing the ability to reason about your code is going to be way higher than any performance you could gain. This isn't SSE intrinsics or a bootloader, it's a trivial loop.

Puppy
  • SSE can make scanning a string for a single character like this significantly faster. (But the question was about cmp so you're not wrong.) – SoapBox Jul 20 '12 at 01:02
  • Indeed. That would completely depend on `stuff()`, though. However, I do accept that automatic vectorization is not a strong point of compilers and going down to intrinsics or hand-written assembly is not especially unreasonable. – Puppy Jul 20 '12 at 05:42
1

Checking characters one by one is not the fastest thing to do. Maybe you should try something like this and find out if it's faster.

string s("xxx,xxxxx,x,xxxx");
string::size_type pos = s.find(',');  
while(pos != string::npos){
    do_stuff(pos);
    pos = s.find(',', pos+1);       
}

Each iteration of the loop gives you the position of the next ',' character, so the loop body runs only once per match rather than once per character.

milan-j
  • That just checks the characters one by one as well. There's no other way to search an arbitrary string for a single character. – Seth Carnegie Jul 20 '12 at 02:45
  • 1
    Yea, like I said, I don't know how find method works internally, but I believe it is well optimized for the task. He can always check the speed and find out, I was just trying to give an alternative solution. – milan-j Jul 20 '12 at 02:53
  • 1
    I will use this loop instead, and stay out of ASM for now. Later on in the project, I will be reading through possibly hundreds of small files where every line of data matters, and a mistake is much too easy for me to make in ASM for that. – Collin Biedenkapp Jul 20 '12 at 03:55
  • @CollinBiedenkapp You could also define do_stuff() as an inline function, because a regular function call does a lot of work, such as saving registers, copying arguments, and branching to a new location. An inline function is expanded "in line", which avoids the function call overhead. But be aware that the inline specifier is only a request to the compiler, which it may choose to ignore. – milan-j Jul 22 '12 at 13:12
1

An inline assembly routine using "plain old" jz/jnz is unlikely to be faster than what you have; that said, you have a few inefficiencies in your code:

  • you're retrieving tstring.length() once per loop iteration; that's unnecessary.
  • you're using random-access indexing, tstring[len], which might be a more expensive operation than using a forward iterator.
  • you're calling stuff() during the loop; depending on what exactly that does, it might be faster to just let the loop build a list of locations within the string first (so that the scanned string as well as the scanning code stays cache-hot and is not evicted by whatever stuff() does), and only afterwards iterate over those results.
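The three points above can be sketched roughly as follows. This is only an illustration under assumptions: `find_delimiters` is a hypothetical helper name that does not appear in the question, and whether the two-pass approach actually helps depends on what stuff() does, so it needs measuring.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (name not from the question): first pass only.
// It records the offset of every delimiter so that the scan itself
// stays a tight loop that never calls out to other code.
std::vector<std::size_t> find_delimiters(const std::string& text, char delim) {
    std::vector<std::size_t> positions;
    // The end iterator is computed once, not on every iteration.
    for (auto it = text.begin(), end = text.end(); it != end; ++it) {
        if (*it == delim) {
            positions.push_back(static_cast<std::size_t>(it - text.begin()));
        }
    }
    return positions;
}
```

A second loop would then walk the returned offsets and call stuff() for each one, for example `for (std::size_t pos : find_delimiters(tstring, ',')) stuff();`.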

A low-level, likely optimized standard library function already exists for exactly this kind of scanning: strchr(). The C++ standard library's std::string::find() is also likely to have been optimized for the purpose (and/or might use strchr() in the char specialization).
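As a minimal sketch of scanning with strchr(), assuming the text contains no embedded NUL characters (strchr() stops at the first one) and the delimiter is not '\0'; `count_char_strchr` is a hypothetical name used only for this illustration:

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: counts occurrences of delim by repeatedly
// calling std::strchr on the string's NUL-terminated buffer.
// Assumes delim != '\0' and that text has no embedded NUL bytes.
std::size_t count_char_strchr(const std::string& text, char delim) {
    std::size_t count = 0;
    const char* p = text.c_str();
    while ((p = std::strchr(p, delim)) != nullptr) {
        ++count;
        ++p;  // continue searching just past the match
    }
    return count;
}
```

Whether this beats std::string::find() depends on the library implementation, so it is worth measuring both on real input.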

In particular, strchr() has SSE2 (using pcmpeqb, maskmov... and bsf) or SSE4.2 (using the string op pcmpistri) implementations; for examples/actual SSE code doing this, check e.g. strchr() in GNU libc (as used on Linux). See also the references and comments here (suitably named website ...).

My advice: check your library implementation and documentation, and/or the actual generated assembly code for your program. You might well be using fast code already, or would be if you switched from your hand-rolled character-by-character search to std::string::find() or strchr().
If this is ultra-speed-critical, then inlining assembly code for strchr() as used by known/tested implementations (watch licensing) would eliminate function calls and gain a few cycles. Depends on your requirements ... code, benchmark, vary, benchmark again, ...

FrankH.
0

Would an inline-assembly routine using cmp and jz/jnz be faster?

Maybe, maybe not. It depends upon what stuff() does, what the type and scope of tstring is, and what your assembly looks like.

First, measure the speed of the maintainable C++ code. Only if this loop dominates your program's speed should you consider rewriting it.

If you choose to rewrite it, keep both implementations available, and comparatively measure them. Only use the less maintainable version if it is faster, and if the speed increase matters. Also, since you have the original version in place, future readers will be able to understand your intent even if they don't know asm that well.
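One way to keep two implementations side by side and time them might look like the sketch below. The names `count_loop`, `count_find`, and `seconds_for` are hypothetical, the second implementation stands in for whatever rewrite is being evaluated, and a serious benchmark would also need realistic input sizes and an optimized build.

```cpp
#include <chrono>
#include <cstddef>
#include <string>

// Baseline: the straightforward, maintainable loop.
std::size_t count_loop(const std::string& s, char c) {
    std::size_t n = 0;
    for (auto it = s.begin(), end = s.end(); it != end; ++it) {
        if (*it == c) ++n;
    }
    return n;
}

// Candidate: stands in for the rewritten version being evaluated.
std::size_t count_find(const std::string& s, char c) {
    std::size_t n = 0;
    for (std::string::size_type pos = s.find(c); pos != std::string::npos;
         pos = s.find(c, pos + 1)) {
        ++n;
    }
    return n;
}

// Runs f(s, c) `repeats` times and returns the elapsed wall time in
// seconds. The last result is stored in `result` so that the work is
// observable and both versions can be checked for agreement.
template <typename F>
double seconds_for(F f, const std::string& s, char c, int repeats,
                   std::size_t& result) {
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; ++i) {
        result = f(s, c);
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling `seconds_for(count_loop, text, ',', 1000, r1)` and `seconds_for(count_find, text, ',', 1000, r2)` on the same input gives two timings to compare, and `r1 == r2` confirms that both versions agree.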

Robᵩ