0

I am processing CSV and using the following code to process a single line.

play with code

std::vector<std::string> string_to_vector(const std::string& s, const char delimiter, const char escape) {
  std::stringstream sstr{s};
  std::vector<std::string> result;
  while (sstr.good()) {
    std::string substr;
    getline(sstr, substr, delimiter);
    while (substr.back() == escape) {
      std::string tmp;
      getline(sstr, tmp, delimiter);
      substr += "," + tmp;
    }
    result.emplace_back(substr);
  }
  return result;
}

What it does: Function breaks up string s based on delimiter. If the delimiter is escaped with escape the delimiter will be ignored.

This code works but is super slow. How can I speed it up?

Do you know any existing csv processing implementation that does exactly this and which I could use?

User12547645
  • 3,123
  • 20
  • 33
  • 1
    FWIW, You should move `getline(sstr, substr, delimiter)` into the while condition instead of using `sstr.good()`. You should always use the read operation to control reading loops. More at: https://stackoverflow.com/questions/5605125/why-is-iostreameof-inside-a-loop-condition-i-e-while-stream-eof-cons – NathanOliver Jul 31 '20 at 18:24
  • 1
    `while (sstr.good())` is not so good. – Waqar Jul 31 '20 at 18:25
  • 1
    You can `move` the `substr` into the results vector – Waqar Jul 31 '20 at 18:26
  • You could eliminate the usage of `stringstream` and use the `string::find` and `string::substr` methods. Looks like an extra copy or duplicate effort when using `std::stringstream`. – Thomas Matthews Jul 31 '20 at 18:27
  • 3
    "How can I speed it up?" - Compile your code with compiler optimizations *enabled* is the obvious first suggestion (note: that's usually not the default when you build your code, you need to take explicit steps to do so). – Jesper Juhl Jul 31 '20 at 18:29
  • @NathanOliver and Waqar: Okay. Did this. Thank you for the advice – User12547645 Jul 31 '20 at 18:31
  • @Waqar How should I move the substr in the result vector? – User12547645 Jul 31 '20 at 18:32
  • 2
    `result.emplace_back(std::move(substr));` – john Jul 31 '20 at 18:59
  • @john totally agree! Thanks! – User12547645 Jul 31 '20 at 19:00

1 Answers1

5

The fastest way to do something is to not do it at all.

If you can ensure that your source string s will outlive the use of the returned vector, you could replace your std::vector<std::string> with std::vector<char*> which would point to the beginning of each substring. You then replace your identified delimiters with zeroes.

[EDIT] I have not moved up to C++17, so no string_view for me :)

NOTE: typical CSV is different from what you imply; it doesn't use escape for the comma, but surrounds entries with comma in it with double quotes. But I assume you know your data.

Implementation:

#include <iostream>
#include <vector>
#include <string>

std::vector<char*> string_to_vector(std::string& s, 
                                    const char delimiter, const char escape) 
{
  size_t prev(0), pos(0), from(0);
  std::vector<char*> v;
  while ((pos = s.find(delimiter, from)) != s.npos)
  {
    if (pos == 0 || s[pos - 1] != escape)
    {
      s[pos] = 0;
      v.push_back(&s[prev]);
      prev = pos + 1;
    }
    from = pos + 1;
  }
  v.push_back(&s[prev]);
  return v;
}

int main() {
  std::string test("this,is,a\\,test");
  std::vector<char*> v = string_to_vector(test, ',', '\\');

  for (auto& s : v)
    std::cout << s << " ";
}
Vlad Feinstein
  • 7,028
  • 1
  • 8
  • 18
  • 1
    Even more C++: Use std::string_view instead of char*. – Timbo Jul 31 '20 at 18:54
  • Could you guys provide the full example. AFAIK string_views are just `char*` with a size attribute. So both should be fine – User12547645 Jul 31 '20 at 18:55
  • With string_view, there is no need to modify the original character storage. – Timbo Jul 31 '20 at 19:09
  • The problem with `char*` and `std::string_view` is the allocation/storage policy: how data is allocated and who is the ownership? Returning a `std::vector` or a `std::vector` seems great at first glance, but it is the programmer's responsibility to ensure that all `std::string_view`/`char*` does not outlive the pointed-to character array. Since to the delimiter not being always a "," in this implementation (see the concatenation), new data has to be allocated and stored somewhere as long as the vector is used. – Jérôme Richard Jul 31 '20 at 19:28
  • I think I might be able to do that @JérômeRichard. Could you provide me with a working example? – User12547645 Jul 31 '20 at 20:39
  • @JérômeRichard it wouldn't be hard to make a class that makes its own copy of the string to ensure it remains valid, then provides `operator[]` to make it look like a `vector`. You could even provide `begin()` and `end()` that return a `vector::iterator` using the private `vector` contained within the object. – Mark Ransom Jul 31 '20 at 22:17
  • @MarkRansom - that would be an elegant solution. In my defense, the OP did ask for "fast" and tagged this with `performance`... :) – Vlad Feinstein Jul 31 '20 at 22:46
  • 1
    @MarkRansom Sure, but the resulting type would be quite different from the original requirement (which may not be a requirement). I mainly wanted to point out that the big string to create is far from being free. I am wondering if this would be much faster than adding custom allocator for strings, avoiding copies with `std::move` and parsing the string with something faster than `std:stringstream` (based on the original code). This being said, this is an interesting approach! – Jérôme Richard Aug 01 '20 at 08:01