
I want to write a function that splits a string into all overlapping substrings of length n, advancing one character at a time, and returns them as a character vector.

e.g. f('atgctgttg', n=5) should return

'atgct','tgctg','gctgt','ctgtt','tgttg'

I tried two different functions:

// [[Rcpp::export]]
CharacterVector f( const std::string str, const int n ) {
    int lim = str.length() - n + 1;
    CharacterVector result( lim );
    for ( int j = 0; j < lim; j++ )
    { 
        result[j] = str.substr( j, n );
    }
    return result;
}

and

// [[Rcpp::export]]
CharacterVector f1( const std::string str, const int n ) {
    const int lim = str.length();
    const int n1 = n - 1;
    CharacterVector result( lim - n1 );
    int j = 1;
    std::string tmp = str.substr( 0, n );
    result[0] = tmp;

    for ( int i = n; i < lim; i++ )
    {
        tmp.erase( 0, 1 );
        tmp.push_back( str[i] );
        result[j] = tmp;
        j++;
    }
    return result;
}

I also tried using an iterator but it wasn’t faster than function f1. Note that Rcpp transforms inputs into reference variables. Is there a faster way to do this?

  • None that I see at the moment. I am almost afraid to ask, but is `f1()` actually *faster* than `f()`? That is *terrible* code right there... – DevSolar Jan 12 '16 at 15:07
  • You didn't specify what `CharacterVector` is, and passing `std::string` by `const&` makes more sense than just `const`. – LogicStuff Jan 12 '16 at 15:09
  • Do you have to have a list? Or just an iterator? Custom iterator may be faster... –  Jan 12 '16 at 15:09
  • Yes, but f1 is faster because it doesn't call str on every iteration the way f does. – Mohamed Ezzeddine Macherki Jan 12 '16 at 15:11
  • If I return a list, I will have to unlist it! – Mohamed Ezzeddine Macherki Jan 12 '16 at 15:14
  • Assuming that `CharacterVector` is a typedef for a `std::vector`, you may want to call `reserve` before any push backs, as opposed to initializing it with all empty strings. – AndyG Jan 12 '16 at 15:16
  • @MohamedEzzeddineMacherki: Have you *measured* `f1` to be faster, or are you *assuming* that? Have you measured that you *have* a performance problem here (which I find hard to believe)? Have you tried `const std::string & str` instead of `const std::string str`, which probably gives you more performance boost than any funny things you could pull off in `f2`? And, what AndyG said... – DevSolar Jan 12 '16 at 15:16
  • Why I do while all parameters are finite – Mohamed Ezzeddine Macherki Jan 12 '16 at 15:18
  • Maybe `std::experimental::basic_string_view` is available for your compiler? See the [reference](http://en.cppreference.com/w/cpp/experimental/basic_string_view), and another [SO post](http://stackoverflow.com/questions/20803826/what-is-string-view) on the subject. – mindriot Jan 12 '16 at 15:32
  • f1 is faster, you can try it. The & is not required in Rcpp. – Mohamed Ezzeddine Macherki Jan 12 '16 at 15:33
  • std::experimental seems complicated, and I can't find a good example! – Mohamed Ezzeddine Macherki Jan 12 '16 at 15:36
  • Also see http://stackoverflow.com/questions/13319858/slice-a-string-at-consecutive-indices-with-r-rcpp. – Kevin Ushey Jan 12 '16 at 18:09

3 Answers


First, there's a problem with your function signature:

CharacterVector f( const std::string str, const int n )

You're passing the string by value, so every call of the function copies the string (unless you pass movable strings using C++11). It's better to pass the string by const reference: const std::string& str.
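
As a minimal sketch of that signature change, here is the first function from the question written against plain standard-library types. `std::vector<std::string>` stands in for `CharacterVector` (an assumption, made so the example compiles without Rcpp), and the name `windows` is hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for f(): returns every length-n window of str.
// The string is taken by const reference, so calling it does not copy it.
std::vector<std::string> windows(const std::string& str, std::size_t n) {
    std::vector<std::string> result;
    if (n == 0 || n > str.size()) {
        return result;  // no window of that length exists
    }
    const std::size_t count = str.size() - n + 1;
    result.reserve(count);  // size the vector's storage once, up front
    for (std::size_t j = 0; j < count; ++j) {
        result.emplace_back(str, j, n);  // builds the substring in place
    }
    return result;
}
```

Calling `windows(input, 5)` with `input` holding "atgctgttg" yields the five substrings listed in the question. Whether the same change helps under Rcpp depends on how Rcpp converts the R argument, which this sketch does not model.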

With respect to the question, two possible answers come to mind:

  1. Return actual copies of the characters of the input string. In this case, iterating over the string by index and inserting each new string into the structure, as in the first code example, should be fast (the minimum possible is one copy: the copy of the substring into the structure).
  2. Return a structure of pointers into the real string, for example a proxy object that contains the (start, end) of each substring within the string. The advantage is that no string data is copied. For example:

Code (tested: GCC 4.9.2 with C++11)

#include <iostream>
#include <string>
#include <vector>

struct string_ref {
    const char* start;
    const char* end;
};

// [[Rcpp::export]]
std::vector<string_ref> f(std::string&&, const int) = delete; // disallow calls with temporaries
// [[Rcpp::export]]
std::vector<string_ref> f(const std::string& str, const int n) {
    int lim = str.length() - n + 1;
    std::vector<string_ref> result(lim);
    for (int j = 0; j < lim; j++) {
        result[j] = { &str[j], &str[j + n] };
    }
    return result;
}

int main() {
    std::string input{"atgctgttg"};
    auto result = f(input, 5);
    for (const auto r : result) {
        std::cout << std::string(r.start, r.end) << std::endl;
    }
    return 0;
}

This method is used by many libraries that parse text (e.g. lexers, regex engines). std::string_view, standardized in C++17, is a type for referencing part or all of a string's characters without copying them.
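
A minimal sketch of the same idea with `std::string_view`, assuming a C++17 compiler. The helper name `window_views` is hypothetical, and the lifetime caveat of the pointer version applies unchanged: the views are only valid while the viewed buffer is alive.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: each element views n characters of the caller's buffer.
// No characters are copied; the views dangle if that buffer is destroyed.
std::vector<std::string_view> window_views(std::string_view text, std::size_t n) {
    std::vector<std::string_view> views;
    if (n == 0 || n > text.size()) {
        return views;
    }
    views.reserve(text.size() - n + 1);
    for (std::size_t j = 0; j + n <= text.size(); ++j) {
        views.push_back(text.substr(j, n));  // O(1): just a pointer and a length
    }
    return views;
}
```

Each view can be turned into an owning string with `std::string(view)` at the point where a copy is actually needed.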

Judging by the comment in the code, you are implementing this function for use in R (I don't know the details). In that case the second solution could cause memory-access problems: the input string's memory must stay alive and accessible for as long as the substring pointers are used. If the input string is created in R and passed to f, the returned pointers will probably be valid, but the best proof is to test it.

Of the two code examples in the question, the first should be the faster one. In the second, every loop iteration erases and pushes back a character: erasing the first character most likely requires moving all the other characters of the string in most STL implementations, and push_back may need to grow the string's memory in some cases.
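
Since the comments disagree about which version is faster, a small harness for measuring it may help. This is a sketch under assumptions: `by_substr` and `by_sliding` are hypothetical plain-C++ renderings of the question's `f` and `f1` (with `std::vector<std::string>` in place of `CharacterVector`), and `time_us` is a hypothetical helper built on `std::chrono::steady_clock`.

```cpp
#include <chrono>
#include <cstddef>
#include <string>
#include <vector>

// Strategy of f(): one substr() call per window.
std::vector<std::string> by_substr(const std::string& s, std::size_t n) {
    std::vector<std::string> out;
    if (n == 0 || n > s.size()) return out;
    out.reserve(s.size() - n + 1);
    for (std::size_t j = 0; j + n <= s.size(); ++j) {
        out.push_back(s.substr(j, n));
    }
    return out;
}

// Strategy of f1(): slide a buffer with erase() and push_back().
std::vector<std::string> by_sliding(const std::string& s, std::size_t n) {
    std::vector<std::string> out;
    if (n == 0 || n > s.size()) return out;
    out.reserve(s.size() - n + 1);
    std::string tmp = s.substr(0, n);
    out.push_back(tmp);
    for (std::size_t i = n; i < s.size(); ++i) {
        tmp.erase(0, 1);      // shifts the remaining n - 1 characters left
        tmp.push_back(s[i]);
        out.push_back(tmp);   // still copies n characters into the result
    }
    return out;
}

// Wall-clock time of one call to fn, in microseconds.
template <typename F>
long long time_us(F&& fn) {
    const auto t0 = std::chrono::steady_clock::now();
    fn();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}
```

Running `time_us([&] { by_substr(input, 5); })` and the same for `by_sliding` on a long input, several times each, gives numbers to compare. No timings are claimed here; results depend on the compiler, the optimization level, and the string implementation.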

NetVipeC
  • I think in your example you might want to use a plain lvalue reference instead of a `const &`. I am not 100% certain, but if you call the function with a temporary string then you are storing pointers to a buffer that gets deleted after the function call. – NathanOliver Jan 12 '16 at 15:44
  • Yes, when a temporary string is passed, the pointers point to bad memory. Thanks very much; code updated to delete the overload taking an rvalue parameter. – NetVipeC Jan 12 '16 at 15:54
  • Thanks for the ideas, but I tried this code using sourceCpp and it doesn't work. I think there is a difference between Rcpp and C++ structure. – Mohamed Ezzeddine Macherki Jan 12 '16 at 16:05
  • Most probably it is C++11 support. This code could be modified for C++03: don't use auto (use the actual type), and don't use delete (add a body with `assert(false);`). – NetVipeC Jan 12 '16 at 16:27

The approach I would use is to create an iterator to the start of the string and an iterator to one past the end of the first substring. Then, with a std::vector, use emplace_back() to construct each substring directly at the end of the vector, incrementing both iterators until you reach the end.

std::vector<std::string> splitString(const std::string& str, std::size_t len)
{
    if (len >= str.size())
        return { str };
    auto it = str.begin();
    auto end = it + len;
    std::vector<std::string> strings;
    while (end != str.end())
    {
        strings.emplace_back(it, end);
        ++end;
        ++it;
    }
    // have to do this to get the last string since end == str.end()
    strings.emplace_back(it, end);
    return strings;
}

Live Example

NathanOliver

The compiler should generate fast code for your f function if you change to passing by reference: CharacterVector f(const std::string& str, const int n)


While you won't see speed improvements, you could definitely simplify your process by doing away with CharacterVector and just using a vector<string>:

const string str("atgctgttg");
const int n = 5; // Assumed positive number smaller than str.size()
const int n1 = n - 1;
vector<string> result(str.size() - n1);

transform(str.cbegin(), str.cend() - n1, result.begin(), [n](const auto& i) {return string(&i, n);});

[Live Example]


One way you could see speed improvements is if you could use array instead of string:

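constexpr int N = 5; // Substring length; must be a compile-time constant for array
const string str("atgctgttg");
const int n1 = N - 1;
vector<array<char, N>> result(str.size() - n1);

transform(str.cbegin(), str.cend() - n1, result.begin(), [](const auto& i) {
    array<char, N> chunk; // One fixed-size window, not null-terminated

    copy_n(&i, N, chunk.begin());
    return chunk;
});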

[Live Example]


But by far the fastest (and best) way to do this is to work on the original string and not break it into an array of strings. This requires a bit more work on the backend, because you'll need to work with C-strings instead of std::strings. For example, I've used for (auto& i : result) cout << string(i.data(), N) << endl; to print all my vectors, but without a vector you could print like this: for (auto i = str.cbegin(); i != str.cend() - n1; ++i) printf("%.*s\n", n, &*i); This is obviously a bit more work, but if your str is large you'll find it much faster.

[Live example]
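
The idea of working on the original string can be wrapped in a small helper so that callers never materialize the substrings. This is a sketch, not code from the answer: `for_each_window` is a hypothetical name, and the visitor receives a pointer into the original buffer plus a length, in the same spirit as the `printf("%.*s")` loop.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: calls visit(ptr, n) once for each length-n window of str.
// ptr points into str itself; the window is NOT null-terminated after n chars.
template <typename Visitor>
void for_each_window(const std::string& str, std::size_t n, Visitor visit) {
    if (n == 0 || n > str.size()) {
        return;  // no window of that length exists
    }
    const char* const base = str.data();
    for (std::size_t j = 0; j + n <= str.size(); ++j) {
        visit(base + j, n);
    }
}
```

A caller that only needs to count, hash, or print each window can pass a lambda such as `[](const char* p, std::size_t len) { std::fwrite(p, 1, len, stdout); }` (which needs `<cstdio>`) and avoid allocating any substring at all.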

Jonathan Mee
  • About the &: Rcpp manages the reference setup of the function automatically. If you pass the variable by reference, it becomes more complicated and the run time of the code increases. – Mohamed Ezzeddine Macherki Jan 12 '16 at 17:43