First, there's a problem with your function signature:
CharacterVector f( const std::string str, const int n )
You're passing the string by value, so every call of the function makes a copy of the string (unless you are passing movable strings using C++11). It's better to pass the string by const reference: const std::string& str.
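To make the difference concrete, here is a minimal sketch. The two helper names are made up for illustration; each one only reports whether its parameter is the caller's own string object or a separate one.

```cpp
#include <string>

// Hypothetical helper: s is a new std::string object, copy-constructed
// from the caller's string on every call, so the addresses differ.
bool is_same_object_by_value(const std::string s, const std::string* caller) {
    return &s == caller;
}

// Hypothetical helper: s is an alias for the caller's string,
// so nothing is copied and the addresses match.
bool is_same_object_by_ref(const std::string& s, const std::string* caller) {
    return &s == caller;
}
```

Called with the same string and its address, the by-value version reports a different object and the by-reference version reports the same one.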
With respect to the question, two possible answers come to mind:
- Return actual copies of the characters of the input string. In this case, iterating over the string by index and inserting each new substring into the structure, as in code example 1, should be fast (the minimum possible is one copy per substring: the copy of the substring into the structure).
- Return a structure of pointers into the original string, e.g. a proxy object that holds the (start, end) of the substring within the string. The advantage is that no string data is copied. Example:
Code (tested: GCC 4.9.2 with C++11)
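#include <iostream>
#include <string>
#include <vector>

// Non-owning reference to a substring: [start, end) points into the original string.
struct string_ref {
    const char* start;
    const char* end;
};

// [[Rcpp::export]]
std::vector<string_ref> f(std::string&&, const int) = delete; // disallow calls with temporaries

// [[Rcpp::export]]
std::vector<string_ref> f(const std::string& str, const int n) {
    // Number of windows of length n; zero if n is non-positive or longer than str.
    const int len = static_cast<int>(str.length());
    const int lim = (n > 0 && n <= len) ? len - n + 1 : 0;
    std::vector<string_ref> result(lim);
    for (int j = 0; j < lim; j++) {
        // Store pointers only; no characters are copied.
        result[j] = { str.data() + j, str.data() + j + n };
    }
    return result;
}

int main() {
    std::string input{"atgctgttg"};
    auto result = f(input, 5);
    for (const auto& r : result) {
        // A copy is made here only for printing.
        std::cout << std::string(r.start, r.end) << '\n';
    }
    return 0;
}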
This method is used by many libraries that parse text (e.g. lexers, regex engines). The type std::string_view, proposed for C++17 and since adopted in that standard, serves the same purpose: it references part or all of a string's characters without owning them.
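As a rough sketch of the same idea with std::string_view (this needs a C++17 compiler, unlike the C++11 code above; the function name kmer_views is made up for illustration):

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Disallow calls with temporaries, as in the pointer-based version.
std::vector<std::string_view> kmer_views(std::string&&, std::size_t) = delete;

// Hypothetical name: returns every window of length n as a non-owning view.
// The views are valid only while str is alive and unmodified.
std::vector<std::string_view> kmer_views(const std::string& str, std::size_t n) {
    std::vector<std::string_view> result;
    if (n == 0 || n > str.size()) {
        return result; // no window fits
    }
    const std::string_view whole(str);
    result.reserve(str.size() - n + 1);
    for (std::size_t j = 0; j + n <= str.size(); ++j) {
        result.push_back(whole.substr(j, n)); // pointer + length, no character copy
    }
    return result;
}
```

Each element is just a pointer and a length, so the trade-off is the same as with the (start, end) struct: no copies, but the caller has to keep the input string alive.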
According to the comment in the code, you are implementing the function for use in R (I don't know the details). In that case this second solution could cause memory-access problems: the input string's memory must stay alive and accessible for as long as the substring pointers are in use. If the input string is created in R and passed to f, the returned pointers will probably be valid, but the best proof is to test it.
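If the lifetime of the input cannot be guaranteed, one option is to turn the pointer pairs into owning strings while the input is still alive. A minimal sketch, repeating the string_ref struct so the block stands alone; materialize is a made-up helper name, not part of Rcpp or the standard library:

```cpp
#include <string>
#include <vector>

// Same layout as the proxy struct in the example: [start, end) into another string.
struct string_ref {
    const char* start;
    const char* end;
};

// Hypothetical helper: copies each referenced range into an owning std::string.
// Must be called while the string the refs point into is still alive.
std::vector<std::string> materialize(const std::vector<string_ref>& refs) {
    std::vector<std::string> out;
    out.reserve(refs.size());
    for (const string_ref& r : refs) {
        out.emplace_back(r.start, r.end); // copies the characters
    }
    return out;
}
```

This gives up the zero-copy advantage at the point where it is called, so it only pays off if the references are used for intermediate work and copied out once at the end.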
Of the two code examples in the question, the first should be the faster one. In the second, every loop iteration does an erase and a push_back of a character: erasing the first character most probably requires copying all the other characters of the string in most STL implementations, and the push_back could require expanding the string's memory in some cases.
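The question's code is not reproduced here, so the following is only an assumed reconstruction of the two approaches in plain C++, with std::vector<std::string> standing in for Rcpp's CharacterVector and both function names made up. It is meant to show where the extra work in the second variant comes from.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Assumed shape of the first approach: one substring copy per window.
std::vector<std::string> kmers_by_index(const std::string& str, std::size_t n) {
    std::vector<std::string> result;
    if (n == 0 || n > str.size()) return result;
    result.reserve(str.size() - n + 1);
    for (std::size_t j = 0; j + n <= str.size(); ++j) {
        result.push_back(str.substr(j, n)); // substring built once, then moved in
    }
    return result;
}

// Assumed shape of the second approach: slide a window string along the input.
std::vector<std::string> kmers_by_sliding(const std::string& str, std::size_t n) {
    std::vector<std::string> result;
    if (n == 0 || n > str.size()) return result;
    std::string window = str.substr(0, n);
    result.push_back(window);
    for (std::size_t j = n; j < str.size(); ++j) {
        window.erase(0, 1);        // shifts the remaining n - 1 characters left
        window.push_back(str[j]);  // appends the next input character
        result.push_back(window);  // still copies the window into the vector
    }
    return result;
}
```

Both functions return the same windows; the sliding version simply does the erase and push_back on top of the per-window copy that the index version also makes.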