2

I asked a similar question but left out an important detail. I am doing some text processing on an array of chars (a cstring). The input array is copied to the output array, except that certain characters get changed (e.g. a->b). This is done with a switch statement. What I want is that if two or more of certain characters are found in a row, only one of them gets copied to the new array (so I wouldn't want two spaces in a row).

This is what I've got so far; it works, but it does not yet skip runs of two or more of certain characters:

char cleanName[ent->d_namlen];
for(int i = 0; i < ent->d_namlen; i++)
{
    switch(ent->d_name[i])
    {
        case 'a' :
            cleanName[i] = 'b';//replace a's with b's (just an example)
            break;
        case ' ' ://fall through
        case '-' :
        case '–' :
        case '—' :
            cleanName[i] = '_';//replace spaces and dashes with underscores
            break;
        ....//more case statements
        default:
            cleanName[i] = ent->d_name[i];
    }
}

For example, if two characters in a row would both be replaced by underscores, how do I emit only one? Would I only execute the switch statement if (ent->d_name[i] != previous || (ent->d_name[i] != '-' && ent->d_name[i] != '_' && ent->d_name[i] != ' '))?

This may be more of an algorithm question than an implementation specific one.

Example input: abbc--d-e
Output: bbbc_d_e (for simplicity's sake assume 'a' is mapped to 'b', but really there is more than this)

Celeritas

5 Answers

2

Well, for this class of text-processing algorithms I would use a state machine.
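A minimal sketch of what such a state machine could look like, assuming the only mappings are the ones from the question's example ('a' to 'b', space and '-' to '_'). The names `mapChar`, `State`, `cleanName` and `example` are hypothetical and chosen just for illustration; it uses std::string rather than the raw array to keep the sketch short.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: the character mapping from the question's switch.
char mapChar(char c)
{
    switch (c) {
        case 'a': return 'b';
        case ' ':
        case '-': return '_';
        default:  return c;
    }
}

// Two states: the last character written was an underscore, or it was not.
enum class State { Normal, AfterUnderscore };

std::string cleanName(const std::string& in)
{
    std::string out;
    State state = State::Normal;
    for (char c : in) {
        const char mapped = mapChar(c);
        if (mapped == '_') {
            if (state == State::AfterUnderscore)
                continue;                  // one was already emitted, drop the repeat
            state = State::AfterUnderscore;
        } else {
            state = State::Normal;
        }
        out.push_back(mapped);
    }
    return out;
}

// Usage, with the example input from the question.
void example()
{
    assert(cleanName("abbc--d-e") == "bbbc_d_e");
}
```

Because the state is updated after mapping, a mixed run such as " - " also collapses to a single underscore.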

Vlad
2

Easiest would be to use std::unique with a custom predicate, after your existing transformation:

cleanName.erase(std::unique(std::begin(cleanName), std::end(cleanName),
    [](char c, char d) { return c == '_' && d == '_'; }), std::end(cleanName));

For a char array:

length = std::unique(cleanName, &cleanName[length],
    [](char c, char d) { return c == '_' && d == '_'; }) - cleanName;
cleanName[length] = '\0';
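For reference, a self-contained sketch of the char-array variant. The wrapper `collapseUnderscores` and the function `example` are hypothetical names, and the buffer is assumed to have room for one extra char for the terminator.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>

// Collapses runs of '_' in the first `length` chars of `buf` in place,
// writes a terminator, and returns the new length.
// Assumption: `buf` has room for at least length + 1 chars.
std::size_t collapseUnderscores(char* buf, std::size_t length)
{
    // std::unique keeps the first element of each group for which the
    // predicate holds and returns a pointer one past the new logical end.
    char* newEnd = std::unique(buf, buf + length,
        [](char c, char d) { return c == '_' && d == '_'; });
    *newEnd = '\0';
    return static_cast<std::size_t>(newEnd - buf);
}

void example()
{
    char cleanName[] = "bbbc__d_e";   // result of the switch, before collapsing
    std::size_t length = collapseUnderscores(cleanName, std::strlen(cleanName));
    assert(length == 8);
    assert(std::strcmp(cleanName, "bbbc_d_e") == 0);
}
```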
ecatmur
  • can I do it even though this is a cstring not a string? – Celeritas Aug 31 '12 at 19:55
  • @Celeritas absolutely, the whole point of C++ algorithms is that they work on any container type, including arrays. – ecatmur Aug 31 '12 at 19:56
  • but cleanName is just an array of chars, I don't think erase is defined for it? – Celeritas Aug 31 '12 at 20:00
  • @Celeritas no, you'd just update the `length` and then set a null terminator. See my update above. – ecatmur Aug 31 '12 at 20:01
  • @Celeritas: In that case the real question is: why are you using cstrings? – Grizzly Aug 31 '12 at 20:02
  • @Grizzly good question. Because that's what the `struct dirent` contains. If you know of any way around this please share. – Celeritas Aug 31 '12 at 21:12
  • @ecatmur could you please explain what the second and third parameters in std::unique are doing? I don't understand why there's an & in front of cleanName[length] is that supposed to point to the end of the array? In the third parameter I don't know the syntax [](){} – Celeritas Aug 31 '12 at 21:45
  • @Celeritas yes, that's the end of the array. The third parameter is a lambda; see http://stackoverflow.com/questions/7627098/what-is-a-lambda-expression-in-c11 – ecatmur Aug 31 '12 at 21:53
0

One thing to note is that, with the way your indexing works, the length of cleanName isn't necessarily going to be the same as the length of your input array. As such you need to be careful with the index i: once a character is skipped, the read position and the write position no longer match.
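One way to handle this is to keep separate read and write indices. The following is only a sketch under the question's example mapping; `cleanInto`, `in`, `out` and `example` are hypothetical names, and the destination is assumed to have room for len + 1 chars.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Copies `len` chars from `src` to `dst`, mapping characters and dropping
// repeated underscores. Returns the number of chars written, excluding the
// terminator. Assumption: `dst` has room for at least len + 1 chars.
std::size_t cleanInto(const char* src, std::size_t len, char* dst)
{
    std::size_t out = 0;                          // write index
    for (std::size_t in = 0; in < len; ++in) {    // read index
        char c = src[in];
        if (c == 'a')
            c = 'b';
        else if (c == ' ' || c == '-')
            c = '_';
        // Skip an underscore that would directly follow one already written.
        if (c == '_' && out > 0 && dst[out - 1] == '_')
            continue;
        dst[out++] = c;
    }
    dst[out] = '\0';
    return out;
}

void example()
{
    const char name[] = "abbc--d-e";
    char cleanName[sizeof name];
    std::size_t n = cleanInto(name, std::strlen(name), cleanName);
    assert(n == 8);                               // one char shorter than the input
    assert(std::strcmp(cleanName, "bbbc_d_e") == 0);
}
```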

ajon
0

My suggestion is to keep the last character found in a temporary variable, so that you can ignore the same character if it appears again. There are two ways of doing this.
The first is to add a while loop after the switch that consumes every character equal to the last one found. This is useful if the cstring has a lot of repeated characters:

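 char cleanName[ent->d_namlen + 1];//one extra char for the terminator
 char parent;
 int out = 0;//write index, falls behind i once characters are skipped
    for(int i = 0; i < ent->d_namlen; i++)
    {
        parent = '\0';//only set for characters whose repeats should be dropped
        switch(ent->d_name[i])
        {
            case 'a' :
                cleanName[out++] = 'b';//replace a's with b's (just an example)
                parent = ent->d_name[i];
                break;
            case ' ' ://fall through
            case '-' :
            case '–' :
            case '—' :
                cleanName[out++] = '_';//replace spaces and dashes with underscores
                parent = ent->d_name[i];
                break;
            ....//more case statements
            default:
                cleanName[out++] = ent->d_name[i];
        }
        //consume every following character equal to the one just handled
        while(parent != '\0' && i + 1 < ent->d_namlen && ent->d_name[i + 1] == parent)
            i++;
    }
 cleanName[out] = '\0';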

The only issue here is that a sequence of interleaved spaces and dashes is not recognized as a single run, so several "_" are kept. One solution would be to keep a collection of parents instead of a single char.
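A rough sketch of the "collection of parents" idea, using std::string for brevity. The helper `isSeparator`, the set " -_" and the function names are assumptions made for illustration; the real set would hold every character that maps to an underscore.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical set of "parents": every character that maps to '_'.
bool isSeparator(char c)
{
    static const std::string separators = " -_";
    return separators.find(c) != std::string::npos;
}

std::string cleanName(const std::string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (isSeparator(in[i])) {
            out.push_back('_');
            // Consume the rest of the run, whichever separators it mixes.
            while (i + 1 < in.size() && isSeparator(in[i + 1]))
                ++i;
        } else {
            out.push_back(in[i] == 'a' ? 'b' : in[i]);
        }
    }
    return out;
}

void example()
{
    assert(cleanName("abbc--d-e") == "bbbc_d_e");
    assert(cleanName("x - _y") == "x_y");   // interleaved run becomes one '_'
}
```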

The other way is to compare the present character with the previous one in each iteration. You keep the "parent" variable, and in each switch case you compare the current character with the parent and don't add it to the cleanName cstring if they're the same.

jcd
0

Here's an idea using the std::unique_copy algorithm and Boost's transform_iterator:

#include <boost/iterator/transform_iterator.hpp>
#include <iterator>
#include <iostream>
#include <string>
#include <algorithm>

char transform(char c)
{
    switch(c) {
        case 'a' :
            return 'b';
        case ' ' :
        case '-' :
        case '–' :
        case '—' :
            return '_';
        default:
            return c;
    }
}

int main()
{
    std::string in = "abbc--d-e";
    std::string out;

    std::unique_copy(
        boost::make_transform_iterator(in.begin(), transform),
        boost::make_transform_iterator(in.end(), transform),
        std::back_inserter(out),
        [](char c1, char c2){ return c1 == '_' && c2 == '_'; });

    std::cout << out; // bbbc_d_e
}
jrok