Can the Duplicate Characters in a string be Identified and Quantified in O(n)?

Question

This comment suggests that there is a O(n) alternative to my O(n log n) solution to this problem:

Given string str("helloWorld") the expected output is:

l = 3
o = 2

My solution was to do this:

sort(begin(str), end(str));

for(auto start = adjacent_find(cbegin(str), cend(str)), finish = upper_bound(start, cend(str), *start); start != cend(str); start = adjacent_find(finish, cend(str)), finish = upper_bound(start, cend(str), *start)) {
   cout << *start << " = " << distance(start, finish) << endl;
}

Which is obviously limited by the sorting of str. I think this would require a bucket sort solution? Is there anything more clever that I'm missing?

Build a hash-table from characters to counters, and increment counters for each characters in the string. This is O(n). Then, iterate over all entries where the value is greater than 1 in the table to find characters which appear more than once (this is O(m), with m <= n). — coredump, Jan 02 '18 at 16:11
@coredump I think you're suggesting an [`unordered_multiset`](http://en.cppreference.com/w/cpp/container/unordered_multiset/unordered_multiset) which has quadratic construction time :( — Jonathan Mee, Jan 02 '18 at 16:16
@JonathanMee -- Why a multiset? All you need is a map of characters to count. A `std::unordered_map` fits. — PaulMcKenzie, Jan 02 '18 at 16:18
No, more like a http://en.cppreference.com/w/cpp/container/unordered_map — coredump, Jan 02 '18 at 16:19
What's wrong with the old-fashioned `int storage[CHAR_MAX - CHAR_MIN + 1];` ? — Bathsheba, Jan 02 '18 at 16:20
@PaulMcKenzie `unordered_multiset` will populate without the need for me to populate it separately, and since size isn't an issue here that seemed easier. — Jonathan Mee, Jan 02 '18 at 16:24
The map is empty at first, so that would be one of the constant constructors (marked (1)) from http://en.cppreference.com/w/cpp/container/unordered_map/unordered_map — coredump, Jan 02 '18 at 16:24
@Bathsheba Derp. I don't know what I'm thinking. Can you just type that up, and I'll accept? — Jonathan Mee, Jan 02 '18 at 16:25
@JonathanMee: Done. Although I tie myself up in knots a little over the whole `- CHAR_MIN` thing. — Bathsheba, Jan 02 '18 at 16:41
@Bathsheba Nothing wrong as long as bytes and chars are the same thing :-) — coredump, Jan 02 '18 at 16:54
@Bathsheba I see, thanks. I was more thinking about UTF-8, etc. but since the question is about `std::string`, it is fair to use an array of chars. — coredump, Jan 02 '18 at 17:01
@Bathsheba: You mostly replaced `std::sort` by [counting_sort](https://en.wikipedia.org/wiki/Counting_sort). — Jarod42, Jan 03 '18 at 00:41

Bathsheba · Accepted Answer · 2018-01-02T16:36:11.627

13

Here's one way, which is O(N) at the expense of maintaining storage for every possible char value.

#include <string>
#include <limits.h> // for CHAR_MIN and CHAR_MAX. Old habits die hard.

int main()
{
    std::string s("Hello World");        
    int storage[CHAR_MAX - CHAR_MIN + 1] = {};
    for (auto c : s){
        ++storage[c - CHAR_MIN];
    }

    for (int c = CHAR_MIN; c <= CHAR_MAX; ++c){
        if (storage[c - CHAR_MIN] > 1){
            std::cout << (char)c << " " << storage[c - CHAR_MIN] << "\n";
        }
    }    
}

This portable solution is complicated by the fact that char can be signed or unsigned.

edited Jan 02 '18 at 16:36

answered Jan 02 '18 at 16:33

Bathsheba

220,365
33
331
451

Use `numeric_limits`: https://stackoverflow.com/questions/48063162/stderase-crash-when-count-duplicate-char-in-a-loop/48063390#comment83102104_48063390 that should make it portable. – Jonathan Mee Jan 02 '18 at 16:35
What does `int storage[CHAR_MAX - CHAR_MIN + 1] = {};` help with exactly? – Khalil Khalaf Jan 02 '18 at 16:35
6

Sets up an array of appropriate size and initialises every element to 0. The C guys still have to write `{0}`. Pity them! – Bathsheba Jan 02 '18 at 16:35

Artavazd Balayan · Answer 2 · 2018-01-03T05:07:02.690

3

Here is what @bathsheba mentioned and with improvements by @Holt:

#include <string>
#include <climits>
#include <iostream>

void show_dup(const std::string& str) {
    const int sz = CHAR_MAX - CHAR_MIN + 1;
    int all_chars[sz] = { 0 };
    // O(N), N - the length of input string
    for(char c : str) {
        int idx = (int)c;
        all_chars[idx]++;
    }
    // O(sz) - constant. For ASCII char it will be 256
    for(int i = 0; i < sz; i++) {
        if (all_chars[i] > 1) {
            std::cout << (char)i << " = " << all_chars[i] << std::endl;
        }
    }
}

int main()
{
  std::string str("helloWorld");

  show_dup(str);
}

edited Jan 03 '18 at 05:07

answered Jan 02 '18 at 16:31

Artavazd Balayan

2,163
1
13
22

Would be better if this was valid c++ (`int all_chars[sz]`) without a `using namespace std;` on top. – Holt Jan 02 '18 at 16:32
@Holt, May I ask what's wrong with my C++ code? It compiles and there are no warnings in pedantic mode. – Artavazd Balayan Jan 02 '18 at 16:34
2

Variable-length arrays are not standard C++, it's a C99 features or a compiler-specific extension. You get warning from both gcc and clang for this, e.g. *"warning: ISO C++ forbids variable length array 'all_chars' [-Wvla]"* from gcc. – Holt Jan 02 '18 at 16:36
1

@Holt, note the way I have it is a compile-time evaluable constant expression. – Bathsheba Jan 02 '18 at 16:38
@Bathsheba In your code yes, not here, unless you add a `constexpr` qualifier to `sz`. – Holt Jan 02 '18 at 16:39
@Holt: Absolutely! – Bathsheba Jan 02 '18 at 16:39
@Holt, seems you're right, thanks. But I was coding in online editor, http://cpp.sh/3n4dc, in C++ 14 mode. And GCC did not complain about anything in pedantic mode. – Artavazd Balayan Jan 02 '18 at 16:46

Can the Duplicate Characters in a string be Identified and Quantified in O(n)?

2 Answers2

Linked