10

This comment suggests that there is a O(n) alternative to my O(n log n) solution to this problem:

Given string str("helloWorld") the expected output is:

l = 3
o = 2

My solution was to do this:

sort(begin(str), end(str));

for(auto start = adjacent_find(cbegin(str), cend(str)), finish = upper_bound(start, cend(str), *start); start != cend(str); start = adjacent_find(finish, cend(str)), finish = upper_bound(start, cend(str), *start)) {
   cout << *start << " = " << distance(start, finish) << endl;
}

Which is obviously limited by the sorting of str. I think this would require a bucket sort solution? Is there anything more clever that I'm missing?

Jonathan Mee
  • 35,107
  • 16
  • 95
  • 241
  • 6
    Build a hash-table from characters to counters, and increment counters for each characters in the string. This is O(n). Then, iterate over all entries where the value is greater than 1 in the table to find characters which appear more than once (this is O(m), with m <= n). – coredump Jan 02 '18 at 16:11
  • @coredump I think you're suggesting an [`unordered_multiset`](http://en.cppreference.com/w/cpp/container/unordered_multiset/unordered_multiset) which has quadratic construction time :( – Jonathan Mee Jan 02 '18 at 16:16
  • @JonathanMee -- Why a multiset? All you need is a map of characters to count. A `std::unordered_map` fits. – PaulMcKenzie Jan 02 '18 at 16:18
  • No, more like a http://en.cppreference.com/w/cpp/container/unordered_map – coredump Jan 02 '18 at 16:19
  • @coredump That also has quadratic construction time. – Jonathan Mee Jan 02 '18 at 16:20
  • 6
    What's wrong with the old-fashioned `int storage[CHAR_MAX - CHAR_MIN + 1];` ? – Bathsheba Jan 02 '18 at 16:20
  • @PaulMcKenzie `unordered_multiset` will populate without the need for me to populate it separately, and since size isn't an issue here that seemed easier. – Jonathan Mee Jan 02 '18 at 16:24
  • The map is empty at first, so that would be one of the constant constructors (marked (1)) from http://en.cppreference.com/w/cpp/container/unordered_map/unordered_map – coredump Jan 02 '18 at 16:24
  • 1
    @Bathsheba Derp. I don't know what I'm thinking. Can you just type that up, and I'll accept? – Jonathan Mee Jan 02 '18 at 16:25
  • @JonathanMee: Done. Although I tie myself up in knots a little over the whole `- CHAR_MIN` thing. – Bathsheba Jan 02 '18 at 16:41
  • 1
    @Bathsheba Nothing wrong as long as bytes and chars are the same thing :-) – coredump Jan 02 '18 at 16:54
  • @Bathsheba I see, thanks. I was more thinking about UTF-8, etc. but since the question is about `std::string`, it is fair to use an array of chars. – coredump Jan 02 '18 at 17:01
  • 1
    @Bathsheba: You mostly replaced `std::sort` by [counting_sort](https://en.wikipedia.org/wiki/Counting_sort). – Jarod42 Jan 03 '18 at 00:41

2 Answers2

13

Here's one way, which is O(N) at the expense of maintaining storage for every possible char value.

#include <string>
#include <limits.h> // for CHAR_MIN and CHAR_MAX. Old habits die hard.

int main()
{
    std::string s("Hello World");        
    int storage[CHAR_MAX - CHAR_MIN + 1] = {};
    for (auto c : s){
        ++storage[c - CHAR_MIN];
    }

    for (int c = CHAR_MIN; c <= CHAR_MAX; ++c){
        if (storage[c - CHAR_MIN] > 1){
            std::cout << (char)c << " " << storage[c - CHAR_MIN] << "\n";
        }
    }    
}

This portable solution is complicated by the fact that char can be signed or unsigned.

Bathsheba
  • 220,365
  • 33
  • 331
  • 451
  • Use `numeric_limits`: https://stackoverflow.com/questions/48063162/stderase-crash-when-count-duplicate-char-in-a-loop/48063390#comment83102104_48063390 that should make it portable. – Jonathan Mee Jan 02 '18 at 16:35
  • What does `int storage[CHAR_MAX - CHAR_MIN + 1] = {};` help with exactly? – Khalil Khalaf Jan 02 '18 at 16:35
  • 6
    Sets up an array of appropriate size and initialises every element to 0. The C guys still have to write `{0}`. Pity them! – Bathsheba Jan 02 '18 at 16:35
3

Here is what @bathsheba mentioned and with improvements by @Holt:

#include <string>
#include <climits>
#include <iostream>

void show_dup(const std::string& str) {
    const int sz = CHAR_MAX - CHAR_MIN + 1;
    int all_chars[sz] = { 0 };
    // O(N), N - the length of input string
    for(char c : str) {
        int idx = (int)c;
        all_chars[idx]++;
    }
    // O(sz) - constant. For ASCII char it will be 256
    for(int i = 0; i < sz; i++) {
        if (all_chars[i] > 1) {
            std::cout << (char)i << " = " << all_chars[i] << std::endl;
        }
    }
}

int main()
{
  std::string str("helloWorld");

  show_dup(str);
}
Artavazd Balayan
  • 2,163
  • 1
  • 13
  • 22
  • Would be better if this was valid c++ (`int all_chars[sz]`) without a `using namespace std;` on top. – Holt Jan 02 '18 at 16:32
  • @Holt, May I ask what's wrong with my C++ code? It compiles and there are no warnings in pedantic mode. – Artavazd Balayan Jan 02 '18 at 16:34
  • 2
    Variable-length arrays are not standard C++, it's a C99 features or a compiler-specific extension. You get warning from both gcc and clang for this, e.g. *"warning: ISO C++ forbids variable length array 'all_chars' [-Wvla]"* from gcc. – Holt Jan 02 '18 at 16:36
  • 1
    @Holt, note the way I have it is a compile-time evaluable constant expression. – Bathsheba Jan 02 '18 at 16:38
  • @Bathsheba In your code yes, not here, unless you add a `constexpr` qualifier to `sz`. – Holt Jan 02 '18 at 16:39
  • @Holt: Absolutely! – Bathsheba Jan 02 '18 at 16:39
  • @Holt, seems you're right, thanks. But I was coding in online editor, http://cpp.sh/3n4dc, in C++ 14 mode. And GCC did not complain about anything in pedantic mode. – Artavazd Balayan Jan 02 '18 at 16:46