Since I read that str_word_count is flawed, I searched for an alternative and came across the following, which works quite well in general, except for one problem.

function count_words($text) {

    // remove HTML tags
    $text = preg_replace('/<[^>]*>/', '', $text);

    // replace the HTML non-breaking space entity with a plain space
    $text = preg_replace(array('/&nbsp;/'), ' ', $text);

    // collapse runs of whitespace into a single space
    $text = trim(preg_replace('!\s+!', ' ', $text));

    // every space-separated token counts as a word, including a lone "-"
    return count(explode(' ', $text));
}

The problem is that it detects a dash "-" as a word.

Example:

This is a title - Additional Info

It will count 7 words instead of 6.

Is there a possibility to exclude single characters like - from this word count?

DanielM
  • Curious where you read that `str_word_count` is flawed. – Dave Mar 28 '19 at 18:34
  • I tested it myself on a larger text and it didn't give me an accurate word count the way Microsoft Word does, for instance. The flaws are also mentioned here: https://stackoverflow.com/questions/4786802/how-to-count-the-words-in-a-specific-string-in-php – DanielM Mar 28 '19 at 18:44

1 Answer

I would just count words:

$count = preg_match_all("/[\w']+/", $text);
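As a quick check against the example title from the question, here is a minimal sketch (the variable name $title is only for illustration). The lone dash contains no word character, so it is never matched:

```php
<?php
// Example title from the question.
$title = 'This is a title - Additional Info';

// preg_match_all() returns the number of full pattern matches;
// each run of word characters and apostrophes is one match.
$count = preg_match_all("/[\w']+/", $title);

echo $count, PHP_EOL; // 6
```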

To get the functionality of removing HTML tags and HTML entities:

$count = preg_match_all("/[\w']+/", html_entity_decode(strip_tags($text), ENT_QUOTES));

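It is probably better to list explicitly what you think makes up a word. The class below keeps only letters and apostrophes, so the digits and underscores that \w would match are not counted; add any other characters you need. The i flag makes the match case-insensitive: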

$count = preg_match_all("/[a-z']+/i", html_entity_decode(strip_tags($text), ENT_QUOTES));
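Wrapped into a function, this could look like the sketch below. The name count_translatable_words is hypothetical, and the sketch assumes PHP's default UTF-8 charset, so that &nbsp; is decoded to a non-breaking space rather than left as text:

```php
<?php
/**
 * Hypothetical helper: counts runs of ASCII letters and apostrophes
 * after stripping HTML tags and decoding HTML entities.
 */
function count_translatable_words(string $text): int
{
    $plain = html_entity_decode(strip_tags($text), ENT_QUOTES);

    // preg_match_all() returns false on error; treat that as zero words.
    $count = preg_match_all("/[a-z']+/i", $plain);

    return $count === false ? 0 : $count;
}

echo count_translatable_words('This is a title - Additional Info'), PHP_EOL;       // 6
echo count_translatable_words("<p>I don't have 3 apples&nbsp;today</p>"), PHP_EOL; // 5
```

In the second call "don't" counts as one word and "3" is skipped. Note that [a-z] covers only ASCII letters, so a word with an accented letter in the middle, such as "naïve", would be counted as two unless more characters are added to the class.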
AbraCadaver
  • Good, thanks! Any way to let it count "don't" as one word instead of two? – DanielM Mar 29 '19 at 07:26
  • Perfect. Is there a possibility to also exclude digits? :) The word count is for a translation, so that the human translator knows how many words there are to translate. Since digits/numbers don't need to be translated, I'd like to not count them as words. – DanielM Mar 29 '19 at 19:11
  • Sorry, I don't really understand. Using only the last line counts almost every character. Or is it supposed to be used in combination with the line above? If so, how? – DanielM Mar 30 '19 at 04:46
  • Sorry I deleted the `+` somehow. – AbraCadaver Mar 30 '19 at 19:18