Remove stop words from searchstring in PHP

Question

I am having problems with a PHP function for optimizing a search string for an MSSQL query.

I need to find an entry that looks like 'hobbit, the' by searching for 'the hobbit'. I thought about cutting the articles (in German we have 'der', 'die' and 'das') out of the search string if they have a trailing space.

My function looks like this:

      public function optimizeSearchString($searchString)
      {
        $articles = [
          'der ',
          'die ',
          'das ',
          'the '
        ];


        foreach ($articles as $article) {
//only cut $article out of $searchString if its longer than the $article itself
          if (strlen($searchString) > strlen($article) && strpos($searchString, $article)) {
            $searchString = str_replace($article, '', $searchString);
            break;
          }
        }

        return $searchString;
      }

But this doesn't work...

Maybe there is a nicer solution using regular expressions?

The test `strlen($searchString) > strlen($article)` is useless, remove it. `strpos` may return 0, which is interpreted as false, so you must write `strpos(...) !== false`. Instead of testing first, replace directly. That way you parse the string only once. — Casimir et Hippolyte, Aug 31 '15 at 12:04
The advantage of using `preg_replace` here is that word boundaries delimit the words and avoid false positives, and an alternation removes all of them in one pass. The pattern is not difficult, a quick regex tutorial will solve the problem. — Casimir et Hippolyte, Aug 31 '15 at 12:09
I tried `$optimizedString = preg_replace("/(der\s|die\s|das\s|the\s)/", '', $searchString);` but this seems not to work... — bambamboole, Aug 31 '15 at 12:38
"Not to work" is not useful information. What happens (error message, same string, other)? — Casimir et Hippolyte, Aug 31 '15 at 12:39
I search for the string 'der hobbit' and it should search for 'hobbit', because the part 'der ' should be replaced with the empty string '', but the result is empty. If I search for 'hobbit' there are results. — bambamboole, Aug 31 '15 at 12:43
@bambamboole See [an idea with splitting](https://eval.in/425363) and removing *stopwords*. — Jonny 5, Aug 31 '15 at 12:44
Take care to return `$optimizedString` and not `$searchString` — Casimir et Hippolyte, Aug 31 '15 at 12:45
[It's working for me](https://eval.in/425367). Can you be more specific about your errors and inputs, along with the expected outputs? — Narendrasingh Sisodia, Aug 31 '15 at 12:48
OK, all solutions seem to work, but I don't know which part of the whole legacy code breaks it. — bambamboole, Aug 31 '15 at 12:55
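
The comments above make two points: `strpos` returns 0 for a match at the very start of the string, which the original `if` treats as false, and `preg_replace` with word boundaries can drop all articles in one pass. A minimal sketch of both, assuming the article list from the question and UTF-8 input; the function name `removeArticles` is made up for illustration:

```php
<?php
// strpos() returns the match offset; 0 means "found at the start", not "not found".
$pos = strpos('der hobbit', 'der ');
var_dump($pos);            // int(0)
var_dump($pos !== false);  // bool(true)

// Hypothetical helper: removes the listed articles as whole words, case-insensitively.
function removeArticles(string $searchString): string
{
    // \b keeps "the" from matching inside words such as "theater";
    // the u flag assumes the input is UTF-8 (an assumption, not stated in the question)
    $stripped = preg_replace('/\b(?:der|die|das|the)\b\s*/iu', '', $searchString);

    // preg_replace() returns null on error; fall back to the input in that case
    return trim($stripped ?? $searchString);
}

echo removeArticles('the hobbit'), "\n";       // hobbit
echo removeArticles('theater der welt'), "\n"; // theater welt
```

Note that this sketch would turn a search for just 'the' into an empty string, so a caller may want to fall back to the original input in that case.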

score 4 · Answer 1 · edited May 23 '17 at 10:31

1.) To remove just one stopword from the start or end of the string, use a regex like this:

~^\W*(der|die|das|the)\W+\b|\b\W+(?1)\W*$~i

~ is the pattern delimiter
^ the caret anchor matches the start of the string
\W (uppercase) is shorthand for a character that is not a word character
(der|die|das|the) is an alternation | in the first parenthesized group
\b matches a word boundary
(?1) reuses the pattern of the first group at that position
$ matches right after the last character in the string
The i (PCRE_CASELESS) flag is used. If the input is UTF-8, the u (PCRE_UTF8) flag is also needed.

Reference - What does this regex mean

Generate the pattern:

// array containing stopwords
$stopwords = array("der", "die", "das", "the");

// escape the stopword array and implode with pipe
$s = '~^\W*('.implode("|", array_map("preg_quote", $stopwords)).')\W+\b|\b\W+(?1)\W*$~i';

// replace with emptystring
$searchString = preg_replace($s, "", $searchString);

Note that if ~ delimiter occurs in the $stopwords array, it also has to be escaped with a backslash.
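
`preg_quote` accepts the delimiter as an optional second argument, so the escaping mentioned in the note above can happen in the same step. A sketch under that assumption; the helper name `buildEdgeStopwordPattern` is hypothetical and not part of the original answer:

```php
<?php
// Hypothetical helper: builds the start/end pattern from a stopword list.
function buildEdgeStopwordPattern(array $stopwords): string
{
    // The second argument makes preg_quote() escape the ~ delimiter as well
    $quoted = array_map(function ($word) {
        return preg_quote($word, '~');
    }, $stopwords);

    return '~^\W*(' . implode('|', $quoted) . ')\W+\b|\b\W+(?1)\W*$~i';
}

$pattern = buildEdgeStopwordPattern(array('der', 'die', 'das', 'the'));

echo preg_replace($pattern, '', 'The Hobbit'), "\n";  // Hobbit
echo preg_replace($pattern, '', 'Hobbit, the'), "\n"; // Hobbit
echo preg_replace($pattern, '', 'the'), "\n";         // the
```

The last call shows that a string consisting of a single stopword is left as it is, because the pattern requires at least one non-word character next to the stopword.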

PHP test at eval.in, Regex pattern at regex101

2.) To remove stopwords anywhere in the string, how about splitting it into words:

// words to be removed
$stopwords = array(
'der' => 1,
'die' => 1,
'das' => 1,
'the' => 1);
# used words as key for better performance

// remove stopwords from string
function strip_stopwords($str = "")
{
  global $stopwords;

  // 1.) break string into words
  // [^-\w\'] matches characters, that are not [0-9a-zA-Z_-']
  // if input is unicode/utf-8, the u flag is needed: /pattern/u
  $words = preg_split('/[^-\w\']+/', $str, -1, PREG_SPLIT_NO_EMPTY);

  // 2.) if we have at least 2 words, remove stopwords
  if(count($words) > 1)
  {
    $words = array_filter($words, function ($w) use (&$stopwords) {
      return !isset($stopwords[strtolower($w)]);
      # if utf-8: mb_strtolower($w, "utf-8")
    });
  }

  // check if not too much was removed such as "the the" would return empty
  if(!empty($words))
    return implode(" ", $words);
  return $str;
}

See demo at eval.in, ideone.com

// test it
echo strip_stopwords("The Hobbit das foo, der");

Hobbit foo

This solution also removes any punctuation except _ - ' because the remaining words are joined with spaces after the common words are removed. The idea is to prepare the string for a query.

Neither solution modifies the case, and both leave the string unchanged if it consists of only one stopword.
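
The code comments in the second solution mention the `u` flag and `mb_strtolower` for UTF-8 input. A sketch of that variant, assuming the mbstring extension is available; the function name `strip_stopwords_utf8` and passing the stopword list as a parameter (instead of using `global`) are choices made here for illustration:

```php
<?php
// Hypothetical UTF-8 variant; requires the mbstring extension.
function strip_stopwords_utf8(string $str, array $stopwords): string
{
    // u flag: treat pattern and subject as UTF-8, so \w also covers letters like "ü"
    $words = preg_split('/[^-\w\']+/u', $str, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on error (e.g. invalid UTF-8); keep the input then
    if ($words === false) {
        return $str;
    }

    // only filter when there are at least two words
    if (count($words) > 1) {
        $words = array_filter($words, function ($w) use ($stopwords) {
            return !isset($stopwords[mb_strtolower($w, 'UTF-8')]);
        });
    }

    // if everything was removed (e.g. "the the"), return the original string
    return empty($words) ? $str : implode(' ', $words);
}

$stopwords = array('der' => 1, 'die' => 1, 'das' => 1, 'the' => 1);

echo strip_stopwords_utf8('Die Brücke', $stopwords), "\n"; // Brücke
```

Without the `u` flag, the bytes of "ü" would typically not count as word characters, so "Brücke" would be split into two fragments.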

Lists of common words

Most common words in English ^Wikipedia
Most frequent words in German language ^Wikipedia
MySQL: English full-text stopwords
Default English stopwords list
List of German stopwords

Could you please explain why you are passing `$stopwords` by reference in the `array_filter` closure in your second code? I ask because of [this](http://stackoverflow.com/a/3845530/4946451) post about value vs. reference performance. Wouldn't it be better to pass by value here? — arkuuu, Apr 25 '17 at 07:56
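
On the question in the comment above: PHP arrays are copy-on-write, so capturing `$stopwords` by value does not duplicate the array as long as the closure only reads it. Since the closure never writes to `$stopwords`, both forms should give the same result. A sketch of the by-value form; the function name `filter_stopwords_by_value` is hypothetical:

```php
<?php
// Hypothetical variant of the filter step with by-value capture.
function filter_stopwords_by_value(array $words, array $stopwords): array
{
    // use ($stopwords) captures by value; no & is needed because the closure only reads it
    $kept = array_filter($words, function ($w) use ($stopwords) {
        return !isset($stopwords[strtolower($w)]);
    });

    // array_filter() preserves keys, so reindex for a clean list
    return array_values($kept);
}

$stopwords = array('der' => 1, 'die' => 1, 'das' => 1, 'the' => 1);

print_r(filter_stopwords_by_value(array('The', 'Hobbit', 'das', 'foo'), $stopwords));
```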

score 3 · Accepted Answer · answered Aug 31 '15 at 13:18

The solution provided by @Jonny 5 seems to be the best fit for my case.

Now I use a function like this:

  public function optimizeSearchString($searchString = "")
  {
    $stopwords = array(
      'der' => 1,
      'die' => 1,
      'das' => 1,
      'the' => 1);

    $words = preg_split('/[^-\w\']+/', $searchString, -1, PREG_SPLIT_NO_EMPTY);

    if (count($words) > 1) {
      $words = array_filter($words, function ($v) use (&$stopwords) {
        return !isset($stopwords[strtolower($v)]);
      });
    }

    if (empty($words)) {
      return $searchString;
    }

    return implode(" ", $words);
  }

Jonny 5's new solution would also work, but I use this one because I'm not that familiar with regex and I know what's going on here :-)

Great that helped! I also posted an answer with another solution :] — Jonny 5, Aug 31 '15 at 13:21

score 2 · Answer 3 · answered Dec 08 '16 at 10:17

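This is what I do.

public function optimizeSearchString($searchString) {
    // $stopwords was undefined in the original snippet; this list is assumed from the question
    $stopwords = array('der', 'die', 'das', 'the');
    // format 1 makes str_word_count() return the words as an array
    $wordsFromSearchString = str_word_count($searchString, 1);
    // note: array_diff() compares case-sensitively, so "The" is not removed
    $finalWords = array_diff($wordsFromSearchString, $stopwords);
    return implode(" ", $finalWords);
}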

answered Dec 08 '16 at 10:17 by Yashrajsinh Jadeja

Victor Stoddard · Answer 4 · 2019-04-16T16:01:06.440

I made a different version using array_diff, which @Yashrajsinh Jadeja also did. I added a third parameter 'strcasecmp' to ignore case and made the input an array using a simple word tokenizer.

//Search string with article
$searchString = "Das blaue Haus"; //"The blue house"

//Split string into array. (This method is insufficient and doesn't account for compound nouns like "blue jay" or "einfamilienhaus".)
$wordArray = preg_split('/[^-\w\']+/', $searchString, -1, PREG_SPLIT_NO_EMPTY); 

var_dump(optimizeSearchString($wordArray));

function optimizeSearchString($wordArray) {
  $articles = array('der', 'die', 'das', 'the');
  $newArray = array_udiff($wordArray, $articles, 'strcasecmp');
  return $newArray;
}

Output:

array(2) {
  [1]=>
  string(5) "blaue"
  [2]=>
  string(4) "Haus"
}
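
As the output shows, `array_udiff` keeps the original keys (1 and 2 here). To hand the result to a query as a string again, the array can be reindexed and joined. A sketch along the same lines; the function name `optimizeSearchStringToText` is made up for illustration:

```php
<?php
// Hypothetical wrapper: tokenizes, removes articles case-insensitively, returns a string.
function optimizeSearchStringToText(string $searchString): string
{
    $articles = array('der', 'die', 'das', 'the');

    // same simple tokenizer as above; preg_split() may return false on error
    $wordArray = preg_split('/[^-\w\']+/', $searchString, -1, PREG_SPLIT_NO_EMPTY);
    if ($wordArray === false) {
        return $searchString;
    }

    // strcasecmp makes the comparison case-insensitive; array_values() reindexes from 0
    $remaining = array_values(array_udiff($wordArray, $articles, 'strcasecmp'));

    // join the remaining words with single spaces
    return implode(' ', $remaining);
}

echo optimizeSearchStringToText('Das blaue Haus'), "\n"; // blaue Haus
```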

score 0 · Answer 5 · edited Apr 15 '21 at 05:23

public function optimizeSearchString($searchString)
{
        $articles = array(
          'der ',
          'die ',
          'das ',
          'the '
        );


        foreach ($articles as $article) {
         //only cut $article out of $searchString if its longer than the $article itself
          if (strlen($searchString) > strlen($article) && strpos($searchString, $article)) {
            $searchString = str_replace($article, '', $searchString);
            break;
          }
        }

        return $searchString;
}

edited Apr 15 '21 at 05:23 by a.ak · answered Apr 15 '21 at 04:38 by user15642852

It looks like you just copied the question code and reposted it. At a minimum, you should provide context as to what this code does differently or how it answers the question. – Kat Apr 15 '21 at 05:31
