PHP performant search a text for given usernames

Question

I am currently dealing with a performance issue that I cannot find a way to fix. I want to search a text for usernames mentioned with an @ sign in front. The list of usernames is available as a PHP array.

The problem is that usernames may contain spaces or other special characters; there is no restriction on them, so I can't find a regex that deals with that. Currently I am using a function which takes the whole line after the @ and checks character by character which usernames could still match the mention, until just one username is left that fully matches it. But for a long text with 5 mentions it takes several seconds (!!!) to finish. For more than 20 mentions the script runs endlessly.

I have some ideas, but I don't know if they may work.

Go through the username list (could be 1,000 names or more) and search for every @Username without regex, just string search. I would say this would be far more inefficient.
When the username is written, check with JavaScript whether it contains a space or punctuation mark, and if so surround it with quotation marks, like @"User Name". I don't like that idea; it looks dirty to the user.
Don't start with one character, but maybe 4, and go back if there is no match. So the same principle as in sorting algorithms: divide and conquer. Could be difficult to implement and will maybe lead to nothing.
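
A variant of the plain string search idea does not have to loop over the whole username list for every mention. The following is only a sketch under stated assumptions: the function name find_mentions() and the returned keys are hypothetical, usernames are assumed to be the array keys, and matching is exact and byte-wise (no utf8_clean_string() normalization as in the real code). For each "@" it tries the longest possible candidate first and checks it with isset(), so the work per mention is bounded by the length of the longest username rather than by the number of users.

```php
<?php
// Sketch only: find_mentions() and its parameters are hypothetical names.
// $usernames maps each exact username to arbitrary user data.
function find_mentions(string $text, array $usernames): array
{
    // The longest username (in bytes) bounds how far we look past each "@".
    $max_len = 0;
    foreach ($usernames as $name => $_) {
        $max_len = max($max_len, strlen((string) $name));
    }

    $mentions = [];
    $offset = 0;
    while (($at = strpos($text, '@', $offset)) !== false) {
        $start = $at + 1;
        $offset = $start;

        // Require start of text or whitespace before "@", like the original regex.
        if ($at > 0 && !ctype_space($text[$at - 1])) {
            continue;
        }

        // Candidates never cross the end of the line.
        $eol = strpos($text, "\n", $start);
        $avail = ($eol === false ? strlen($text) : $eol) - $start;

        // Longest candidate first, so "Firstname Lastname" wins over "Firstname".
        for ($len = min($max_len, $avail); $len > 0; $len--) {
            $candidate = substr($text, $start, $len);
            if (isset($usernames[$candidate])) {
                $mentions[] = ['name' => $candidate, 'start' => $start, 'length' => $len];
                $offset = $start + $len;
                break;
            }
        }
    }

    return $mentions;
}
```

Because the lookup is a hash check, the size of the user list only matters once, when the maximum length is computed. Ambiguous cases such as "@Wolfs garden" still resolve to the longest existing name that fits, so this does not remove the need for a tie-breaking rule.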

How do Facebook, Twitter, or any other site do this? Do they parse the text while the user is typing and save the mentioned usernames directly in the stored text of the message?

This is my current function:

$regular_expression_match = '#(?:^|\\s)@(.+?)(?:\n|$)#';
$matches = false;
$offset = 0;

while (preg_match($regular_expression_match, $post_text, $matches, PREG_OFFSET_CAPTURE, $offset))
{
    $line = $matches[1][0];
    $search_string = substr($line, 0, 1);
    $filtered_usernames = array_keys($user_list);
    $matched_username = false;

    // Loop, make the search string one by one char longer and see if we have still usernames matching
    while (count($filtered_usernames) > 1)
    {
        $filtered_usernames = array_filter($filtered_usernames, function ($username_clean) use ($search_string, &$matched_username) {
            $search_string = utf8_clean_string($search_string);

            if (strlen($username_clean) == strlen($search_string))
            {
                if ($username_clean == $search_string)
                {
                    $matched_username = $username_clean;
                }
                return false;
            }

            return (substr($username_clean, 0, strlen($search_string)) == $search_string);
        });

        if ($search_string == $line)
        {
            // We have reached the end of the line, so stop
            break;
        }
        $search_string = substr($line, 0, strlen($search_string) + 1);
    }

    //  If there is still one in filter, we check if it is matching
    $first_username = reset($filtered_usernames);
    if (count($filtered_usernames) == 1 && utf8_clean_string(substr($line, 0, strlen($first_username))) == $first_username)
    {
        $matched_username = $first_username;
    }

    // We can assume that $matched_username is the longest matching username we have found due to iteration with growing search_string
    // So we use it now as the only match (Even if there are maybe shorter usernames matching too. But this is nothing we can solve here,
    // This needs to be handled by the user, honestly. There is a autocomplete popup which tells the other, longer fitting name if the user is still typing,
    // and if he continues to enter the full name, I think it is okay to choose the longer name as the chosen one.)
    if ($matched_username)
    {
        $startpos = $matches[1][1];

        // We need to get the endpos, cause the username is cleaned and the real string might be longer
        $full_username = substr($post_text, $startpos, strlen($matched_username));
        while (utf8_clean_string($full_username) != $matched_username)
        {
            $full_username = substr($post_text, $startpos, strlen($full_username) + 1);
        }

        $length = strlen($full_username);
        $user_data = $user_list[$matched_username];

        $mentioned[] = array_merge($user_data, array(
            'type'          => self::MENTION_AT,
            'start'         => $startpos,
            'length'        => $length,
        ));
    }

    $offset = $matches[0][1] + strlen($search_string);
}

Which way would you go? The problem is that the text will be displayed often, and parsing it every time will consume a lot of time, but I don't want to heavily modify what the user entered.

I can't figure out what the best way is, or even why my function is so time-consuming.

A sample text would be:

Okay, @Firstname Lastname, I mention you! Listen @[TEAM] John, you are a team member. @Test is a normal name, but @Thât♥ should be tracked too. And see @Wolfs garden! I just mean the Wolf.

Usernames in that text would be

Firstname Lastname
[TEAM] John
Test
Thât♥
Wolf

So yes, there is clearly nothing that tells me where a name may end. The only boundary is the newline.

@HardikPatel: Yes, the usernames and also the text (post_text) are coming from the database. — Wolfsblvt, Jan 30 '15 at 12:42
You have a text that could contain usernames such as `@user1: Lorem ipsum dolor sit amet @user2 asdfasdf...` and now you want to extract those `@...` names and check if they exist in user array is that right? If text contains few usernames and array many, extract them from text using regex and lookup in array, where username is the key `if(isset($user[$user_search])) { do it; }` that's fast. If the userlist is huge, possibly best to lookup in db and not load all users into an array. — Jonny 5, Jan 30 '15 at 12:49
@Jonny5: Yes, that is all correct, except for one problem. What should I extract with the regex? Usernames can contain spaces or dots or any other chars except quotation marks. Just take this message of mine here. If I stop the string at the ":" after your name, what about users like "iam:cool", for example? I don't know which chars the username might contain, so I can't easily extract all mentions in the text. — Wolfsblvt, Jan 30 '15 at 12:54
You can't really match `Firstname Lastname` as it has a space in it - how would you differentiate the second word from an ordinary one? See Stack Overflow: usernames here can have spaces, but the `@` form in tab completion does not - spaces are stripped. That's a good compromise, I think. — halfer, Jan 30 '15 at 14:49
@halfer: That would be okay, but "FirstnameLastname" is a valid username too. So who would be mentioned when writing `@FirstnameLastname`? — Wolfsblvt, Jan 30 '15 at 15:04

score 2 · Answer 1 · edited May 23 '17 at 12:28


I think the main problem is that you can't distinguish usernames from ordinary text, and it's a bad idea to look up possibly thousands of usernames in a text. This can also lead to further problems, for example that John is part of [TEAM] John or JohnFoo...

The usernames need to be separated from the other text. Assuming that you're using UTF-8, you could put the usernames between an invisible zero-width space (\xE2\x80\x8B) and a zero-width non-joiner (\xE2\x80\x8C).

The usernames can then be extracted quickly and with little effort and, if needed, still be verified in the database.

$txt = "
Okay, @\xE2\x80\x8BFirstname Lastname\xE2\x80\x8C, I mention you!
Listen @\xE2\x80\x8B[TEAM] John\xE2\x80\x8C, you are a team member.
@\xE2\x80\x8BTest\xE2\x80\x8C is a normal name, but 
@\xE2\x80\x8BThât♥\xE2\x80\x8C should be tracked too.
And see @\xE2\x80\x8BWolfs\xE2\x80\x8C garden! I just mean the Wolf.";

// extract usernames
if(preg_match_all('~@\xE2\x80\x8B\K.*?(?=\xE2\x80\x8C)~s', $txt, $out)){
  print_r($out[0]);
}

Array ( [0] => Firstname Lastname [1] => [TEAM] John [2] => Test [3] => Thât♥ [4] => Wolfs )

echo $txt;

Okay, @Firstname Lastname, I mention you!
Listen @[TEAM] John‌, you are a team member.
@Test‌ is a normal name, but 
@Thât♥‌ should be tracked too.
And see @Wolfs‌ garden! I just mean the Wolf.

You could use any characters you like for separation, as long as they are unlikely to occur elsewhere in the text.
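
To make the delimiter choice easy to change, it could be kept in one place. This is a minimal sketch, not part of the original answer: the constant and function names (MENTION_OPEN, MENTION_CLOSE, wrap_mention(), extract_mentions(), strip_mention_delimiters()) are hypothetical, and the delimiters are simply the two used above.

```php
<?php
// Hypothetical names; the delimiters are the ones used in the answer,
// but any pair that does not occur in normal text would do.
const MENTION_OPEN  = "\xE2\x80\x8B"; // U+200B zero-width space
const MENTION_CLOSE = "\xE2\x80\x8C"; // U+200C zero-width non-joiner

// Wrap a username so it can be found later, e.g. when autocomplete inserts it.
function wrap_mention(string $username): string
{
    return '@' . MENTION_OPEN . $username . MENTION_CLOSE;
}

// Return every delimited username in $text, in order of appearance.
function extract_mentions(string $text): array
{
    // preg_quote() keeps the pattern valid if the delimiters are changed
    // to characters that have a special meaning in a regex.
    $pattern = '~@' . preg_quote(MENTION_OPEN, '~')
        . '\K.*?(?=' . preg_quote(MENTION_CLOSE, '~') . ')~s';
    preg_match_all($pattern, $text, $out);
    return $out[0];
}

// Remove the delimiters again, e.g. for plain-text output.
function strip_mention_delimiters(string $text): string
{
    return str_replace([MENTION_OPEN, MENTION_CLOSE], '', $text);
}
```

For example, 'Listen ' . wrap_mention('[TEAM] John') produces text from which extract_mentions() returns the full name including the space, without consulting the user list at all.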

Regex FAQ, Test at eval.in (link will expire soon)

answered Jan 30 '15 at 13:42 by Jonny 5 · edited May 23 '17 at 12:28 by Community

This is a neat idea. I had thought about delimiters, but invisible ones are a great idea. That would fix all the problems I have with finding the usernames in the text. The only problem I see is how to add the Unicode characters to the mention when the user types it. I am using a jQuery library for the autocomplete feature, so it would be possible to insert the chosen name together with the characters, but what if the poster just copy'n'pastes text? No one would know why it does not match. Or should I search the text on submit for an @ not followed by the invisible char and add it? That may work.. I have to test that! – Wolfsblvt Jan 30 '15 at 13:57
@Wolfsblvt I don't know about your system but the delimiters should not be stored in db, only used in the text for separation I think. Also you could put the starting `\xE2\x80\x8B` before `@` and change the regex. All up to you :) – Jonny 5 Jan 30 '15 at 14:06
This is for an extension of phpBB. So users are mentioned in a post. The post is stored in the database. And if the post is being displayed, mentions should be highlighted and colored in the right username color. So I have to store it in the database with the delimiter, cause I need to find the names when displaying the post. – Wolfsblvt Jan 30 '15 at 14:09
Ah, and may I ask you to remove my real-life name from the comment? I hadn't thought about it, but it may be a bad idea to have it on this post forever :-/ – Wolfsblvt Jan 30 '15 at 14:09
Okay, so I have to find a way to get those chars around the username on posting. With the autocomplete it works. And if the name is copied or typed manually, I have to do additional parsing for non-surrounded names. Maybe the same way I am doing it now, but with something like `/(?:^|\\s)@[^\xE2\x80\x8B](.+?)(?:\n|$)/`, and then surround them. I'll test that in the next few days, and if it works you'll get your deserved "accepted answer" (: Thank you for your help. – Wolfsblvt Jan 30 '15 at 14:20
@Wolfsblvt The idea is that you need either unique usernames or delimiters to separate users from text. Possibly the best way would be unique usernames, such as only allowing `@u_\w+`, but that's obviously too late and also not perfect :) About your regex for the non-surrounded names: do you mean something like `\b@[^\xE2\x80\x8B](.+?)\b`? I'm not sure. There should not be any other `@...` besides emails in the text anymore if you use that. I hope you find a solution for your needs! You're welcome. – Jonny 5 Jan 30 '15 at 14:31
