Javascript, replace with regex, is it possible for this case?

Question

Is it possible to solve this with a regex?

I have an array of values:

var arr = ['eiusmod', 'sort', 'incididunt', 'dolor'];

And I have a string named my_html, which comes from .html():

<div data-sort="1">
<h1 data-position="1">Lorem ipsum dolor sit amet</h1>
<strong>search here : consectetur adipiscing elit, </strong>
<div>
sed do <u>eiusmod</u> tempor <mark>incididunt</mark> ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
</div>
Duis aute irure <i>dolor</i> in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>

The objective:

Add an HTML tag in the variable my_html around each word (from my arr) that is found, like this:

<div data-sort="1">
<h1 data-position="1">Lorem ipsum <mark>dolor</mark> sit amet</h1>
<strong>search here : consectetur adipiscing elit, </strong>
<div>
sed do <u><mark>eiusmod</mark></u> tempor <mark>incididunt</mark> ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
</div>
Duis aute irure <i><mark>dolor</mark></i> in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>

Rules :

Don't add a tag inside the attributes of a tag
Don't add a mark tag if the word is already wrapped in a mark

Thanks for your help, guys. Regards

*"Rules : Don't add a tag in attributes of tag"* **Boom**. You need an HTML parser. You cannot do this reliably with a single regex on an HTML string. You will get answers providing seemingly-good regex solutions. Don't be tempted. Use a parser. There's one available for your environment, guaranteed. — T.J. Crowder, Aug 09 '18 at 09:51
Please tag appropriately. This has nothing to do with [tag:preg-match]. — T.J. Crowder, Aug 09 '18 at 09:52
I knew I'd already answered a question like this :D You need to scan the text nodes for their content. — Niet the Dark Absol, Aug 09 '18 at 09:53
@NiettheDarkAbsol - Frankly, that dupetarget doesn't look correct for this question. Starting and ending points are quite different. — T.J. Crowder, Aug 09 '18 at 09:55
Related: https://stackoverflow.com/questions/49794417/javascript-remove-html-tags-modify-tags-text-and-insert-tags-back-in — T.J. Crowder, Aug 09 '18 at 10:00

T.J. Crowder · Accepted Answer · 2018-08-09T10:43:00.937

Rules : Don't add a tag in attributes of tag

You cannot do this with just a simple regular expression; you need an HTML parser. If you're doing this in a browser environment, there's one built-in for you. But almost no matter what environment you're doing this in, there's an HTML parser available for it (Node.js, Java, PHP, ...).

This answer shows how to do this in a browser. For completeness, here's that code adapted to your example (see comments):

// The array
var arr = ['eiusmod', 'sort', 'incididunt', 'dolor'];
// Create a regular expression that's an alternation of the words.
// This assumes no characters in the words that are special in regular
// expressions; if that assumption isn't valid, run the array through
// a regex-escaper function first.
var rex = new RegExp("\\b(?:" + arr.join("|") + ")\\b", "g");

// The string
var str =
    "<div data-sort=\"1\">" +
    "<h1 data-position=\"1\">Lorem ipsum dolor sit amet</h1>" +
    "<strong>search here : consectetur adipiscing elit, </strong>" +
    "<div>" +
    "sed do <u>eiusmod</u> tempor incididunt dolor ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat." +
    "</div>" +
    "Duis aute irure <i>dolor</i> in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum." +
    "</div>";

// Parse it into a temporary div
var div = document.createElement("div");
div.innerHTML = str;

// Do the updates
doReplacements(div);
console.log("done");

// Get and show the result
str = div.innerHTML;
console.log(str);

function doReplacements(element) {
    // Loop through the children of this element
    var child = element.firstChild;
    while (child) {
        switch (child.nodeType) {
            case 3: // Text node
                // Update its text
                child = handleText(child);
                break;
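            case 1: // Element
                // Recurse to handle this element's children, unless it is
                // already a <mark> (rule: don't mark a word twice)
                if (child.nodeName !== "MARK") {
                    doReplacements(child);
                }
                child = child.nextSibling;
                break;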
            default:
                child = child.nextSibling;
                break;
        }
    }
}

function handleText(node) {
  var match, targetNode, followingNode, wrapper;

  // Find the first of our target words in this node's text
  rex.lastIndex = 0;
  match = rex.exec(node.nodeValue);
  if (match) {
    // Split at the beginning of the match
    targetNode = node.splitText(match.index);

    // Split at the end of the match
    followingNode = targetNode.splitText(match[0].length);

    // Wrap the target in a "mark" element
    wrapper = document.createElement('mark');
    targetNode.parentNode.insertBefore(wrapper, targetNode);

    // Now we move the target text inside it
    wrapper.appendChild(targetNode);

    // Clean up any empty nodes (in case the target text
    // was at the beginning or end of a text node)
    if (node.nodeValue.length == 0) {
      node.parentNode.removeChild(node);
    }
    if (followingNode.nodeValue.length == 0) {
      followingNode.parentNode.removeChild(followingNode);
    }
  }
  
  // Return the next node to process, which is the sibling after our
  // wrapper if we added one, or after `node` if we didn't
  return (wrapper || node).nextSibling;
}
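
The comment in the code above mentions running the array through a regex-escaper if the words may contain special characters. A minimal sketch of that step follows; `escapeRegExp` and `buildWordRegex` are hypothetical helper names that are not part of the answer, and the character class is the commonly used set of regex metacharacters.

```javascript
// Hypothetical helper (not part of the answer above): escape characters
// that have special meaning in a regular expression.
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: build the same kind of alternation as `rex`,
// but from escaped words.
function buildWordRegex(words) {
    return new RegExp("\\b(?:" + words.map(escapeRegExp).join("|") + ")\\b", "g");
}

// "a.b" only matches a literal dot, so "axb" is not picked up.
var sample = buildWordRegex(["dolor", "a.b"]);
console.log("dolor axb a.b".match(sample)); // [ 'dolor', 'a.b' ]
```

With this in place, `rex` could be created as `buildWordRegex(arr)` instead of joining the raw array.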

I don't see the mark tag being added; see here: https://jsfiddle.net/c9q0ashv/ — Greg, Aug 09 '18 at 10:12
@Greg - Sorry, my brain took a few minutes off there. :-) I've fixed it now. — T.J. Crowder, Aug 09 '18 at 10:43

SamWhan · Answer 2 · 2018-08-10T09:01:30.707

Edit
Changed the answer to handle cases in comments. But you've got an elegant solution from T.J. and should use that. Just modify his regex to handle diacritics (one way of doing it can be seen in my RE).

Also, this new solution uses the sticky flag, which IE won't handle.

Another regex answer (please don't hate me ;)

The RE:

/<mark>[^<>]+<\/mark>|<[^>]*(?=>)|(^|[^a-zA-Z\u00C0-\u017F])(eiusmod|sort|incididunt|dolor|única)(?=[^a-zA-Z\u00C0-\u017F]|$)|[\s\S]/yi

It now relies more on the surrounding code, but should work. It uses alternation to identify parts of the input. Thanks to the sticky flag, y, it is forced to match every part of the input string. The alternatives, in descending priority:

match any <mark> already in place.
match tags, e.g. <div class="pa-title" data-title-en="" style="margin-left:0px;">
capture a word from the list, preceded by a non-letter character (letters include those with diacritics) or the start of the input (also captured), making sure it's followed by a non-letter character or the end of the input.
match any one character

This is repeated until no match is made. The resulting string is built from the results of the matches. If capture group 2 is present, meaning it's a matched word from the list, the mark tag is added around the word.

But as several people have pointed out, if you're, for example, attempting to scrape arbitrary web pages, it's bound to fail, so use an HTML parser. Consider the words being used in an attribute while fulfilling the conditions mentioned above...

If it's a limited, known, set of pages you're working with, it could be viable to use regex.

And live it looks like this:

const regex = /<mark>[^<>]+<\/mark>|<[^>]*(?=>)|(^|[^a-zA-Z\u00C0-\u017F])(eiusmod|sort|incididunt|dolor|única)(?=[^a-zA-Z\u00C0-\u017F]|$)|[\s\S]/yi;
const str = `dolor <div data-sort="1">
<h1 data-position="1" eiusmod="foo" >Lorem ipsum dolor sit amet</h1>
<div data="eiusmod"></div>
<strong>search here : consectetur adipiscing elit, </strong>
<div>
sed do <u>eiusmod</u> tempor <mark>incididunt</mark> ut única et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
</div>
Duis aute irure <i>dolor</i> dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div><div id="pa_3577" class="pa-title" data-title-en="" style="margin-left:0px;">1.</div><div class="pa-content" style="margin-left:62px;"><p>Con la única salvedad que expresaré adelante, comparto plenamente el contenido de esta Decisión unánime.</p></div>
document.js:613:8<br/>
dolor et <mark>dolor</mark> et dolor<br/>
<a>úúnica</a> púnica dolor et dolor et dolor<br/>`;
var result = '',
    array1;

// The sticky flag makes each exec() continue exactly where the previous
// match ended, so the loop consumes the whole input piece by piece.
while ((array1 = regex.exec(str)) !== null) {
  // Group 2 is only set when a word from the list was matched
  if (array1[2] !== undefined)
    result += array1[1] + '<mark>' + array1[2] + '</mark>';
  else
    result += array1[0];
}
document.write( result );
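
Since the question starts from an array, the same tokenizing idea can be sketched as a function that builds the pattern from a word list instead of hardcoding it. This is only an illustration: `markWords` is a hypothetical name, the words are assumed to contain no regex-special characters, and it shares the limitations described above (it is not an HTML parser).

```javascript
// Hypothetical wrapper around the sticky-regex idea described above.
// Assumes the words contain no regex-special characters.
function markWords(html, words) {
    var letter = "a-zA-Z\\u00C0-\\u017F";
    var pattern =
        "<mark>[^<>]+</mark>" +            // 1. an existing <mark>...</mark>
        "|<[^>]*(?=>)" +                   // 2. a tag, up to (not including) ">"
        "|(^|[^" + letter + "])(" + words.join("|") + ")(?=[^" + letter + "]|$)" + // 3. a listed word
        "|[\\s\\S]";                       // 4. any single character
    var regex = new RegExp(pattern, "yi");
    var result = "";
    var match;
    // Every alternative consumes at least one character, so the loop
    // always advances and ends when the input is exhausted.
    while ((match = regex.exec(html)) !== null) {
        result += match[2] !== undefined
            ? match[1] + "<mark>" + match[2] + "</mark>"
            : match[0];
    }
    return result;
}

console.log(markWords('<i data-sort="1">dolor</i> sort dolore', ['sort', 'dolor']));
```

The attribute `data-sort` and the longer word `dolore` are left alone, because the tag alternative swallows the attribute and the lookahead rejects a following letter.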

It doesn't seem to work with `dolor et dolor et dolor` and `púnica` — Julio, Aug 09 '18 at 18:49

Julio · Answer 3 · 2018-08-10T12:34:43.920

The right thing would be to use some HTML parser. However, I'll tempt you with a regular expression :-)

Replace by: $1$2<mark>$3</mark>

Demo on regex101.com

const regex = /((?:<[^>]*>[^<]*?)*?(?:(?!<mark>)<[^>]*(?=>))?)(\W|^)(eiusmod|sort|incididunt|dolor|única|feté)(?=\W|$)/gmu;

const subst = `$1$2<mark>$3</mark>`;

const str = `

única<div data-sort="1">

<p>dolor</p>
<p>única</p>

<h1 data-position="1" eiusmod="foo" >Lorem ipsum dolor sit amet</h1>
<div data="eiusmod"></div>
<strong>search here : consectetur adipiscing elit, </strong>
<div>
sed do <u>eiusmod</u> tempor <mark>incididunt</mark> ut dolor et dolor dolor magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
</div>
Duis aute irure <i>dolor</i> dolor in dolor dolor reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>

<div>
sed do <u>eiusmod</u> tempor <mark>incididunt</mark> ut única et única púnica magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
</div>
<div>fetén</div>`;

// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);

document.write(result);

The biggest difficulty was that JavaScript word boundaries do not work well with Unicode characters (so there were problems finding única but not púnica).
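
To illustrate the word-boundary problem: `\b` and `\W` only treat ASCII letters, digits and `_` as word characters, so an accented letter looks like a separator. The sketch below contrasts that with a boundary built from Unicode property escapes. `buildUnicodeWordRegex` is a hypothetical helper, not part of this answer, and it assumes an engine with lookbehind and `\p{…}` support (ES2018) and words without regex-special characters.

```javascript
// Hypothetical helper: match whole words using Unicode-aware boundaries
// instead of \b. Assumes ES2018 (lookbehind, \p{...}) and plain words.
function buildUnicodeWordRegex(words) {
    var wordChar = "[\\p{L}\\p{N}_]";
    return new RegExp(
        "(?<!" + wordChar + ")(?:" + words.join("|") + ")(?!" + wordChar + ")",
        "gu"
    );
}

var text = "púnica única fetén feté";

// ASCII \b sees a boundary between "p" and "ú" and between "é" and "n",
// so it marks the wrong occurrences and misses the standalone words.
console.log(text.replace(/\b(?:única|feté)\b/g, "[$&]"));
// p[única] única [feté]n feté

// The Unicode-aware version only marks the standalone words.
console.log(text.replace(buildUnicodeWordRegex(["única", "feté"]), "[$&]"));
// púnica [única] fetén [feté]
```

This only addresses word boundaries; it does nothing about tags or attributes, which still call for a parser.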

Yes, your solution works, but in PHP; I'm using JavaScript. In your link, you can select JavaScript. — Greg, Aug 09 '18 at 11:25
Oh yes, sorry, but it doesn't work when there is a word before the opening tag; see here: https://regex101.com/r/1qgUBI/1 — Greg, Aug 09 '18 at 11:58
And it doesn't work in this case, https://regex101.com/r/vt4x5O/1, because there are special characters. — Greg, Aug 09 '18 at 12:04
I updated my answer. BTW, things like that make regular expressions not the best tool for parsing html :-) — Julio, Aug 09 '18 at 12:30
@Greg I updated my answer. Seems to work fine with a complex input — Julio, Aug 09 '18 at 18:46
