Find all p tags in an HTML string (regex) (JavaScript)

Question

(update) I already solved the issue without regex by creating a new DOM object and iterating over its child nodes. It was my senior developer who wanted a regex solution. His reasoning was that the function will be called many times, and we don't want to create and manipulate a new DOM node just to get the first line of the HTML text.

(background) I have an HTML text from which I need to retrieve the first line. (If it is just white space, like <p></p>, I need to look at the next line. Assume I already have the logic to check whether a line is effectively empty.)

(goal) Find the text inside every p or header tag (h1, h2, ...).

(condition) The HTML is well formed. Lines are separated only by p or h tags. p and h tags don't occur together.

For example, if the input is:

<ul>
   <li><p>hello1</p></li>
   <li><p>hello2</p></li>
</ul>
<h3> <strong>hello3</strong> </h3>
<h4> <strong><hello4></strong> </h3>
<hr>
<p> <a href="sth">sth</a> </p>

then the output I'd like to get is:

<p>hello1</p>
<p>hello2</p>
<strong>hello3</strong>
<strong><hello4></strong>

<a href="sth">sth</a>

I need to solve this problem with regex.

Here is what I have so far; it is faulty. I posted this question because I wasn't sure whether I should modify it or switch to a whole new approach/function.

function isAllWhiteSpace(txt) {
  if (txt) {
    txt = txt.replace(/\s/g, '').replace(/&nbsp/g, '') // remove white space
    txt = txt.replace(/<[^>]*>|<[^>\/]\/>/g, '') // remove tag
    if (txt.length) return false;
  }
  return true;
}

function getFirstLine(txt) {
  const reFirstLine = /<(p|h3|h4)>(.*?)<(\/p|\/h3|\/h4)>/;
  while (txt) {
    const m = reFirstLine.exec(txt);
    if (m) {
      if (isAllWhiteSpace(m[2])) { // if all white text, search for the next p or h tag
        // this is faulty.
        // I omitted cases where some string comes before <p> or <h>, like <ul><li><p>...
        txt = txt.slice(m[0].length + 1);
      } else {
        return m[2];
      }
    }
  }
  return '';
}

Have you tried document.getElementsByTagName("p")? That gives you an array-like collection; then use arr[index].innerText — Hari, May 27 '20 at 15:31
Possible duplicate of [regex-select-all-text-between-tags](https://stackoverflow.com/questions/7167279/regex-select-all-text-between-tags) — , May 27 '20 at 15:32
Why do you need to use a regex (it's a bad idea)? Mandatory link to [the legendary answer](https://stackoverflow.com/a/1732454/7393478) — Kaddath, May 27 '20 at 15:32
[Parsing HTML with regex is a hard job](https://stackoverflow.com/a/4234491/372239). HTML and regex are not good friends. Use a parser; it is simpler, faster and much more maintainable. — Toto, May 28 '20 at 11:48

score 1 · Answer 1 · answered May 27 '20 at 16:37

You can do it by using querySelectorAll to find whichever tags you want (separated by commas in the selector), then looping over the items:

Get the trimmed content of the item.
Ignore the item if its content is only spaces.
Check the parent element: if the item is a direct child of BODY, add its content without the outer tags.
If the item is not a direct child of BODY, add its content including the outer tags.

let items = document.querySelectorAll('p, h3, h4');
let output = '';
items.forEach(item => {
    let content = item.innerHTML.trim();
    if(content.length > 0) {
        output += '\n';
        output += (item.parentNode.tagName == 'BODY') ? content : item.outerHTML;
    }
});
console.log(output);

<ul>
   <li><p>hello1</p></li>
   <li><p>hello2</p></li>
</ul>
<h3> <strong>hello3</strong> </h3>
<h4> <strong><hello4></strong> </h3>
<hr>
<p> <a href="sth">sth</a> </p>

Output:

<p>hello1</p>
<p>hello2</p>
<strong>hello3</strong>
<strong><hello4></hello4></strong>
<a href="sth">sth</a>

Now you can add some more code to deal with <hello4></hello4>
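
One possible way to handle it, sketched under assumptions: the parser treats <hello4> as an unknown element and closes it on serialization, so if it was meant as literal text it could be escaped before the HTML reaches the DOM. The helper name escapeUnknownTags and the KNOWN_TAGS allow-list are hypothetical, not part of the answer above, and the regex assumes attribute values never contain a ">" character.

```javascript
// Assumption: this allow-list covers every tag the input may legitimately contain.
const KNOWN_TAGS = new Set([
  'ul', 'li', 'p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6',
  'strong', 'a', 'hr', 'br',
]);

// Hypothetical helper: rewrite any tag that is not on the allow-list as text
// by escaping its angle brackets; known tags are returned unchanged.
function escapeUnknownTags(html) {
  return html.replace(
    /<(\/?)([a-zA-Z][\w-]*)([^>]*)>/g,
    (whole, slash, name, rest) =>
      KNOWN_TAGS.has(name.toLowerCase())
        ? whole
        : `&lt;${slash}${name}${rest}&gt;`
  );
}

console.log(escapeUnknownTags('<h4> <strong><hello4></strong> </h4>'));
// <h4> <strong>&lt;hello4&gt;</strong> </h4>
```

With the input escaped this way, innerHTML should report the text as &lt;hello4&gt; rather than inventing a closing tag.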

score 0 · Answer 2 · answered May 28 '20 at 09:22

Solved by using the g flag, so that match returns every p/h block instead of only the first one. It seems to catch most cases.

function getFirstLine(txt) {
  if (txt) {
    txt = txt.trim();
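    // h[1-6] limits the match to real heading tags; "." does not cross line
    // breaks, so each block must sit on a single line
    const reEachLine = /<(p|h[1-6])>(.*?)<\/(p|h[1-6])>/g;
    // strips the outermost tag pair from one matched block
    const reInnerText = /<[^>]*>([\s\S]*)<\/[^>]*>/;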
    const m = txt.match(reEachLine);
    if (m) {
      for (let i = 0; i < m.length; i += 1) {
        const innerText = m[i].trim().match(reInnerText);
        if (innerText) {
          if (!isAllWhiteSpace(innerText[1])) return innerText[1];
        } else {
          return 'error loading first line - 1'; // no innertext exists
        }
      }
      return ''; // all white text
    }
    return ''; // no match for <p>, <h*>
  }
  return ''; // txt was empty or missing
}
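
A minimal sketch of a variant, assuming the conditions stated in the question (well-formed HTML, p and h tags not nested in each other). It uses matchAll so the capture groups stay available with the g flag, a backreference so the closing tag must match the opening one, and [\s\S] so a block may span several lines. The names getLines and sample are hypothetical, and isAllWhiteSpace is re-implemented here only to keep the block self-contained. Unlike the desired output in the question, it returns the inner content of every block, including p tags nested in a list.

```javascript
// Treats a string as empty when nothing is left after removing
// &nbsp; entities, tags, and whitespace.
function isAllWhiteSpace(txt) {
  if (!txt) return true;
  const stripped = txt
    .replace(/&nbsp;?/g, '')  // non-breaking-space entities
    .replace(/<[^>]*>/g, '')  // tags
    .replace(/\s/g, '');      // remaining whitespace
  return stripped.length === 0;
}

// Returns the trimmed inner HTML of every <p> or <h1>..<h6> block, in order.
function getLines(html) {
  // \1 forces the closing tag to repeat the opening tag name.
  const reLine = /<(p|h[1-6])(?:\s[^>]*)?>([\s\S]*?)<\/\1>/gi;
  return Array.from(html.matchAll(reLine), (m) => m[2].trim());
}

// Returns the first block that is not effectively empty, or '' if none.
function getFirstLine(html) {
  if (typeof html !== 'string') return '';
  for (const line of getLines(html)) {
    if (!isAllWhiteSpace(line)) return line;
  }
  return '';
}

const sample = [
  '<ul>',
  '   <li><p>hello1</p></li>',
  '   <li><p>hello2</p></li>',
  '</ul>',
  '<h3> <strong>hello3</strong> </h3>',
  '<hr>',
  '<p> <a href="sth">sth</a> </p>',
].join('\n');

console.log(getFirstLine(sample)); // hello1
```

Because the pattern requires a digit after h, <hr> is skipped without any special handling.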
