(update) I have already solved the issue without regex by creating a new DOM object and iterating over its child nodes. It was my senior developer who wanted a regex solution. His reasoning was that the function will be called many times, and we don't want to create and manipulate a new DOM node each time just to get the first line of the HTML text.
(background) I have an HTML text from which I need to retrieve the first line. (If it is just white space, like <p></p>, I need to look at the next line. Let's assume I already have the logic to check whether a line is effectively empty.)
(goal) Find the text inside every p or header tag (h1, h2, ...).
(condition) The HTML is well formed. Lines are separated only by p or h tags. p and h tags do not occur together.
For example, if the input is:
<ul>
<li><p>hello1</p></li>
<li><p>hello2</p></li>
</ul>
<h3> <strong>hello3</strong> </h3>
<h4> <strong><hello4></strong> </h4>
<hr>
<p> <a href="sth">sth</a> </p>
the output I'd like to get is
<p>hello1</p>
<p>hello2</p>
<strong>hello3</strong>
<strong><hello4></strong>
<a href="sth">sth</a>
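For reference, here is a minimal sketch of a regex that collects the inner content of every p or h tag under the conditions above (well-formed HTML, p and h tags never nested in each other). The name `getAllLines` and the sample string are assumptions made for illustration. Note that it returns only what is inside each tag, so the first two entries come back as hello1 and hello2 without the surrounding <p>.

```javascript
// Sketch only: `getAllLines` is a hypothetical name, not part of the code in this post.
// Assumes well-formed HTML where p and h tags are never nested inside each other.
function getAllLines(html) {
  // \1 forces the closing tag to match the opening one;
  // [\s\S] is used instead of . so that content spanning newlines still matches.
  const reLine = /<(p|h[1-6])\b[^>]*>([\s\S]*?)<\/\1>/gi;
  return Array.from(html.matchAll(reLine), (m) => m[2].trim());
}

const sample = [
  '<ul>',
  '<li><p>hello1</p></li>',
  '<li><p>hello2</p></li>',
  '</ul>',
  '<h3> <strong>hello3</strong> </h3>',
  '<hr>',
  '<p> <a href="sth">sth</a> </p>',
].join('\n');

console.log(getAllLines(sample).join('\n'));
// prints:
// hello1
// hello2
// <strong>hello3</strong>
// <a href="sth">sth</a>
```

The `\b` after the tag name keeps tags such as <pre> from being mistaken for <p>, and <hr> is skipped because the h must be followed by a digit.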
I need to solve this problem with regex.
Here is what I have so far; it is faulty. I'm posting this question because I'm not sure whether I should fix this code or switch to a completely different approach/function.
function isAllWhiteSpace(txt) {
  if (txt) {
    txt = txt.replace(/\s/g, '').replace(/&nbsp;/g, ''); // remove white space and &nbsp; entities
    txt = txt.replace(/<[^>]*>/g, ''); // remove tags (this also covers self-closing ones)
    if (txt.length) return false;
  }
  return true;
}
function getFirstLine(txt) {
  const reFirstLine = /<(p|h3|h4)>(.*?)<(\/p|\/h3|\/h4)>/;
  while (txt) {
    const m = reFirstLine.exec(txt);
    if (m) {
      if (isAllWhiteSpace(m[2])) { // if all white text, search for the next p or h tag
        // this is faulty.
        // I omitted cases where some string comes before <p> or <h>, like <ul><li><p>...
        txt = txt.slice(m[0].length + 1);
      } else {
        return m[2];
      }
    }
  }
  return '';
}
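One possible direction for repairing the loop above, offered as a sketch rather than a definitive fix: use a global regex and let exec() advance through the string, which avoids slicing by the wrong offset when text precedes the match and ends the loop when nothing matches. The names `isBlankHtml` and `getFirstLineFixed` are hypothetical, and the sketch assumes the same well-formed input as described earlier. Unlike the original, it trims the returned content.

```javascript
// Sketch only: `isBlankHtml` and `getFirstLineFixed` are hypothetical names.
function isBlankHtml(fragment) {
  // Drop tags, then &nbsp; entities and whitespace; anything left counts as content.
  return fragment.replace(/<[^>]*>/g, '').replace(/&nbsp;|\s/g, '') === '';
}

function getFirstLineFixed(html) {
  // A fresh regex per call keeps its lastIndex state local to this invocation.
  // \1 pairs each closing tag with its opening tag; [\s\S] also matches newlines.
  const reLine = /<(p|h[1-6])\b[^>]*>([\s\S]*?)<\/\1>/gi;
  let m;
  // With the g flag, exec() resumes right after the previous match, so no manual
  // slicing is needed, and it returns null once there are no more matches.
  while ((m = reLine.exec(html)) !== null) {
    if (!isBlankHtml(m[2])) return m[2].trim();
  }
  return '';
}

console.log(getFirstLineFixed('<ul><li><p> </p></li></ul><h3> <strong>hello3</strong> </h3>'));
// prints: <strong>hello3</strong>
```

Because the regex is created inside the function, repeated calls do not share state; hoisting it to module scope would save the construction cost but would then require resetting lastIndex before each call.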