How to extract content between tags in HTML using the grep command

Question

I want to write a grep command that extracts the content between h1 tags, irrespective of class and other attributes.

I tried

 grep -o '>.*</h1>' Email.txt

But it returned only three elements.

score 3 · Accepted Answer · answered Apr 25 '20 at 12:38

With GNU grep, you can use:

grep -oP '<h1(?:\s[^>]*)?>\K.*?(?=</h1>)' Email.txt

The -P option enables the PCRE regex engine, and -o prints only the matched text. The pattern matches:

<h1 - a literal <h1 string
(?:\s[^>]*)? - an optional non-capturing group that matches a whitespace char (\s) followed by 0+ chars other than >
> - a > char
\K - the match reset operator, which discards the text matched so far from the overall match
.*? - any 0+ chars other than line break chars, as few as possible
(?=</h1>) - a positive lookahead that matches a location immediately followed by the </h1> substring.
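
To see the pattern in action, here is a minimal sketch that runs it against a small sample file. The file contents and the extract_h1 helper name are made up for illustration (they are not from the question), and it assumes a GNU grep built with PCRE support.

```shell
# Hypothetical sample input; the real Email.txt from the question is not shown.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<h1>First</h1>
<h1 class="title" id="x">Second</h1>
<div><h1 id="c">Third</h1></div>
<p>No heading on this line</p>
EOF

# extract_h1 is an illustrative wrapper around the command from the answer.
# -o prints only the matched part; -P enables PCRE (needed for \K and lookaheads).
extract_h1() {
  grep -oP '<h1(?:\s[^>]*)?>\K.*?(?=</h1>)' "$1"
}

extract_h1 "$sample"
# prints:
# First
# Second
# Third
```

Note that grep works line by line, so a heading whose opening and closing tags sit on different lines will not be matched by this approach.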
