I want to write a grep command that extracts the content between h1 tags, irrespective of class and other attributes.

I tried

 grep -o '>.*</h1>' Email.txt

But it returned only three elements.

1 Answer

With GNU grep, you may use

grep -oP '<h1(?:\s[^>]*)?>\K.*?(?=</h1>)' Email.txt

The -P option enables the PCRE regex engine, and the pattern matches:

  • <h1 - the literal string <h1
  • (?:\s[^>]*)? - an optional non-capturing group (1 or 0 occurrences) matching a whitespace char (\s) followed by 0+ chars other than >
  • > - a > char
  • \K - the match reset operator, which discards the text matched so far from the overall match
  • .*? - any 0+ chars other than line break chars, as few as possible
  • (?=</h1>) - a positive lookahead that matches a location immediately followed by the </h1> substring.
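
To see the pattern in action, here is a minimal sketch that runs it against a small made-up file. The sample content and the variable names `sample` and `result` are assumptions for illustration, since the real Email.txt is not shown in the question.

```shell
# Hypothetical sample input; the actual Email.txt from the question is not available.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<h1>Plain heading</h1>
<h1 class="title" id="top">Heading with attributes</h1>
<p>Not a heading</p>
<h1>First</h1><h1 style="color:red">Second</h1>
EOF

# Requires GNU grep built with PCRE support (-P).
# -o prints each match on its own line; \K drops the opening tag from the output.
result=$(grep -oP '<h1(?:\s[^>]*)?>\K.*?(?=</h1>)' "$sample")
printf '%s\n' "$result"

rm -f "$sample"
```

Because grep works line by line and . does not match a line break, this only finds h1 elements whose opening and closing tags are on the same line.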
Wiktor Stribiżew