Is there a truly universal wildcard in Grep?

Question

Really basic question here. So I'm told that a dot . matches any character EXCEPT a line break. I'm looking for something that matches any character, including line breaks.

All I want to do is capture all the text in a web page between two specific strings, stripping the header and the footer. Something like HEADER TEXT(.+)FOOTER TEXT, and then extract what's in the parentheses. But I can't find a way to include all text AND line breaks between header and footer. Does this make sense? Thanks in advance!

score 7 · Accepted Answer · answered Dec 13 '09 at 19:16

When I need to match several characters, including line breaks, I use:

[\s\S]*?

Note that I'm using a non-greedy pattern (*?), so the match stops at the first occurrence of whatever follows it.
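A minimal sketch of this pattern in Python, assuming the page has already been read into a string. The sample text and the helper name `between` are made up for illustration; only the delimiters come from the question. It also shows why the non-greedy quantifier matters when the footer string occurs more than once.

```python
import re

# Hypothetical sample page; only the delimiters come from the question.
page = "HEADER TEXT\nline one\nline two\nFOOTER TEXT\ntrailer\nFOOTER TEXT\n"

def between(text, lazy=True):
    """Return the text between the delimiters, or None if they are absent."""
    quantifier = "*?" if lazy else "*"
    match = re.search(r"HEADER TEXT([\s\S]" + quantifier + r")FOOTER TEXT", text)
    return match.group(1) if match else None

print(repr(between(page)))              # stops at the first FOOTER TEXT
print(repr(between(page, lazy=False)))  # runs on to the last FOOTER TEXT
```

The class [\s\S] means "whitespace or non-whitespace", so it matches every character regardless of how the engine treats the dot.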

Rubens Farias

Thanks guys! What a friendly, useful site. I forgot to mention that I was using grep search in BBEdit, this works wonderfully. You all rock! – Tom B Dec 13 '09 at 19:39

Greg Bacon · Answer 2 · 2009-12-13T19:29:39.230

You could do it with Perl:

$ perl -ne 'print if /HEADER TEXT/ .. /FOOTER TEXT/' file.html

To print only the text between the delimiters, use

$ perl -000 -lne 'print $1 while /HEADER TEXT(.+?)FOOTER TEXT/sg' file.html

The /s switch makes the regular expression matcher treat the entire string as a single line, which means dot matches newlines, and /g means match as many times as possible.
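For readers who prefer Python, a rough equivalent of the /s and /g behaviour might look like the sketch below. It assumes that `re.DOTALL` plays the role of /s and `re.findall` the role of the `while ... /g` loop; the sample text is invented.

```python
import re

# Hypothetical input with two delimited sections.
text = "HEADER TEXT\nfirst\nFOOTER TEXT\nnoise\nHEADER TEXT\nsecond\nFOOTER TEXT\n"

pattern = r"HEADER TEXT(.+?)FOOTER TEXT"

# Without DOTALL the dot stops at a newline, so nothing matches here.
no_flag = re.findall(pattern, text)

# With DOTALL the dot also matches "\n"; findall returns every match.
with_flag = re.findall(pattern, text, flags=re.DOTALL)

print(no_flag)
print(with_flag)
```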

The examples above assume you're cranking on HTML files on the local disk. If you need to fetch them first, use get from LWP::Simple:

$ perl -MLWP::Simple -le '$_ = get "http://stackoverflow.com";
                          print $1 while m!<head>(.+?)</head>!sg'

Please note that parsing HTML with regular expressions as above does not work in the general case! If you're working on a quick-and-dirty scanner, fine, but for an application that needs to be more robust, use a real parser.
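As an illustration of the "real parser" route, here is a small sketch using `html.parser` from the Python standard library. The class name `HeadText` and the sample document are hypothetical; a production scraper would likely need more handling (nested documents, scripts, character references).

```python
from html.parser import HTMLParser

class HeadText(HTMLParser):
    """Collect the text that appears inside <head>...</head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False

    def handle_data(self, data):
        # Keep only the text seen while inside <head>.
        if self.in_head:
            self.chunks.append(data)

# Hypothetical document used only for illustration.
html = "<html><head><title>Demo</title></head><body><p>Body</p></body></html>"
parser = HeadText()
parser.feed(html)
parser.close()
head_text = "".join(parser.chunks)
print(head_text)
```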

Jonathan Leffler · Answer 3 · 2009-12-13T19:33:34.773

By definition, grep looks for lines that match: it reads a line, checks whether it matches the pattern, and prints the line if it does.

One possible way to do what you want is with sed:

sed -n '/HEADER TEXT/,/FOOTER TEXT/p' "$@"

This prints from the first line that matches 'HEADER TEXT' to the next line that matches 'FOOTER TEXT', and repeats for any later header/footer pair; the '-n' suppresses sed's default printing of every line. It won't work well if the header and footer text appear on the same line.
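The range logic can be sketched in Python as a small state machine. This is only an approximation of what sed does, and the function name `print_range` is made up. Like sed, it tests the end pattern only on lines after the one that opened the range, which is why a header and footer on the same line cause trouble.

```python
import re

def print_range(lines, start=r"HEADER TEXT", end=r"FOOTER TEXT"):
    """Yield lines from one matching `start` through the next line matching `end`."""
    inside = False
    for line in lines:
        if not inside:
            if re.search(start, line):
                inside = True
                yield line  # the end pattern is only tested on later lines
        else:
            yield line
            if re.search(end, line):
                inside = False

# Hypothetical inputs for illustration.
sample = ["intro", "HEADER TEXT", "a", "b", "FOOTER TEXT", "outro"]
same_line = ["HEADER TEXT x FOOTER TEXT", "after", "more"]

print(list(print_range(sample)))
print(list(print_range(same_line)))  # the range never closes
```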

To do what you want, I'd probably use perl (but you could use Python if you prefer). I'd consider slurping the whole file, and then use a suitably qualified regex to find the matching portions of the file. However, the Perl one-liner given by '@gbacon' is an almost exact transliteration into Perl of the 'sed' script above and is neater than slurping.

score 2 · Answer 4 · answered Dec 13 '09 at 19:11

The man page of grep says:

grep, egrep, fgrep, rgrep - print lines matching a pattern

grep is not made for matching more than a single line. You should try to solve this task with perl or awk.

tangens

score 2 · Answer 5 · answered Aug 09 '11 at 12:05

As this is tagged with 'bbedit', and BBEdit supports Perl-style pattern modifiers, you can allow the dot to match line breaks with the switch (?s):

(?s).

will match ANY character. And yes, (?s).+ will match the whole text.
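The same inline modifier exists in Python's `re` module, so the idea can be tried outside BBEdit. This is a sketch with invented sample text; it exercises Python's engine, not BBEdit's, so behaviour there is assumed to be analogous rather than verified.

```python
import re

# Hypothetical sample text.
text = "HEADER TEXT\nbody line 1\nbody line 2\nFOOTER TEXT"

# (?s) at the start of the pattern turns on dot-matches-newline for the whole pattern.
match = re.search(r"(?s)HEADER TEXT(.+?)FOOTER TEXT", text)
body = match.group(1) if match else None
print(repr(body))

# As the answer says, (?s).+ on its own consumes the entire text.
whole = re.fullmatch(r"(?s).+", text) is not None
print(whole)
```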

kaidoh

phtrivier · Answer 6 · 2009-12-13T19:53:46.560

1

As pointed out elsewhere, grep will work for single-line stuff.

For multiple lines (in Ruby with Regexp::MULTILINE, or in Python, awk, sed, whatever), "\s" should also capture line breaks, so

HEADER TEXT(.*\s*)FOOTER TEXT

might work ...
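A quick Python check of when this pattern does and does not match. The sample strings are invented, and `re.DOTALL` is assumed to stand in for Ruby's Regexp::MULTILINE, which also lets the dot match newlines.

```python
import re

pattern = r"HEADER TEXT(.*\s*)FOOTER TEXT"

# Hypothetical inputs: the body is either one line or several lines.
one_line_body = "HEADER TEXT rest of line\nFOOTER TEXT"
multi_line_body = "HEADER TEXT\nfirst\nsecond\nFOOTER TEXT"

# Without a dot-matches-newline flag, .* covers one line and \s* the break after it.
short_match = re.search(pattern, one_line_body)
long_match = re.search(pattern, multi_line_body)

# With DOTALL, .* can span the line breaks itself.
long_dotall = re.search(pattern, multi_line_body, flags=re.DOTALL)

print(short_match is not None, long_match is not None, long_dotall is not None)
```

So without the flag the pattern only handles a body that fits on the header's own line; a body spanning several non-blank lines needs the flag or a class like [\s\S].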

edited Dec 13 '09 at 19:53

answered Dec 13 '09 at 19:09

phtrivier

You'd have to be reading the file in a mode that scans multiple lines into memory for that to work. – Jonathan Leffler Dec 13 '09 at 19:35
Thanks, I added how you would do that in Ruby. IIRC, that's /g in perlish, isn't it ? – phtrivier Dec 13 '09 at 19:54

score 0 · Answer 7 · answered Dec 14 '09 at 00:02

Here's one way to do it with gawk, if you have it:

awk -vRS="FOOTER" '/HEADER/{gsub(/.*HEADER/,"");print}' file
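The record-splitting idea translates to Python roughly as follows. This is a sketch, not a line-for-line port: the function name `sections` is hypothetical, and it returns the pieces as a list instead of printing them.

```python
import re

def sections(text, header="HEADER", footer="FOOTER"):
    """Split on `footer`, keep records containing `header`, strip through the header."""
    out = []
    for record in text.split(footer):
        if header in record:
            # Greedy .* with DOTALL drops everything up to the last header in the record.
            cleaned = re.sub(".*" + re.escape(header), "", record,
                             count=1, flags=re.DOTALL)
            out.append(cleaned)
    return out

# Hypothetical input with two delimited sections.
text = "junk\nHEADER\none\nFOOTER\nmore junk\nHEADER\ntwo\nFOOTER\ntail\n"
print(sections(text))
```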

ghostdog74
