Getting a string from website with regex without using external module

Question

Is there a way to get all the links and their text from the HTML file below? I've tried everything I could think of and read a lot of answers, but I still don't really get it.

<tr>
    <td><a href="pr_background-image.asp">background-image</a></td>
    <td>Specifies one or more background images for an element</td>
    <td>1</td>
</tr>

I want it to return the .asp link as well as the description below it. The newline character is my main problem: it shows up as \\r\\n

UPDATE: I don't want to use any external module, not even BeautifulSoup, just regex, because the thing I'm working on will be shared and there will be no point if users have to install something else.

Check out the [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) module for parsing HTML/XML files. — mdml, Jan 12 '16 at 01:34
As a rule of thumb, using regex to match HTML isn't recommended: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags Instead, I recommend that you look at the Python libraries that already do this for you, for example: http://stackoverflow.com/questions/17126686/extracting-data-from-html-with-python — Yuri-M-Dias, Jan 12 '16 at 01:35

score 0 · Answer 1 · edited May 23 '17 at 11:46

Using a regex for what you are trying to do is kind of hobbling; parsing the HTML and using XPath or DOM querying would be more readable.

On top of that, even without the newlines, writing a sufficiently general regex would be a bit tricky.

See this post for multiline regexp. With that, you'll probably want to use one capture group to grab the link and another for the td cells.
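The capture-group idea can be sketched with nothing but the standard `re` module. This is a minimal illustration, not a general HTML parser: it assumes every row looks like the sample in the question, and the names `html`, `ROW` and `rows` are made up for the example.

```python
import re

# Sample row from the question, written with Windows-style line endings.
html = (
    '<tr>\r\n'
    '    <td><a href="pr_background-image.asp">background-image</a></td>\r\n'
    '    <td>Specifies one or more background images for an element</td>\r\n'
    '    <td>1</td>\r\n'
    '</tr>\r\n'
)

# Group 1: the href, group 2: the link text, group 3: the next <td> cell.
# re.DOTALL lets "." match newlines, so the lazy ".*?" between the two
# cells can cross the line break.
ROW = re.compile(
    r'<td><a href="([^"]+)">(.*?)</a></td>.*?<td>(.*?)</td>',
    re.DOTALL,
)

rows = ROW.findall(html)
for href, text, description in rows:
    print(href, '|', text, '|', description)
```

`findall` returns one tuple per matching row, so the same pattern should pick up every row of a table that follows this layout. It will likely break on markup that deviates from it, for example extra attributes on the tags.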

answered Jan 12 '16 at 02:11 by huronbikes · edited by Community

score -1 · Answer 2 · edited Jan 17 '16 at 20:04

The easiest way to work with HTML in Python is BeautifulSoup or a similar module, and I recommend you look into it. If you want to stick with regex, you can allow for tabs, spaces, newlines, etc. between the two <td> tags in the following way:

<td><a href=\"(.+?)\">background-image<\/a><\/td>(?:\n|\r|\t|\ )*<td>(.+?)<\/td>
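As a rough sketch of how a pattern like this could be applied in Python: the version below swaps the literal `background-image` for a second capture group and uses `\s*` as shorthand for the whitespace alternation, so it is a variation on the pattern above rather than the same expression. It also assumes that the literal `\\r\\n` in the question comes from calling `str()` on the raw bytes of the downloaded page, and that the page is UTF-8. Neither is confirmed by the question. The `raw` bytes stand in for what `urllib.request.urlopen(url).read()` would return.

```python
import re

# Stand-in for the bytes a download would return (assumed, no network used).
raw = (
    b'<tr>\r\n'
    b'    <td><a href="pr_background-image.asp">background-image</a></td>\r\n'
    b'    <td>Specifies one or more background images for an element</td>\r\n'
    b'    <td>1</td>\r\n'
    b'</tr>\r\n'
)

# str(raw) would produce "b'<tr>\\r\\n ...'" with literal backslashes.
# Decoding gives real carriage returns and newlines that \s can match.
page = raw.decode('utf-8')  # encoding is an assumption

# \s* covers any mix of spaces, tabs, \r and \n between the two cells.
PATTERN = re.compile(r'<td><a href="(.+?)">(.+?)</a></td>\s*<td>(.+?)</td>')

matches = PATTERN.findall(page)
for href, name, description in matches:
    print(href, '|', name, '|', description)
```

If the text has already been turned into a string containing literal `\r\n` sequences, the pattern would need to match those backslash sequences instead, so decoding the bytes first is probably the simpler route.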

answered Jan 12 '16 at 01:34 by Tobias · edited by Mr Lister

I don't want to use BeautifulSoup because of the nature of the project @Tobias R – tushortz Jan 12 '16 at 08:31