-3

Is there a way to get all the links and their text from the HTML file below? I've tried many approaches and read a lot of answers, but I still don't really get it.

<tr>
    <td><a href="pr_background-image.asp">background-image</a></td>
    <td>Specifies one or more background images for an element</td>
    <td>1</td>
</tr>

I want it to return the .asp link as well as the description below it. The newline character is my main problem: it shows up as \\r\\n

UPDATE: I don't want to use any external module, not even BeautifulSoup, just regex, because the thing I'm working on will be shared and there will be no point if users have to install something else.

tushortz
  • Check out the [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) module for parsing HTML/XML files. – mdml Jan 12 '16 at 01:34
  • As a rule of thumb, using regex to match HTML isn't recommended: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags Instead, I recommend looking at the Python libraries that already do this for you, for example: http://stackoverflow.com/questions/17126686/extracting-data-from-html-with-python – Yuri-M-Dias Jan 12 '16 at 01:35
  • Sure. Use an HTML parser and learn XPath. – Todd A. Jacobs Jan 12 '16 at 01:36

2 Answers

0

Using a regex for this is rather limiting; parsing the HTML and querying it with XPath or the DOM would be more readable.
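Since the question rules out third-party installs, here is a minimal sketch of the parsing route using only `html.parser` from the standard library. The `RowParser` class and its attribute names are made up for illustration, and the sample string simply repeats the row from the question with `\r\n` line endings.

```python
from html.parser import HTMLParser


class RowParser(HTMLParser):
    """Hypothetical helper: collect the first link and the cell texts of each <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []        # finished rows as (href, [cell texts])
        self._href = None     # first href seen in the current row
        self._cells = None    # None while we are outside a <tr>
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._href, self._cells = None, []
        elif tag == "td" and self._cells is not None:
            self._cells.append("")
            self._in_td = True
        elif tag == "a" and self._in_td and self._href is None:
            self._href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr" and self._cells is not None:
            self.rows.append((self._href, [c.strip() for c in self._cells]))
            self._cells = None
            self._in_td = False

    def handle_data(self, data):
        # Text between cells (including \r\n and indentation) is ignored.
        if self._in_td:
            self._cells[-1] += data


html = (
    '<tr>\r\n'
    '    <td><a href="pr_background-image.asp">background-image</a></td>\r\n'
    '    <td>Specifies one or more background images for an element</td>\r\n'
    '    <td>1</td>\r\n'
    '</tr>\r\n'
)

parser = RowParser()
parser.feed(html)
parser.close()

for href, cells in parser.rows:
    print(href, "|", cells[1])
```

Because only text inside a `<td>` is collected, the line endings between cells never reach the result.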

On top of that, even without the newlines, writing a sufficiently general regex would be a bit tricky.

See this post for multiline regexp. With that, you'll probably want to use one capture group to grab the link and another for the td cells.
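A rough sketch of that idea with Python's `re` module follows. The pattern and the name `ROW_RE` are my own guess at what "general enough" might look like, not something taken from the linked post; it assumes the link sits in the first cell and the description in the next one, as in the question's row.

```python
import re

# re.DOTALL lets "." match newlines inside a cell; \s* absorbs \r\n and indentation between tags.
ROW_RE = re.compile(
    r'<td>\s*<a\s+href="([^"]+)"[^>]*>(.*?)</a>\s*</td>'  # group 1: link, group 2: link text
    r'\s*<td>(.*?)</td>',                                   # group 3: description cell
    re.DOTALL,
)

html = (
    '<tr>\r\n'
    '    <td><a href="pr_background-image.asp">background-image</a></td>\r\n'
    '    <td>Specifies one or more background images for an element</td>\r\n'
    '    <td>1</td>\r\n'
    '</tr>\r\n'
)

# findall returns one (link, text, description) tuple per matching row.
matches = ROW_RE.findall(html)
for link, text, description in matches:
    print(link, text, description, sep=" | ")
```

This will still break on nested tables, attributes in a different order, or single-quoted `href` values, which is the usual argument for a parser.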

Community
-1

The easiest way to work with HTML in Python is BeautifulSoup or a similar module, and I recommend you look into it. If you want to stick with regex, you can allow for tabs, spaces, newlines, etc. between the two <td> tags in the following way:

<td><a href=\"(.+?)\">background-image<\/a><\/td>(?:\n|\r|\t|\ )*<td>(.+?)<\/td>
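To show how that pattern might be used from Python, here is a small sketch. It rests on an assumption the question does not confirm: that the page was read as bytes and then turned into text with `str()`, which would explain the literal `\\r\\n`. Decoding the bytes instead gives real line endings that the whitespace group can match. The variable names are illustrative only.

```python
import re

# Pattern from the answer, unchanged. The escaped quotes, slashes and space are harmless in Python.
PATTERN = re.compile(
    r'<td><a href=\"(.+?)\">background-image<\/a><\/td>(?:\n|\r|\t|\ )*<td>(.+?)<\/td>'
)

# Assumption: the page arrived as bytes, e.g. from urllib.request.urlopen(...).read().
raw = (
    b'<tr>\r\n'
    b'    <td><a href="pr_background-image.asp">background-image</a></td>\r\n'
    b'    <td>Specifies one or more background images for an element</td>\r\n'
    b'    <td>1</td>\r\n'
    b'</tr>\r\n'
)

# str(raw) would give the repr, with a backslash followed by "r" and "n" as visible characters.
# decode() gives text containing actual carriage returns and line feeds.
text = raw.decode("utf-8")

match = PATTERN.search(text)
if match:
    link, description = match.groups()
    print(link)
    print(description)
```

Note that the pattern hard-codes `background-image` as the link text, so it only finds that one row.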
Mr Lister
Tobias