Assume I have an HTML string containing the following code snippet.

... <img class="employee thumb" src="http://localhost/services/employee1.jpg" /> ... 

I want to check whether this tag is present and, if so, get the `src` URL. `<img class="employee thumb"` can be used to uniquely identify the tag.

How can I do this in Python?

Patrick Hofman
Yasitha
  • Why use regular expressions when [excellent HTML parsers](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) are available? `soup = BeautifulSoup(yourpage)`, then `image = soup.select('img.employee.thumb')`. – Martijn Pieters Mar 28 '14 at 10:34
  • Using a regexp to parse HTML may not be the best approach. This answer discusses why: http://stackoverflow.com/a/1732454/661140 – Roberto Mar 28 '14 at 10:37
  • Thanks for the info. I am getting the HTML using `page = urllib2.urlopen(url)` and `yourpage = page.read()`, but then I couldn't parse the HTML as you mentioned. Any thoughts? – Yasitha Mar 28 '14 at 12:04
  • Although [you could do so](http://stackoverflow.com/a/4234491/471272), I do not recommend that route. – tchrist Jun 06 '14 at 22:43
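
As the comments above suggest, an HTML parser tends to be more robust than a regular expression here. The following is only a minimal sketch of that idea using the standard library's `html.parser` rather than BeautifulSoup, and it assumes Python 3. The names `ImgSrcFinder` and `find_img_sources`, and the `logo.png` URL in the sample, are made up for illustration and do not come from the question.

```python
from html.parser import HTMLParser


class ImgSrcFinder(HTMLParser):
    """Collects the src of every <img> tag that carries all wanted classes."""

    def __init__(self, wanted_classes):
        super().__init__()
        self.wanted = set(wanted_classes)
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag != "img":
            return
        attr_map = dict(attrs)
        # Attribute values can be None (e.g. a bare `class`), hence the `or ""`.
        classes = set((attr_map.get("class") or "").split())
        if self.wanted <= classes and attr_map.get("src"):
            self.sources.append(attr_map["src"])


def find_img_sources(html, wanted_classes=("employee", "thumb")):
    """Return the src values of matching <img> tags; empty list if none."""
    finder = ImgSrcFinder(wanted_classes)
    finder.feed(html)
    finder.close()
    return finder.sources


# Hypothetical larger document; only the second <img> is from the question.
sample = (
    '<html><body><p>intro</p>'
    '<img class="logo" src="http://localhost/logo.png" />'
    '<img class="employee thumb" src="http://localhost/services/employee1.jpg" />'
    '</body></html>'
)
print(find_img_sources(sample))
# ['http://localhost/services/employee1.jpg']
```

An empty list means the tag is not present. Matching on the set of classes rather than the literal string `employee thumb` also accepts `class="thumb employee"`, which may or may not be what you want.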

1 Answer

Using a regular expression:

>>> import re
>>> html = '<img class="employee thumb" src="http://localhost/services/employee1.jpg" />'
>>> # First confirm the tag is present, then pull out every src value
>>> if re.search('img class="employee thumb"', html):
...     print(re.findall('src="(.*?)"', html, re.DOTALL))
...
['http://localhost/services/employee1.jpg']
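
Note that the check above looks for the class anywhere in the string and then collects every `src`, so on a larger page it would also return the `src` of unrelated tags. A rough sketch of one way to tie the two together is to match the whole `<img ...>` tag first and then read `src` from that match only. The helper name `employee_thumb_src` and the `logo.png` URL are made up for illustration, and the pattern assumes double-quoted attributes and no `>` inside attribute values.

```python
import re

# Match one complete <img ...> tag that contains the identifying class.
IMG_TAG = re.compile(r'<img\b[^>]*\sclass="employee thumb"[^>]*>')
# Read src from inside that tag only (the leading \s avoids e.g. data-src).
SRC_ATTR = re.compile(r'\ssrc\s*=\s*"([^"]*)"')


def employee_thumb_src(html):
    """Return the src of the first matching <img> tag, or None if absent."""
    tag = IMG_TAG.search(html)
    if tag is None:
        return None
    src = SRC_ATTR.search(tag.group(0))
    return src.group(1) if src else None


# Hypothetical larger document with an unrelated image in front.
page = (
    '<p><img class="logo" src="http://localhost/logo.png" /></p>'
    '<img class="employee thumb" src="http://localhost/services/employee1.jpg" />'
)
print(employee_thumb_src(page))
# http://localhost/services/employee1.jpg
```

This is still a regular expression over HTML, so it stays fragile against unusual markup; it only narrows the search to the one tag.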

Using lxml:

>>> from lxml import etree
>>> root = etree.fromstring("""
... <html>
...     <img class="employee thumb" src="http://localhost/services/employee1.jpg" />
... </html>
... """)
>>> # Select the src attribute by name instead of relying on attribute order
>>> print(root.xpath("//img[@class='employee thumb']/@src")[0])
http://localhost/services/employee1.jpg
Tanveer Alam
  • The `lxml` version isn't much use; it doesn't actually search for the `img` tag in a larger document. – Martijn Pieters Mar 28 '14 at 11:49
  • No, you still only test if `root` is the image tag. The OP has a larger chunk of HTML, not just containing the `<img>` tag. – Martijn Pieters Mar 28 '14 at 11:58
  • I think the input is in string format, as mentioned in the question. So I think the only concern is getting `src` if the class attribute is 'employee thumb'. – Tanveer Alam Mar 28 '14 at 12:02
  • No, the OP's first sentence is *Assume I am having a html string **containing the following code snippet***, emphasis mine. Note the `...` ellipsis in the HTML sample as well. – Martijn Pieters Mar 28 '14 at 12:05
  • Yeah, that is what I am saying; it is specifically mentioned in the question itself: "Assume I am having a html string containing the following code snippet. ..." – Tanveer Alam Mar 28 '14 at 12:08
  • The `<img>` tag is not the root of my HTML string. It is just a part of it, as mentioned. – Yasitha Mar 28 '14 at 12:26
  • @Yasitha, I have edited it; now `<img>` is not the root. – Tanveer Alam Mar 28 '14 at 13:12