9

I am trying to use BeautifulSoup to parse a table from a website. (I cannot share the website's source code because its use is restricted.)

I am trying to extract the data only if it has the following two tags with these specific attribute values:

td, width=40%
tr, valign=top

In other words, I only want data that matches both of these tag and attribute combinations.

I found a discussion on using multiple tags here, but it covers only tags, not attributes. I tried to extend that code using the same list-based logic, but I don't think the result is what I want:

 my_soup=soup.find_all(['td',{"width":"40%"},'tr',{'valign':'top'}])

To summarize, my question is how to pass multiple tags, each with its own specific attribute, to find_all so that the result 'ands' both tags.

PagMax
  • did you solve it? – noveleven Dec 27 '18 at 08:39
  • I just posted a bounty above, but instead of requiring both tags, as the OP wants, I am interested in a solution using a single `soup.find_all()` call that finds all tags that are **either** `td` or `tr` and carry the corresponding attributes being asked for, if that makes any sense. – InfiniteFlash Aug 27 '19 at 23:00
  • As stated in the bounty, I am interested in preserving the order of the matches. – InfiniteFlash Aug 27 '19 at 23:13
  • I found an answer here, after a long search. https://stackoverflow.com/a/40305890/5874001 – InfiniteFlash Aug 28 '19 at 01:00

2 Answers

4

You can pass compiled regular expressions (re.compile objects) to soup.find_all:

import re
from bs4 import BeautifulSoup as soup
html = """
  <table>
    <tr style='width:40%'>
      <td style='align:top'></td>
    </tr>
  </table>
"""
results = soup(html, 'html.parser').find_all(re.compile('td|tr'), {'style':re.compile('width:40%|align:top')})

Output:

[<tr style="width:40%">
   <td style="align:top"></td>
 </tr>, <td style="align:top"></td>]

Because the re.compile objects specify the accepted tag names and style values, find_all returns every tr or td tag whose inline style attribute contains either width:40% or align:top.

This approach can be extended to filter on several attributes at once:

html = """
 <table>
   <tr style='width:40%'>
    <td style='align:top' class='get_this'></td>
    <td style='align:top' class='ignore_this'></td>
  </tr>
</table>
"""
results = soup(html, 'html.parser').find_all(re.compile('td|tr'), {'style':re.compile('width:40%|align:top'), 'class':'get_this'})

Output:

[<td class="get_this" style="align:top"></td>]
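One caveat with the regex filter above: it accepts any of the listed tags with any of the listed style values, so it does not tie a particular tag to a particular value. If each tag should be paired with its own attribute, one option is to pass a function to find_all. The following is only a minimal sketch, assuming the values are real `width` and `valign` attributes (as written in the question) rather than inline styles; the helper name `wanted` and the sample HTML are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not the OP's page.
html = """
<table>
  <tr valign="top">
    <td width="40%">keep</td>
    <td width="60%">skip</td>
  </tr>
  <tr valign="bottom">
    <td valign="top">skip</td>
  </tr>
</table>
"""

def wanted(tag):
    # Pair each tag name with its own attribute and value.
    if tag.name == "td":
        return tag.get("width") == "40%"
    if tag.name == "tr":
        return tag.get("valign") == "top"
    return False

soup = BeautifulSoup(html, "html.parser")
matches = soup.find_all(wanted)  # results come back in document order
names = [tag.name for tag in matches]
print(names)
```

Since find_all walks the tree once, the matching `tr` and `td` elements come back interleaved in the order they appear in the document, which two separate find_all calls would not give you.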

Edit 2: Simple recursive solution:

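import bs4
from bs4 import BeautifulSoup as soup
def get_tags(d, params):
  # Yield d when any attribute listed for its tag name in params matches.
  # bs4 stores 'class' as a list by default, so test membership there; compare other attributes directly.
  if any((lambda x: b in x if a == 'class' else b == x)(d.attrs.get(a, [])) for a, b in params.get(d.name, {}).items()):
     yield d
  # Recurse into child tags only, skipping text nodes.
  for i in filter(lambda x: x != '\n' and not isinstance(x, bs4.element.NavigableString), d.contents):
     yield from get_tags(i, params)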

html = """
 <table>
  <tr style='align:top'>
    <td style='width:40%'></td>
    <td style='align:top' class='ignore_this'></td>
 </tr>
 </table>
"""
print(list(get_tags(soup(html, 'html.parser'), {'td':{'style':'width:40%'}, 'tr':{'style':'align:top'}})))

Output:

[<tr style="align:top">
  <td style="width:40%"></td>
  <td class="ignore_this" style="align:top"></td>
 </tr>, <td style="width:40%"></td>]

The recursive function enables you to provide your own dictionary with desired target attributes for certain tags: this solution attempts to match any of the specified attributes to the bs4 object passed to the function, and if a match is discovered, the element is yielded.
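The same tag-to-attributes dictionary could also be turned into a CSS selector group and handed to `soup.select`. This is a rough sketch rather than a drop-in replacement: `build_selector` is a hypothetical helper, the sample HTML uses real `width` and `valign` attributes as in the question, and it assumes a bs4 version where `select` is backed by Soup Sieve.

```python
from bs4 import BeautifulSoup

def build_selector(params):
    # One tag[attr="value"] selector per (tag, attribute) pair,
    # joined with commas, which CSS treats as "or".
    parts = [
        f'{tag}[{attr}="{value}"]'
        for tag, attrs in params.items()
        for attr, value in attrs.items()
    ]
    return ", ".join(parts)

# Hypothetical sample markup for illustration.
html = """
<table>
  <tr valign="top">
    <td width="40%">first</td>
    <td valign="top" class="ignore_this">second</td>
  </tr>
</table>
"""

params = {"td": {"width": "40%"}, "tr": {"valign": "top"}}
selector = build_selector(params)
soup = BeautifulSoup(html, "html.parser")
matches = soup.select(selector)
names = [tag.name for tag in matches]
print(selector)
print(names)
```

Note that `[class="x"]` in a selector compares the whole attribute string, whereas the recursive function checks membership in the class list, so the two are not interchangeable for multi-class elements. The sketch also does no escaping, so values containing double quotes would need extra care.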

Ajax1234
  • What if you have multiple attributes that you are interested in other than `style`? Say, if you were interested in filtering on `style`, `id` and `class`? – InfiniteFlash Aug 28 '19 at 01:05
  • @InfiniteFlashChess `soup.find_all` will try to match every provided attribute, however, I have written a simple recursive function to provide your desired functionality as part of edit 2. – Ajax1234 Aug 28 '19 at 01:24
  • Apologies, I deleted my earlier comment because I was trying to accurately write what I wanted and realized it wasn't proper. I will try to restate myself. @Ajax1234 – InfiniteFlash Aug 28 '19 at 01:28
  • I am interested in getting the tag `td` with the attribute `"width":"40%"`, and the tag `tr` with the attribute `'valign':'top'`. I do not want a `td` tag with an attribute `'valign':'top'` nor a `tr` tag with an attribute `"width":"40%"`. That's how I originally interpreted the OP. Once again, apologies for making you waste your time with your recent edit. @Ajax1234 Please let me know if that makes sense. Basically, I'm trying to match two different bs4 elements that have 2 different tags & attributes. – InfiniteFlash Aug 28 '19 at 01:33
  • @InfiniteFlashChess No problem at all. As your question currently stands, you could simply use `soup.find_all('td', {'style':"width:40%"})` for your desired `td` result and `soup.find_all('tr', {'style':"valign:top"})` for the `tr`. However, I added a recursive solution to enable you to expand your requirements by simply providing an input dictionary that specifies the target attributes for each tag. – Ajax1234 Aug 28 '19 at 02:09
  • 1
    The reason I don't want to use 2 `soup.find_all()` statements is that the `tr` tag isn't nested within the `td` tag. They are at the same hierarchy level, as the OP implies. @Ajax1234 Let me check the recursive function again, thanks. I did find an alternative solution to yours, but the least I can do is check if yours works. Also, I edit and repost quite a bit. I have no idea why they downvoted it. Probably some jealous schmuck (I upvoted it back up). – InfiniteFlash Aug 28 '19 at 12:45
  • @InfiniteFlashChess Thank you, please let me know how the recursive solution works for you. – Ajax1234 Aug 28 '19 at 13:33
  • From what I've tested, your function works as I wanted. Thanks a bunch. I will award you the bounty when it permits me. If anyone sees this post, please also see this [link](https://stackoverflow.com/a/40305890/5874001) for a somewhat similar solution. – InfiniteFlash Aug 28 '19 at 14:47
  • @InfiniteFlashChess Thank you. Glad to help! – Ajax1234 Aug 28 '19 at 14:47
1

Let's say bsObj is your BeautifulSoup object. Try:

tr = bsObj.findAll('tr', {'valign': 'top'})
td = tr.findAll('td', {'width': '40%'})

Hope this helps.

Tarun Gupta
  • 1
    I do not think it works, but maybe I am missing something. The output of the first line is a ResultSet, and when you call find_all on that ResultSet in the second line, it throws an error saying that ResultSet does not have a find_all method. I am using bs4. – PagMax Nov 08 '16 at 04:50
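As the comment above points out, the first `findAll` returns a ResultSet (essentially a list of tags), which has no `findAll`/`find_all` of its own. One way to keep the idea of this answer is to search inside each matching row instead. The sketch below assumes the cells of interest are nested inside the matching rows; the sample HTML is invented for illustration, and the `select` line assumes a bs4 version with CSS selector support.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not the OP's page.
html = """
<table>
  <tr valign="top">
    <td width="40%">a</td>
    <td width="60%">b</td>
  </tr>
  <tr valign="bottom">
    <td width="40%">c</td>
  </tr>
</table>
"""

bsObj = BeautifulSoup(html, "html.parser")

# find_all returns a list-like ResultSet, so call find_all on each row in turn.
cells = [
    td
    for tr in bsObj.find_all("tr", {"valign": "top"})
    for td in tr.find_all("td", {"width": "40%"})
]
texts = [td.get_text() for td in cells]

# Same idea as a CSS selector: a td[width="40%"] somewhere inside a tr[valign="top"].
selected = bsObj.select('tr[valign="top"] td[width="40%"]')
selected_texts = [td.get_text() for td in selected]
print(texts, selected_texts)
```

Both variants implement the "and" reading of the question (a matching cell inside a matching row); they would not help if the `tr` and `td` of interest are unrelated elements.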