Is there a way to find class name and take the whole text of parent tag?

Question

I have a lot of HTML files and I need to extract the full header from each one. The header spans are tagged differently from file to file: class="c6", class="c7"

I have tried BeautifulSoup

for head_c6 in soup.find_all('span', attrs={'class': 'c6'}):
    print(head_c6.get_text())
for head_c7 in soup.find_all('span', attrs={'class': 'c7'}):
    print(head_c7.get_text())

but the result is:

Q3 2017 American Express Co Earnings Call - Final LENGTH:

Q2 2016 Akamai Technologies Inc Call - Final Earnings

Here is what the different files look like:

File 1

<div class="c4">
<p class="c5">
<span class="c6">
      Q3 2017 American Express Co Earnings Call - Final
     </span>
</p>
</div>
<div class="c4">
<p class="c5">
<span class="c7">
      LENGTH:
     </span>
<span class="c2">
      11051 words
     </span>
</p>
</div>

File 2

<div class="c4">
<p class="c5">
<span class="c6">
      Q2 2018 Akamai Technologies Inc
     </span>
<span class="c7">
      Earnings
     </span>
<span class="c6">
      Call - Final
     </span>
</p>
</div>

File 3

<div class="c4">
    <p class="c5">
     <span class="c6">
      Q4 2018
     </span>
     <span class="c7">
      Facebook
     </span>
     <span class="c6">
      Inc
     </span>
     <span class="c7">
      Earnings
     </span>
     <span class="c6">
      Call - Final
     </span>
    </p>

What I want is to get the full text of each header:

Q3 2017 American Express Co Earnings Call - Final

Q2 2018 Akamai Technologies Inc Earnings Call - Final

Q4 2018 Facebook Inc Earnings Call - Final

Possible duplicate of [How to find elements by class](https://stackoverflow.com/questions/5041008/how-to-find-elements-by-class) — Siddharth Das, May 14 '19 at 09:58
@Siddharth Das I have edited, showing the result of find_all — NKam, May 14 '19 at 10:12
I know you accepted an answer, but I'm still confused: the phrase "Earnings Call - Final" is in all outputs; assuming that's also the case in real life, why bother looking for it? Don't you really only need the name of the company and fiscal quarter? — Jack Fleeting, May 14 '19 at 11:39
I have shown only 3 companies. There are about 1500 companies, and the headers have different endings, like "Earnings Conference Call - Final". — NKam, May 14 '19 at 12:18

KunduK · Accepted Answer · 2019-05-14T10:14:14.687

Use a regular expression (the re module) as the class filter. I have used the HTML of the last file here; you can do the same with the remaining files.

from bs4 import BeautifulSoup
import re
data='''<div class="c4">
    <p class="c5">
     <span class="c6">
      Q4 2018
     </span>
     <span class="c7">
      Facebook
     </span>
     <span class="c6">
      Inc
     </span>
     <span class="c7">
      Earnings
     </span>
     <span class="c6">
      Call - Final
     </span>
    </p>'''

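# Parse the snippet with Python's built-in HTML parser
soup = BeautifulSoup(data, 'html.parser')

# re.compile("c") matches every span whose class contains the letter "c"
items = [item.text.strip() for item in soup.find_all('span', class_=re.compile("c"))]
stritem = ' '.join(items)
print(stritem.replace('\n', ''))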

Output:

Q4 2018 Facebook Inc Earnings Call - Final

You can also restrict the match to the two classes you care about:

# "c6|c7" matches spans whose class contains c6 or c7
items = [item.text.strip() for item in soup.find_all('span', class_=re.compile("c6|c7"))]
stritem = ' '.join(items)
print(stritem.replace('\n', ''))

Or, to get the text of the parent tag, try this:

from bs4 import BeautifulSoup
import re
data='''<div class="c4">
    <p class="c5">
     <span class="c6">
      Q4 2018
     </span>
     <span class="c7">
      Facebook
     </span>
     <span class="c6">
      Inc
     </span>
     <span class="c7">
      Earnings
     </span>
     <span class="c6">
      Call - Final
     </span>
    </p>'''

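soup = BeautifulSoup(data, 'html.parser')
# Find the first span with class c6 or c7, then step up to its parent <p>
childtag = soup.find('span', class_=re.compile("c6|c7"))
parenttag = childtag.parent
# split() + join collapses the newlines and indentation between the spans
print(' '.join(parenttag.text.split()))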

I have many c6 and c7 class tags in each file. The first solution took everything, and the second worked well on Facebook and Akamai Technologies, but for American Express it returned "Q3 2017 American Express Co Earnings Call - Final LENGTH: LOAD-DATE: LANGUAGE:". I do not need LENGTH: LOAD-DATE: LANGUAGE: — NKam, May 14 '19 at 10:36
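For the American Express case raised in the comment above, one possible approach is to take only the paragraph that holds the first c6/c7 span and ignore every later one. This is a minimal sketch, assuming the title always sits in its own <p> as in File 1 of the question; header_text is a hypothetical helper name, and the real files may be laid out differently.

```python
import re
from bs4 import BeautifulSoup

def header_text(html):
    """Return the text of the <p> holding the first c6/c7 span, or None."""
    soup = BeautifulSoup(html, 'html.parser')
    # Anchored pattern so that classes such as "c60" are not matched
    first_span = soup.find('span', class_=re.compile(r'^c[67]$'))
    if first_span is None:
        return None
    # Fall back to the direct parent if the span is not inside a <p>
    paragraph = first_span.find_parent('p') or first_span.parent
    # split() with no arguments collapses newlines and runs of spaces
    return ' '.join(paragraph.get_text().split())

# File 1 from the question: the title and "LENGTH:" are in separate <p> tags
file1 = '''<div class="c4">
<p class="c5">
<span class="c6">
      Q3 2017 American Express Co Earnings Call - Final
     </span>
</p>
</div>
<div class="c4">
<p class="c5">
<span class="c7">
      LENGTH:
     </span>
<span class="c2">
      11051 words
     </span>
</p>
</div>'''

# File 2 from the question: the title is split across several spans
file2 = '''<div class="c4">
<p class="c5">
<span class="c6">
      Q2 2018 Akamai Technologies Inc
     </span>
<span class="c7">
      Earnings
     </span>
<span class="c6">
      Call - Final
     </span>
</p>
</div>'''

print(header_text(file1))
print(header_text(file2))
```

Because only the first matching paragraph is read, the later "LENGTH:" paragraph in File 1 is never visited.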

bharatk · Answer 2 · 2019-07-13T06:51:27.237

strip() is a built-in string method that removes all leading and trailing whitespace from a string.

str.join(iterable) returns a string that is the concatenation of the strings in iterable.
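As a small illustration of how these two methods behave on text like the header spans, here is a sketch using only plain strings. The sample string and the normalize_whitespace name are made up for the example.

```python
# Text as it might come out of nested tags: newlines plus indentation
raw = '\n      Q4 2018\n     \n      Facebook\n     '

# strip() only trims the two ends; whitespace in the middle is kept
trimmed = raw.strip()

def normalize_whitespace(text):
    # split() with no arguments splits on any run of whitespace,
    # and join() glues the pieces back together with single spaces
    return ' '.join(text.split())

print(repr(trimmed))
print(normalize_whitespace(raw))
```

The first print shows that the inner newlines survive strip(), which is why the split-and-join step is needed to get a one-line header.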

from bs4 import BeautifulSoup

html1 = ''' <div class="c4">
    <p class="c5">
     <span class="c6">
      Q4 2018
     </span>
     <span class="c7">
      Facebook
     </span>
     <span class="c6">
      Inc
     </span>
     <span class="c7">
      Earnings
     </span>
     <span class="c6">
      Call - Final
     </span>
    </p></div>'''

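# The 'lxml' parser requires the lxml package to be installed
soup = BeautifulSoup(html1, 'lxml')
tag = soup.find('div', {'class': 'c4'})
# split() breaks the text on any whitespace; join() rebuilds it with single spaces
header = ' '.join(tag.text.split())
print(header)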

Output:

Q4 2018 Facebook Inc Earnings Call - Final

score 0 · Answer 3 · answered May 14 '19 at 10:54

It seems easier, and more efficient, to pass a comma-separated list of CSS selectors (an "or" list) to select:

from bs4 import BeautifulSoup as bs

html = '''<div class="c4">
    <p class="c5">
     <span class="c6">
      Q4 2018
     </span>
     <span class="c7">
      Facebook
     </span>
     <span class="c6">
      Inc
     </span>
     <span class="c7">
      Earnings
     </span>
     <span class="c6">
      Call - Final
     </span>
    </p>'''

soup = bs(html, 'html.parser')
result = ' '.join([item.text.strip() for item in soup.select('.c6,.c7')])
print(result)
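When a file holds further c6/c7 spans below the header, as in File 1 of the question, the same selector list can be run inside one paragraph instead of over the whole document. The following is only a sketch, under the assumption that the header is the first p.c5 in the file; that assumption comes from the sample files, not from anything confirmed about the full data set.

```python
from bs4 import BeautifulSoup

# File 1 from the question
html = '''<div class="c4">
<p class="c5">
<span class="c6">
      Q3 2017 American Express Co Earnings Call - Final
     </span>
</p>
</div>
<div class="c4">
<p class="c5">
<span class="c7">
      LENGTH:
     </span>
<span class="c2">
      11051 words
     </span>
</p>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

# Unscoped: matches c6/c7 spans anywhere, so "LENGTH:" is picked up too
everything = ' '.join(s.get_text(strip=True) for s in soup.select('.c6, .c7'))

# Scoped: select_one returns the first p.c5, and select then runs inside it
first_paragraph = soup.select_one('p.c5')
header = ' '.join(
    s.get_text(strip=True) for s in first_paragraph.select('span.c6, span.c7')
)

print(everything)
print(header)
```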
