2

I am trying to get all links, titles, and dates for a specific month (for example, March) from the website below. I'm using BeautifulSoup to do so:

from bs4 import BeautifulSoup
import requests

html_link='https://www.pds.com.ph/index.html%3Fpage_id=3261.html'
html = requests.get(html_link).text
soup = BeautifulSoup(html, 'html.parser')
for link in soup.find_all('td'):
    # Keep only cells whose text contains 'March'
    # Get the link title, the link, and the date
    pass  # TODO: not sure how to do this part

I'm new to BeautifulSoup. In Selenium I used the XPath "//td[contains(text(),'Mar')]". How can I do the same with BeautifulSoup?

MendelG
Joyce

2 Answers

4

To get all links and titles whose date contains the text "March":

  1. Find the dates: locate all <td> elements whose text contains "March" (the match is case-sensitive).

  2. For each one, find the preceding <a> tag with the .find_previous() method; that tag holds the desired title and link.


import requests
from bs4 import BeautifulSoup


url = "https://www.pds.com.ph/index.html%3Fpage_id=3261.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

fmt_string = "{:<20} {:<130} {}"
print(fmt_string.format("Date", "Title", "Link"))
print('-' * 200)

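# ":-soup-contains" is the current spelling; plain ":contains" is deprecated in recent soupsieve releases
for tag in soup.select("td:-soup-contains('March')"):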
    a_tag = tag.find_previous("a")
    print(
        fmt_string.format(
            tag.text, a_tag.text, "https://www.pds.com.ph/" + a_tag["href"],
        )
    )

Output (truncated):

Date                 Title                                                                                                                              Link
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
March 31, 2021       RCBC Lists PHP 17.87257 Billion ASEAN Sustainability Bonds on PDEx                                                                 https://www.pds.com.ph/index.html%3Fp=87239.html
March 16, 2021       Aboitiz Power Corporation Raises 8 Billion Fixed Rate Bonds on PDEx                                                                https://www.pds.com.ph/index.html%3Fp=86743.html
March 1, 2021        Century Properties Group, Inc Returns to PDEx with PHP 3 Billion Fixed Rate Bonds                                                  https://www.pds.com.ph/index.html%3Fp=86366.html
March 27, 2020       BPI Lists Over PhP 33 Billion of Fixed Rate Bonds on PDEx                                                                          https://www.pds.com.ph/index.html%3Fp=74188.html
March 25, 2020       SM Prime Raises PHP 15 Billion Fixed Rate Bonds on PDEx                                                                            https://www.pds.com.ph/index.html%3Fp=74082.html
...
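
The same two steps can be tried offline, which may help when the live page is unavailable. This is only a minimal sketch: `SAMPLE_HTML` is made-up markup that is assumed to mirror the page (title link in one cell, date in the next), and `rows_for_month` and `BASE_URL` are hypothetical names introduced here. It uses the `:-soup-contains` pseudo-class, the non-deprecated spelling of `:contains` in recent soupsieve releases, and `urllib.parse.urljoin` rather than string concatenation to build the absolute link.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup, assumed to resemble the real table rows.
SAMPLE_HTML = """
<table>
  <tr><td><a href="index.html%3Fp=1.html">First listing</a></td><td>March 31, 2021</td></tr>
  <tr><td><a href="index.html%3Fp=2.html">Second listing</a></td><td>April 2, 2021</td></tr>
  <tr><td><a href="index.html%3Fp=3.html">Third listing</a></td><td>March 27, 2020</td></tr>
</table>
"""

BASE_URL = "https://www.pds.com.ph/"


def rows_for_month(html, month):
    """Return (date, title, absolute_link) tuples for cells mentioning `month`."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # Step 1: select every <td> whose text contains the month name.
    for date_cell in soup.select(f"td:-soup-contains('{month}')"):
        # Step 2: walk backwards in the document to the nearest <a> tag.
        a_tag = date_cell.find_previous("a")
        if a_tag is None:
            continue  # no link precedes this cell, so skip it
        rows.append(
            (
                date_cell.get_text(strip=True),
                a_tag.get_text(strip=True),
                urljoin(BASE_URL, a_tag["href"]),
            )
        )
    return rows


for date, title, link in rows_for_month(SAMPLE_HTML, "March"):
    print(date, "|", title, "|", link)
```

Note that `.find_previous("a")` relies on the link appearing before the date in document order; if a row had no link, it would pick up the link from an earlier row.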
MendelG
  • 1
    Thank you so much for your help! May I ask: is `"{:<20} {:<130} {}"` a regular expression, and what does `print('-' * 200)` mean? Thanks :)! – Joyce May 06 '21 at 01:35
  • @Joyce 1. `"{:<20} {:<130} {}"` is not a regular expression; it is a format string that left-aligns each value and pads it with spaces to a fixed width. See [this](https://stackoverflow.com/questions/5676646/how-can-i-fill-out-a-python-string-with-spaces) SO post. 2. `print('-' * 200)` prints '-' 200 times; I used it as the rule under the header. Try it out in a regular interpreter. – MendelG May 06 '21 at 01:44
  • 1
    got it, thank you!! – Joyce May 06 '21 at 02:31
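
For readers with the same question, here is a small illustration of the two idioms discussed in these comments. The widths are shortened from the answer's 20/130 to 20/10 purely so the result is easy to read; the values are placeholders.

```python
# "{:<20}" means: left-align the value and pad with spaces to width 20.
fmt_string = "{:<20} {:<10} {}"
row = fmt_string.format("March 31, 2021", "Title", "link")

# Multiplying a string by an integer repeats it.
header_rule = "-" * 10

print(repr(row))
print(header_rule)
```

The same padding can be written with `str.ljust`, for example `"Title".ljust(10)`.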
3

Here is a solution you can try out:

import re
import requests

from bs4 import BeautifulSoup

html_link = 'https://www.pds.com.ph/index.html%3Fpage_id=3261.html'
html = requests.get(html_link).text
soup = BeautifulSoup(html, 'html.parser')

search = re.compile("March")

# `string=` is the current name of the older `text=` argument
for td in soup.find_all('td', string=search):
    link = td.parent.select_one("td > a")

    if link:
        print(f"Title : {link.text}")
        print(f"Link : {link['href']}")
        print(f"Date : {td.text}")
        print("-" * 30)

Output (truncated):

Title : RCBC Lists PHP 17.87257 Billion ASEAN Sustainability Bonds on PDEx
Link : index.html%3Fp=87239.html
Date : March 31, 2021
------------------------------
Title : Aboitiz Power Corporation Raises 8 Billion Fixed Rate Bonds on PDEx
Link : index.html%3Fp=86743.html
Date : March 16, 2021
------------------------------
....
sushanth
  • Thank you so much for your help, it really helps! May I ask how I could make it find only March 2021? – Joyce May 05 '21 at 03:28
  • Just adjust the regex: `search = re.compile(r"March.+2021")` – sushanth May 05 '21 at 03:30
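
As a quick check of the pattern suggested above, the sketch below applies it to a few date strings in the same format as the page's output. The list `dates` is sample data chosen for illustration, and the sketch calls `search()` directly rather than going through BeautifulSoup.

```python
import re

# Pattern from the comment above: "March", then at least one character, then "2021".
search = re.compile(r"March.+2021")

# Sample date strings in the format shown in the output.
dates = ["March 31, 2021", "March 27, 2020", "April 2, 2021"]

matches = [d for d in dates if search.search(d)]
print(matches)
```

Only the first string is kept: the second has the wrong year and the third the wrong month.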
  • Hi, I am confused by `td.parent.select_one("td > a")`. Is this like ancestor:: in Selenium, choosing the parent tag of the td? Thanks! – Joyce May 12 '21 at 09:58
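
One way to see what `td.parent.select_one("td > a")` does is to run it on a single made-up row. The markup below is hypothetical and only assumed to resemble the page. In this sketch, `.parent` moves up exactly one level to the enclosing `<tr>` (closer to XPath's `parent::` or `..` than to `ancestor::`), and `select_one("td > a")` then searches inside that row for the first `<a>` that is a direct child of a `<td>`.

```python
from bs4 import BeautifulSoup

# Hypothetical single-row table, assumed to resemble the real markup.
html = (
    '<table><tr>'
    '<td><a href="index.html%3Fp=1.html">First listing</a></td>'
    '<td>March 31, 2021</td>'
    '</tr></table>'
)
soup = BeautifulSoup(html, "html.parser")

# Start from the date cell, as the loop in the answer does.
date_cell = soup.find("td", string="March 31, 2021")

# .parent is the immediate container of the cell: the <tr>.
row = date_cell.parent

# Within that row, take the first <a> that is a direct child of a <td>.
link = row.select_one("td > a")

print(row.name, "|", link.text, "|", link["href"])
```

Because the search is limited to the same row, this approach cannot pick up a link from a different row, and `select_one` returns `None` when the row has no link, which is why the answer checks `if link:`.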