I am scraping some websites in Python 3 with Beautiful Soup to extract their Facebook page URLs. For each website I want only one URL: one that leads to a Facebook page or profile and is not a share-type link.

To do this, I am trying to use a regular expression to match the href attribute of <a> tags. With Beautiful Soup I take the first <a> tag on each website whose href value contains a Facebook page URL.

My code is the following:

# coding=utf-8
from bs4 import BeautifulSoup
import requests
import re


def makeSoup(website):
    if website.startswith(('http://', 'https://')):
        page = requests.get(website)
    else:
        page = requests.get('http://' + website)
    soup = BeautifulSoup(page.content, 'html.parser')
    page.close()
    return soup


def facebookProfileScraper(soup):
    link = soup.find('a', {'href': re.compile("https?://(www\\.)?facebook\\.com/[^(share)]?(\\w+\\.?)+")})
    if link is None:
        return "NaN"
    return link['href'] 

Examples of the <a> tags from which I'd like to extract the URLs are the following (the numbers identify each website and are reused in the results of my attempts below):

1. <a class="rss fb" href="http://www.facebook.com/gironafc" target="_blank">Facebook</a>
2. <a href="https://www.facebook.com/waterworld.parcaquatic" target="_blank"><i class="fa fa-facebook"></i></a>
3. <a class="social facebook" target="_blank" href="https://www.facebook.com/aquabrava"><span class="fa fa-facebook"></span></a>
4. <a href="https://www.facebook.com/UEO1921" target="_blank"><img alt="Facebook" height="32" src="http://www.ueolot.com/wp-content/themes/realsoccer/images/light/social-icon/facebook.png" width="32"/>
</a>
5. <a href="https://www.facebook.com/Roc%C3%B2drom-Girona-187271461378780/">Facebook</a>
6. <a class="fb_share" href="https://www.facebook.com/pages/Skydive-Empuriabrava/44214266003?fref=ts" target="_blank"></a>

First attempt

https?://(www\\.)?facebook\\.com/[^(share)]?(\\w+\\.?)+

But I got these <a> tags:

1. <a href="http://facebook.com/share.php?src=bm&amp;v=3&amp;u=" target="_blank"><span class="fa fa-facebook"></span></a>
2. <a href="https://www.facebook.com/waterworld.parcaquatic" target="_blank"><i class="fa fa-facebook"></i></a>
3. <a class="social facebook" href="https://www.facebook.com/aquabrava" target="_blank"><span class="fa fa-facebook"></span></a>
4. <a href="https://www.facebook.com/UEO1921" target="_blank"><img alt="Facebook" height="32" src="http://www.ueolot.com/wp-content/themes/realsoccer/images/light/social-icon/facebook.png" width="32"/>
</a>
5. <a href="https://www.facebook.com/Roc%C3%B2drom-Girona-187271461378780/">Facebook</a>
6. <a class="fb_share" href="https://www.facebook.com/pages/Skydive-Empuriabrava/44214266003?fref=ts" target="_blank"></a>

For website 1, I get the wrong <a> tag: a share link instead of the page link.

Second attempt

https?://(www\\.)?facebook\\.com/[^(share)](\\w+\\.?)+

I tried removing the ? after [^(share)], but I got the following tags:

1. <a class="rss fb" href="http://www.facebook.com/gironafc" target="_blank">Facebook</a>
2. <a href="https://www.facebook.com/waterworld.parcaquatic" target="_blank"><i class="fa fa-facebook"></i></a>
3. None
4. <a href="https://www.facebook.com/UEO1921" target="_blank"><img alt="Facebook" height="32" src="http://www.ueolot.com/wp-content/themes/realsoccer/images/light/social-icon/facebook.png" width="32"/>
</a>
5. <a href="https://www.facebook.com/Roc%C3%B2drom-Girona-187271461378780/">Facebook</a>
6. <a class="fb_share" href="https://www.facebook.com/pages/Skydive-Empuriabrava/44214266003?fref=ts" target="_blank"></a>

For website 3, no <a> tag is extracted: [^(share)] is a character class, not a negated word, so it rejects any URL whose first character after http://www.facebook.com/ is one of (, ), s, h, a, r, or e. Here, aquabrava starts with a.
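The behaviour of the first two attempts can be reproduced with the re module alone. The snippet below is a minimal sketch that applies both patterns to hrefs taken from the examples above; the variable names are only illustrative.

```python
import re

# First attempt: the negated character class is optional
first_attempt = re.compile(r"https?://(www\.)?facebook\.com/[^(share)]?(\w+\.?)+")
# Second attempt: the negated character class is mandatory
second_attempt = re.compile(r"https?://(www\.)?facebook\.com/[^(share)](\w+\.?)+")

share_url = "http://facebook.com/share.php?src=bm&v=3&u="
aquabrava = "https://www.facebook.com/aquabrava"
gironafc = "http://www.facebook.com/gironafc"

# The optional class matches zero characters, so the share link still passes
print(bool(first_attempt.search(share_url)))   # True

# The mandatory class rejects the share link, but also any page name
# that starts with one of the characters ( ) s h a r e
print(bool(second_attempt.search(share_url)))  # False
print(bool(second_attempt.search(aquabrava)))  # False
print(bool(second_attempt.search(gironafc)))   # True
```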

Third attempt

https?://(www\\.)?facebook\\.com/(\\w+\\.?)+

I tried removing [^(share)] entirely, but the tags I got were the same as in the first attempt, so the share-type URL was extracted too.

How can I select only the <a> tags whose href value is not a share-type Facebook URL?

QHarr
silviacamplani

3 Answers

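import requests
from bs4 import BeautifulSoup


def foo(url):
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    # href=True skips <a> tags without an href, so .lower() never sees None
    links = soup.find_all("a", href=True)
    # Keep only the links whose href does not contain "share"
    return [link for link in links if "share" not in link["href"].lower()]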

This function returns every <a> tag whose href does not contain "share". Note that it does not restrict the results to Facebook URLs.

Yashik

I found a solution by improving the regex. This question helped me a lot. Here's the regex for my case:

https?://(www\.)?facebook\.com/(?!share\.php).(\S+\.?)+

It matches all of the <a> tags with a Facebook page URL as href value.

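The negative lookahead (?!share\.php) succeeds only if the text at that position does not start with share.php, so URLs of the form facebook.com/share.php... are not matched. In general, (?!expression) rejects a match when expression occurs at that exact position, not when it appears anywhere in the string.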
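As a quick check, here is a minimal sketch that applies this regex to a few hrefs from the question using only the re module. The names PROFILE_RE, hrefs and profile_links are just illustrative.

```python
import re

# Regex from this answer: the lookahead rejects paths that start with share.php
PROFILE_RE = re.compile(r"https?://(www\.)?facebook\.com/(?!share\.php).(\S+\.?)+")

hrefs = [
    "http://facebook.com/share.php?src=bm&v=3&u=",
    "http://www.facebook.com/gironafc",
    "https://www.facebook.com/aquabrava",
]

# Keep only the hrefs that the pattern accepts
profile_links = [h for h in hrefs if PROFILE_RE.match(h)]
print(profile_links)
# ['http://www.facebook.com/gironafc', 'https://www.facebook.com/aquabrava']
```

Since the lookahead names share.php specifically, other share endpoints would need their own alternative inside the lookahead.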

silviacamplani

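With bs4 4.7.1 or later you can use the more efficient CSS attribute selectors: ^= matches the start of the attribute value, *= matches a substring, and :not() excludes elements that match the selector inside it.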

links = [item['href'] for item in soup.select("[href^='https://www.facebook.com/']:not([href*='share'])")]

For the first link only (select_one returns None if nothing matches, so check the result before indexing it):

link = soup.select_one("[href^='https://www.facebook.com/']:not([href*='share'])")['href']
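The selectors above only match hrefs that start with https://www.facebook.com/, so a link such as http://www.facebook.com/gironafc would be skipped. A possible variant, sketched below under the assumption that a substring test on facebook.com/ is acceptable for your pages, also handles the case where no link is found. The helper name first_facebook_href and the sample HTML are illustrative only.

```python
from bs4 import BeautifulSoup

# Sample markup built from two <a> tags shown in the question
html = """
<a href="http://facebook.com/share.php?src=bm&amp;v=3&amp;u=" target="_blank"></a>
<a class="rss fb" href="http://www.facebook.com/gironafc" target="_blank">Facebook</a>
"""


def first_facebook_href(soup):
    """Return the first non-share Facebook href, or None if there is none."""
    # *= is a substring test, so this covers http/https and with/without www
    tag = soup.select_one("a[href*='facebook.com/']:not([href*='share'])")
    return tag["href"] if tag is not None else None


soup = BeautifulSoup(html, "html.parser")
print(first_facebook_href(soup))  # http://www.facebook.com/gironafc
```

The substring test is deliberately loose: it would also exclude a page whose name contains "share", and it accepts any host that ends in facebook.com/, so tighten it if that matters for your data.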
QHarr