I am scraping some websites in Python 3 with Beautiful Soup to extract their Facebook page URLs. For each website I want to extract only one URL: one that points to a Facebook page or profile and is not a share-type link.
For this reason I am attempting to use a regular expression to extract the URLs from the href attribute of <a> tags. With Beautiful Soup I extract the first <a> tag for each website whose href attribute contains a Facebook page URL.
My code is the following:
# coding=utf-8
from bs4 import BeautifulSoup
import requests
import re

def makeSoup(website):
    if 'http' in website:
        page = requests.get(website)
    else:
        page = requests.get('http://' + website)
    soup = BeautifulSoup(page.content, 'html.parser')
    page.close()
    return soup

def facebookProfileScraper(soup):
    link = soup.find('a', {'href': re.compile("https?://(www\\.)?facebook\\.com/[^(share)]?(\\w+\\.?)+")})
    if link is None:
        return "NaN"
    return link['href']
Examples of the <a> tags from which I'd like to extract the URLs are the following (I have numbered them to identify each website, and I use the same numbers for the results of my attempts):
1. <a class="rss fb" href="http://www.facebook.com/gironafc" target="_blank">Facebook</a>
2. <a href="https://www.facebook.com/waterworld.parcaquatic" target="_blank"><i class="fa fa-facebook"></i></a>
3. <a class="social facebook" target="_blank" href="https://www.facebook.com/aquabrava"><span class="fa fa-facebook"></span></a>
4. <a href="https://www.facebook.com/UEO1921" target="_blank"><img alt="Facebook" height="32" src="http://www.ueolot.com/wp-content/themes/realsoccer/images/light/social-icon/facebook.png" width="32"/>
</a>
5. <a href="https://www.facebook.com/Roc%C3%B2drom-Girona-187271461378780/">Facebook</a>
6. <a class="fb_share" href="https://www.facebook.com/pages/Skydive-Empuriabrava/44214266003?fref=ts" target="_blank"></a>
First attempt
https?://(www\\.)?facebook\\.com/[^(share)]?(\\w+\\.?)+
But I got these <a> tags:
1. <a href="http://facebook.com/share.php?src=bm&v=3&u=" target="_blank"><span class="fa fa-facebook"></span></a>
2. <a href="https://www.facebook.com/waterworld.parcaquatic" target="_blank"><i class="fa fa-facebook"></i></a>
3. <a class="social facebook" href="https://www.facebook.com/aquabrava" target="_blank"><span class="fa fa-facebook"></span></a>
4. <a href="https://www.facebook.com/UEO1921" target="_blank"><img alt="Facebook" height="32" src="http://www.ueolot.com/wp-content/themes/realsoccer/images/light/social-icon/facebook.png" width="32"/>
</a>
5. <a href="https://www.facebook.com/Roc%C3%B2drom-Girona-187271461378780/">Facebook</a>
6. <a class="fb_share" href="https://www.facebook.com/pages/Skydive-Empuriabrava/44214266003?fref=ts" target="_blank"></a>
For website 1 I get the wrong <a> tag (a share-type link).
Second attempt
https?://(www\\.)?facebook\\.com/[^(share)](\\w+\\.?)+
I tried removing the ? after [^(share)], but I got the following tags:
1. <a class="rss fb" href="http://www.facebook.com/gironafc" target="_blank">Facebook</a>
2. <a href="https://www.facebook.com/waterworld.parcaquatic" target="_blank"><i class="fa fa-facebook"></i></a>
3. None
4. <a href="https://www.facebook.com/UEO1921" target="_blank"><img alt="Facebook" height="32" src="http://www.ueolot.com/wp-content/themes/realsoccer/images/light/social-icon/facebook.png" width="32"/>
</a>
5. <a href="https://www.facebook.com/Roc%C3%B2drom-Girona-187271461378780/">Facebook</a>
6. <a class="fb_share" href="https://www.facebook.com/pages/Skydive-Empuriabrava/44214266003?fref=ts" target="_blank"></a>
For website 3 I don't extract any <a> tag, because [^(share)] is a character class: it rejects any URL whose first character after http://www.facebook.com/ is a (or any of s, h, r, e, or a parenthesis).
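To make that behaviour concrete, here is a minimal sketch using only the re module. It applies the pattern from the second attempt (written as a raw string, so the backslashes are single) to three of the hrefs above. The names pattern, urls and results are just for illustration. I use search because, as far as I understand, that is how Beautiful Soup applies a compiled pattern to an attribute value.

```python
import re

# Pattern from the second attempt, as a raw string.
pattern = re.compile(r"https?://(www\.)?facebook\.com/[^(share)](\w+\.?)+")

urls = [
    "http://www.facebook.com/gironafc",
    "https://www.facebook.com/aquabrava",
    "http://facebook.com/share.php?src=bm&v=3&u=",
]

# True when the pattern is found in the URL, False otherwise.
results = {url: pattern.search(url) is not None for url in urls}

for url, matched in results.items():
    print(matched, url)

# [^(share)] matches exactly ONE character that is not in the set
# {'(', 's', 'h', 'a', 'r', 'e', ')'}; it does not mean "not the word share".
print(re.fullmatch(r"[^(share)]", "a"))  # None: 'a' is in the excluded set
print(re.fullmatch(r"[^(share)]", "g"))  # a match object: 'g' is allowed
```

So gironafc passes only because it starts with g, while aquabrava is rejected for the same reason share.php is: its first character is in the excluded set.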
Third attempt
https?://(www\\.)?facebook\\.com/(\\w+\\.?)+
I tried removing [^(share)] entirely, but the tags I got were the same as in the first attempt, so I still get the share-type URL.
How can I select only the <a> tags whose href value is not a share-type Facebook URL?
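For reference, the regex construct that expresses "not followed by this word" is a negative lookahead, (?!...). Below is a minimal sketch of that idea, assuming that a share-type URL is one whose path starts with share right after facebook.com/ (that assumption is mine and would also reject a page whose name happens to begin with "share"). The names profile_pattern, hrefs and first_profile are hypothetical, and I have only checked it against the hrefs listed in this question.

```python
import re

# (?!share) succeeds only if the text right after "facebook.com/"
# does NOT start with "share"; it consumes no characters itself.
profile_pattern = re.compile(r"https?://(www\.)?facebook\.com/(?!share)\w+")

hrefs = [
    "http://facebook.com/share.php?src=bm&v=3&u=",
    "http://www.facebook.com/gironafc",
    "https://www.facebook.com/waterworld.parcaquatic",
    "https://www.facebook.com/aquabrava",
    "https://www.facebook.com/UEO1921",
    "https://www.facebook.com/Roc%C3%B2drom-Girona-187271461378780/",
    "https://www.facebook.com/pages/Skydive-Empuriabrava/44214266003?fref=ts",
]

# Keep only the hrefs in which the pattern is found.
accepted = [h for h in hrefs if profile_pattern.search(h)]

# First non-share link, mirroring what soup.find would return first.
first_profile = accepted[0] if accepted else "NaN"
print(first_profile)
```

A compiled pattern like this could be passed as the href filter in soup.find('a', {'href': profile_pattern}), in the same way as in my function above.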