
Hope you are all well! I'm new and using Python 2.7! I'm trying to extract emails from a publicly available directory website that does not seem to have an API. This is the site: http://www.tecomdirectory.com/companies.php?segment=&activity=&search=category&submit=Search. The code stops gathering emails at the bottom of the page, where it says "load more". Here is my code:

import requests
import re
from bs4 import BeautifulSoup

file_handler = open('mail.txt', 'w')

soup = BeautifulSoup(requests.get('http://www.tecomdirectory.com/companies.php?segment=&activity=&search=category&submit=Search').content, "html.parser")
tags = soup('a')
list_new = []
for tag in tags:
    # keep only mailto: anchors whose link text repeats the address
    matches = re.findall(r'href="mailto:([^"@]+@[^"]+)">\1</a>', '%s' % tag)
    if matches:
        list_new = list_new + matches

for x in list_new:
    file_handler.write('%s\n' % x)
file_handler.close()

How can I make sure that the code goes all the way to the end of the directory and does not stop where it shows "load more"? Thanks. Warmest regards

PIMg021
  • I'm guessing everything on the page after "load more" is dynamically loaded using at least some javascript. Beautifulsoup does not execute javascript, so it can't read dynamically loaded content. – Kevin Sep 23 '16 at 19:02
  • Hi Kevin! Thanks for the reply. Could you advise me how to work around this problem, or whether there is any module in Python that does that? Thanks – PIMg021 Sep 23 '16 at 19:09
  • This looks like it might be useful: [Web-scraping JavaScript page with Python](http://stackoverflow.com/q/8049520/953482) – Kevin Sep 23 '16 at 19:11
  • https://github.com/niklasb/dryscrape – Bin Ury Sep 23 '16 at 19:21
  • Why are you using a regex!! – Padraic Cunningham Sep 23 '16 at 19:44
  • Hi Padraic! It's so that I can extract only the emails from the full file and write them to a separate file! Is there a better or simpler way to extract all the data without it stopping at the bottom? (See the sketch after these comments.) – PIMg021 Sep 23 '16 at 19:51
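
As the comments suggest, the regex is unnecessary. A minimal parser-only sketch (reusing the URL and output file from the question, and assuming the page's mailto: links carry the address in the href attribute) would be:

import requests
from bs4 import BeautifulSoup

url = ('http://www.tecomdirectory.com/companies.php'
       '?segment=&activity=&search=category&submit=Search')
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# href="mailto:someone@example.com" -> keep the part after "mailto:"
emails = [a["href"].replace("mailto:", "", 1)
          for a in soup.find_all("a", href=True)
          if a["href"].startswith("mailto:")]

with open('mail.txt', 'w') as file_handler:
    for email in emails:
        file_handler.write('%s\n' % email)

This still only sees the first page; the "load more" pagination itself is handled in the answer below.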

1 Answer


You just need to post some form data, in particular an incrementing group_no, to simulate clicking the load more button:

from bs4 import BeautifulSoup
import requests

# you can set whatever here to influence the results
data = {"group_no": "1",
        "search": "category",
        "segment": "",
        "activity": "",
        "retail": "",
        "category": "",
        "Bpark": "",
        "alpha": ""} 

post = "http://www.tecomdirectory.com/getautocomplete_keyword.php"

with requests.Session() as s:
    soup = BeautifulSoup(
        s.get("http://www.tecomdirectory.com/companies.php?segment=&activity=&search=category&submit=Search").content,
        "html.parser")
    print([a["href"] for a in soup.select("a[href^=mailto:]")])
    for i in range(1, 5):
        data["group_no"] = str(i)
        soup = BeautifulSoup(s.post(post, data=data).content, "html.parser")
        print([a["href"] for a in soup.select("a[href^=mailto:]")])

To go until the end, you can loop until the post returns no HTML, which signifies that we cannot load any more pages:

def yield_all_mails():
    data = {"group_no": "1",
            "search": "category",
            "segment": "",
            "activity": "",
            "retail": "",
            "category": "",
            "Bpark": "",
            "alpha": ""}

    post = "http://www.tecomdirectory.com/getautocomplete_keyword.php"
    start = "http://www.tecomdirectory.com/companies.php?segment=&activity=&search=category&submit=Search"
    with requests.Session() as s:
        resp = s.get(start)
        soup = BeautifulSoup(resp.content, "html.parser")
        yield (a["href"] for a in soup.select('a[href^="mailto:"]'))
        i = 1
        while resp.content.strip():
            data["group_no"] = str(i)
            resp = s.post(post, data=data)
            soup = BeautifulSoup(resp.content, "html.parser")
            yield (a["href"] for a in soup.select("a[href^=mailto:]"))
            i += 1

So if we ran the function like below, setting "alpha": "Z" to iterate over just the Z's:

from itertools import chain
for mail in chain.from_iterable(yield_all_mails()):
    print(mail)

We would get:

mailto:info@10pearls.com
mailto:fady@24group.ae
mailto:pepe@2heads.tv
mailto:2interact@2interact.us
mailto:gc@worldig.com
mailto:marilyn.pais@3i-infotech.com
mailto:3mgulf@mmm.com
mailto:venkat@4gid.com
mailto:info@4power.biz
mailto:info@4sstudyabroad.com
mailto:fouad@622agency.com
mailto:sahar@7quality.com
mailto:mike.atack@8ack.com
mailto:zyara@emirates.net.ae
mailto:aokasha@zynx.com

Process finished with exit code 0

You should put a sleep in between requests so you don't hammer the server and get yourself blocked.
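
For example, the pause could be dropped into the loop like this (a sketch; yield_all_mails_politely and the one-second default are illustrative names and values, not part of the original answer):

from time import sleep

import requests
from bs4 import BeautifulSoup

def yield_all_mails_politely(delay=1.0):
    # same as yield_all_mails above, but pauses between requests
    data = {"group_no": "1", "search": "category", "segment": "",
            "activity": "", "retail": "", "category": "",
            "Bpark": "", "alpha": ""}
    post = "http://www.tecomdirectory.com/getautocomplete_keyword.php"
    start = ("http://www.tecomdirectory.com/companies.php"
             "?segment=&activity=&search=category&submit=Search")
    with requests.Session() as s:
        resp = s.get(start)
        soup = BeautifulSoup(resp.content, "html.parser")
        yield (a["href"] for a in soup.select('a[href^="mailto:"]'))
        i = 1
        while resp.content.strip():
            sleep(delay)  # pause so we don't hammer the server
            data["group_no"] = str(i)
            resp = s.post(post, data=data)
            soup = BeautifulSoup(resp.content, "html.parser")
            yield (a["href"] for a in soup.select('a[href^="mailto:"]'))
            i += 1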

Padraic Cunningham
  • Hi Padraic! Thanks as always for the help! I tried the code; however, it first gives an empty list and right after a traceback at line 26: NoneType object is not callable – PIMg021 Sep 23 '16 at 20:04
  • @PIMg021, what you see in the question output is what you should get, I have run the code myself. You must be doing something differently. What version of both requests and bs4 are you using? – Padraic Cunningham Sep 23 '16 at 20:11
  • Hi Padraic! I downloaded bs4 and requests yesterday, so I guess they are the latest. Where do I need to insert the alpha and Z to test it? Also, must the from itertools import chain part be placed directly below the previous code? – PIMg021 Sep 23 '16 at 20:24
  • Yep, run it exactly as posted. The Z is which letter to search under; if you leave it blank it will go from a-z, and obviously that would take quite a while, so I just used Z – Padraic Cunningham Sep 23 '16 at 20:30
  • If I run the first part of the code it comes back with an error at line 24, the one with print..., saying that NoneType object is not callable! – PIMg021 Sep 23 '16 at 20:35
  • Hi Padraic, I posted a new question on this problem! I'm starting to get the hang of how to use Stack Overflow properly :) : http://stackoverflow.com/questions/39669397/python-2-7-beautifulsoup-error-type-nonetype-object-not-callable – PIMg021 Sep 23 '16 at 20:55
  • Hi Padraic, I managed to run it; however, it stops at letter K! Any suggestions? Thanks. Regards – PIMg021 Sep 24 '16 at 15:38
  • @PIMg021, you may need to put a sleep in and/or use a new `requests.post(..)` in place of the Session object (sketched below). – Padraic Cunningham Sep 24 '16 at 15:45
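
The last comment's suggestion might look like this (a sketch, not part of the original answer; it replaces the Session with plain requests.post calls, adds a pause, and only walks the paginated posts, skipping the initial GET):

import time

import requests
from bs4 import BeautifulSoup

post = "http://www.tecomdirectory.com/getautocomplete_keyword.php"
data = {"group_no": "1", "search": "category", "segment": "",
        "activity": "", "retail": "", "category": "",
        "Bpark": "", "alpha": ""}

i = 1
while True:
    resp = requests.post(post, data=data)  # a fresh connection each time, no Session
    if not resp.content.strip():
        break  # an empty body means there are no more pages
    soup = BeautifulSoup(resp.content, "html.parser")
    for a in soup.select('a[href^="mailto:"]'):
        print(a["href"])
    i += 1
    data["group_no"] = str(i)
    time.sleep(1)  # pause between requests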