
I am new to asks and trio in Python, and I put together some sample code. Let me explain: I have a list of URLs, each one a news URL, and each has sub-URLs. The first URL is requested, every other href on it is collected into a list, and then the article of each href in that list is fetched. The issue is that sometimes the article comes back and other times it is empty.

I tried the sample code with single URLs, and then it works.

import asks
import trio
from goose3 import Goose
import logging as log
from goose3.configuration import ArticleContextPattern
from pprint import pprint
import json
import time

asks.init('trio') 


async def extractor(path, htmls, paths, session):
    # fetch a single page; on failure the error text is appended
    # in place of the page body
    try:
        r = await session.get(path, timeout=2)
        out = r.content
        htmls.append(out)
        paths.append(path)
    except Exception as e:
        out = str(e)
        htmls.append(out)
        paths.append(path)


async def main(path_list, session):    
    htmls = []
    paths = []
    async with trio.open_nursery() as n:
        for path in path_list:
            n.start_soon(extractor, path, htmls, paths, session)

    return htmls, paths


async def run(urls, conns=50):
    s = asks.Session(connections=conns)
    g = Goose()

    htmls, paths = await main(urls, s)
    print(htmls, "       ", paths)
    cleaned = []
    for html, path in zip(htmls, paths):
        dic = {}
        dic['url'] = path
        if html is not None:
            try:
                #g.config.known_context_pattern = ArticleContextPattern(attr='class', value='the-post')
                article = g.extract(raw_html=html)
                author = article.authors
                dic['goose_text'] = article.cleaned_text
                #print(article.cleaned_text)
                #dic['goose_date'] = article.publish_datetime
                dic['goose_title'] = article.title
                if author:
                    dic['authors'] = author[0]
                else:
                    dic['authors'] = ''
            except Exception as e:
                print(e)
                log.info('goose found no text using html')
                dic['goose_html'] = html
                dic['goose_text'] = ''
                dic['goose_date'] = None
                dic['goose_title'] = None
                dic['authors'] = ''
            cleaned.append(dic)
    return cleaned




async def real_main():
    sss = '[{"crawl_delay_sec": 0, "name": "mining","goose_text":"","article_date":"","title":"", "story_url": "http://www.mining.com/canalaska-start-drilling-west-mcarthur-uranium-project","url": "http://www.mining.com/tag/latin-america/page/1/"},{"crawl_delay_sec": 0, "name": "mining", "story_url": "http://www.mining.com/web/tesla-fires-sound-alarms-safety-electric-car-batteries", "url": "http://www.mining.com/tag/latin-america/page/1/"}]'

    obj = json.loads(sss)
    pprint(obj)

    articles=[]
    for l in obj:
        articles.append(await run([l['story_url']]))
        #await trio.sleep(3)

    pprint(articles)

if __name__ == "__main__":
    trio.run(real_main)

How can I get the article data reliably, without any of it missing?

newuser
  • Please fix this example. You're passing single URLs to run(), but run() expects a list of URLs. – Matthias Urlichs Jun 07 '19 at 17:31
  • Also, please move trio.run to the top level and async-ify `run`. The reason is that the current version of `asks` requires the session to be called within Trio's runtime. – Matthias Urlichs Jun 07 '19 at 17:33
  • Thank you for the reply. My issue is that there is a list of hrefs, and each href should yield an article as html; that is my expectation, but sometimes the html is ['']. Can you tell me whether I need a callback in trio, so that I can be sure html gets its values? – newuser Jun 10 '19 at 01:26
  • Sorry, can you please show me the change needed to move trio.run to the top level and make run async? Please alter the code. – newuser Jun 10 '19 at 01:49
  • OK, did that inline. Now please fix the code as per my first comment so that it actually works – we can't figure out why code sometimes fails when it doesn't work at all. – Matthias Urlichs Jun 10 '19 at 10:05
  • `if __name__ == "__main__": obj = json.loads(sss); articles=[]; urls=[]; for l in obj: urls.append(l['story_url']); pprint(run(urls))` – OK, I updated the code. Please run the script and check whether you get the output; if you do, rerun the code and there won't be output. Please check the output. – newuser Jun 10 '19 at 14:54

2 Answers


I lack some further information to answer your question in depth, but most likely it has to do with the way goose searches for text within the HTML. See this answer for more details: https://stackoverflow.com/a/30408761/8867146
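That is essentially what the line already commented out in the question's code does: telling Goose which element wraps the article body. A minimal sketch, assuming goose3's known_context_patterns configuration; the 'the-post' class is a placeholder that must match the target site's markup, and the URL is one of the story URLs from the question:

from goose3 import Goose
from goose3.configuration import Configuration, ArticleContextPattern

config = Configuration()
# 'the-post' is a placeholder; inspect the site's HTML for the actual
# class (or id) of the element that contains the article text
config.known_context_patterns = [ArticleContextPattern(attr='class', value='the-post')]

g = Goose(config)
article = g.extract(url='http://www.mining.com/canalaska-start-drilling-west-mcarthur-uranium-project')
# in the question's setup you would pass the already-fetched page instead:
# article = g.extract(raw_html=html)
print(article.title)
print(article.cleaned_text[:200])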

Ger
  • Thank you for the reply. My issue is that there is a list of hrefs, and each href should yield an article as **html**; that is my expectation, but sometimes the **html** is **['']**. Can you tell me whether I need a callback in trio, so that I can be sure **html** gets its values? – newuser Jun 07 '19 at 10:58

"asks" does not always raise an exception when the status code is != 200. You need to examine the response's status code before using its content. You might also want to increase the timeout, 2 seconds is not enough, particularly when you're firing off up to 50 connections in parallel.

In any case, here's a simplified program – all that Goose stuff is completely unnecessary for showing the actual error, two result arrays are not a good idea, and adding error messages to the result array looks broken.

Also, you should investigate running the URL fetching and the processing in parallel; trio.open_memory_channel is your friend here (see the sketch after the program below).


import asks
asks.init('trio')

import trio
from pprint import pprint

async def extractor(path, session, results):
    try:
        r = await session.get(path, timeout=2)
        if r.status_code != 200:
            raise asks.errors.BadStatus("Not OK", r.status_code)
        out = r.content
    except Exception as e:
        # do some reasonable error handling
        print(path, repr(e))
    else:
        results.append((out, path))

async def main(path_list, session):
    results = []
    async with trio.open_nursery() as n:
        for path in path_list:
            n.start_soon(extractor, path, session, results)
    return results


async def run(conns=50):
    s = asks.Session(connections=conns)

    urls = [
            "http://www.mining.com/web/tesla-fires-sound-alarms-safety-electric-car-batteries",
            "http://www.mining.com/canalaska-start-drilling-west-mcarthur-uranium-project",
            "https://www.google.com",  # just for testing more parallel connections
            "https://www.debian.org",
            ]

    results = await main(urls, s)
    for content, path in results:
        pass  # analyze this result
    print("OK")

if __name__ == "__main__":
    trio.run(run)
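To flesh out the memory-channel suggestion: a minimal sketch of fetching and processing in parallel, assuming a process_pages() consumer that stands in for the Goose parsing; the buffer size, timeout, and URLs are placeholders, and error handling is omitted.

import asks
asks.init('trio')

import trio

async def fetcher(path, session, send_chan):
    # each fetcher owns a clone of the send channel; closing it on exit
    # is how the receiving side learns that all producers are finished
    async with send_chan:
        r = await session.get(path, timeout=10)
        if r.status_code == 200:
            await send_chan.send((path, r.content))

async def process_pages(recv_chan):
    # consume pages as they arrive, while other fetches are still running
    async for path, html in recv_chan:
        print(path, len(html))  # replace with the Goose extraction

async def run(urls, conns=50):
    session = asks.Session(connections=conns)
    send_chan, recv_chan = trio.open_memory_channel(conns)
    async with trio.open_nursery() as n:
        n.start_soon(process_pages, recv_chan)
        async with send_chan:
            for url in urls:
                n.start_soon(fetcher, url, session, send_chan.clone())

if __name__ == "__main__":
    trio.run(run, ["https://www.debian.org", "https://www.google.com"])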
Matthias Urlichs
  • Hi, thanks for the reply. How can I add authentication to the same request? Certain URLs require a username and password. – newuser Jul 01 '19 at 10:27
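On the authentication question: asks ships BasicAuth and DigestAuth helpers that can be passed per request via the auth keyword. A minimal sketch, assuming HTTP basic auth; the URL and credentials are placeholders:

import asks
asks.init('trio')

import trio

async def fetch_protected():
    s = asks.Session(connections=1)
    # BasicAuth takes a (username, password) tuple; DigestAuth is used
    # the same way; credentials and URL here are placeholders
    r = await s.get('https://httpbin.org/basic-auth/user/passwd',
                    auth=asks.BasicAuth(('user', 'passwd')))
    print(r.status_code)

if __name__ == "__main__":
    trio.run(fetch_protected)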