SgmlLinkExtractor not extracting the link

Question

I'm new to Python and Scrapy. I'm trying to scrape a list of websites that are linked from a main site.

The main site is in the format of http://www.example.com/something.aspx, and the sub-sites/child sites that I would like SgmlLinkExtractor to extract are in the format of http://www.example.com/something.aspx?ac=N123&dc=123, where the values after ac= and dc= change from link to link. As a regex, I write them as http://www\.example\.com/something\.aspx\?ac=[A-Za-z\d]+&dc=\d+
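
One way to check the pattern on its own, outside Scrapy, is to run it through Python's `re` module against a sample URL. This is only a sketch: the two URLs are the illustrative ones above, and `CHILD_URL_RE` is a name made up for the example.

```python
import re

# Pattern from the question, written as a raw string so the backslashes
# reach the regex engine unchanged. CHILD_URL_RE is a hypothetical name.
CHILD_URL_RE = re.compile(
    r"http://www\.example\.com/something\.aspx\?ac=[A-Za-z\d]+&dc=\d+"
)

sample_child = "http://www.example.com/something.aspx?ac=N123&dc=123"
sample_main = "http://www.example.com/something.aspx"

print(bool(CHILD_URL_RE.search(sample_child)))  # True
print(bool(CHILD_URL_RE.search(sample_main)))   # False
```

If the pattern matches here but the extractor still returns nothing, the URL string the extractor compares against may differ from the one shown in the page source.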

I tried SgmlLinkExtractor from the Scrapy shell. First,

    >>> link = SgmlLinkExtractor()
    >>> link.extract_links(response)

That way, I get all the links on the page.

If I use

    >>> link = SgmlLinkExtractor(allow=("http://www\.example\.com/something\.aspx\?"))
    >>> link.extract_links(response)

I still get all the links starting with http://www.example.com/something.aspx?. However, if I try

    >>> link = SgmlLinkExtractor(allow=("http://www\.example\.com/something\.aspx\?ac=[A-Za-z\d]+&dc=\d+"))
    >>> link.extract_links(response)

or even

    >>> link = SgmlLinkExtractor(allow=("http://www\.example\.com/something\.aspx\?ac"))
    >>> link.extract_links(response)

I get nothing: the result is an empty list, []. Any idea how to solve this?

EDIT

I tried again by using

    >>> link = SgmlLinkExtractor(allow=(r"ac"))
    >>> link.extract_links(response)

This works, but

    >>> link = SgmlLinkExtractor(allow=(r"ac=[A-Za-z\d]+&dc=\d+"))
    >>> link.extract_links(response)

still does not work. I think the problem is probably due to the ? and & in the URL. Do I need an escape character or anything else to make the link extractor work properly when the pattern passed to allow contains ? and &? Essentially, I need &dc... to be included.
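
As a rough check of the escaping question, the snippet below uses only Python's `re` module (not Scrapy) to show how `?` and `&` behave inside a pattern. The URL is the illustrative one from above, and the variable names are made up for the example.

```python
import re

url = "http://www.example.com/something.aspx?ac=N123&dc=123"

# An unescaped '?' is a quantifier: 'aspx?ac' means 'asp', an optional 'x',
# then 'ac', so it does not match the literal '?' in the URL.
unescaped = re.search(r"something\.aspx?ac", url)

# An escaped '\?' matches a literal question mark.
escaped = re.search(r"something\.aspx\?ac", url)

# '&' is an ordinary character in a pattern and needs no escaping.
ampersand = re.search(r"ac=[A-Za-z\d]+&dc=\d+", url)

print(unescaped is None)   # True
print(escaped.group(0))    # something.aspx?ac
print(ampersand.group(0))  # ac=N123&dc=123
```

So, at the level of the regex itself, `\?` and a plain `&` appear to be sufficient, which suggests the cause may lie in the URL string the extractor actually compares against rather than in the escaping.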

One workaround is to use restrict_xpaths, but I am hoping it is actually possible to include ? and & in the allow parameter.

Ok, this probably has something to do with the order of the URL parameters. What if you try (just to test): `"http://www\.example\.com/something\.aspx\?dc"` (dc instead of ac)? — alecxe, Apr 10 '14 at 14:50
No, it's not working. I tried a few more things and have edited the question. — Richard Wong, Apr 10 '14 at 15:01
The original URL might be a bit confusing, but I think it may help people answer this question. The main site is `http://www.oxontime.com/naptan.aspx?t=districts&dc=146&ac=96&x=&y=&format=xhtml`, and the child sites are `http://www.oxontime.com/naptan.aspx?ac=96&dc=146&format=xhtml&t=localities&vc=N0076640&x=451740&y=204570`, `http://www.oxontime.com/naptan.aspx?ac=96&dc=146&format=xhtml&t=localities&vc=N0076642&x=450800&y=209100`, `http://www.oxontime.com/naptan.aspx?ac=96&dc=146&format=xhtml&t=localities&vc=E0020737&x=455570&y=207910` — Richard Wong, Apr 10 '14 at 15:28
I was trying to simplify the question by using example.com; in the original URL, the variables are `vc`, `x` and `y`. — Richard Wong, Apr 10 '14 at 15:30
FYI: There is some more information on [using regular expressions to match urls](http://stackoverflow.com/a/190405/2736496) in the [Stack Overflow Regular Expression FAQ](http://stackoverflow.com/a/22944075/2736496), listed under "Common Validation Tasks > Internet". — aliteralmind, Apr 10 '14 at 15:56
Thanks, [aliteralmind](http://stackoverflow.com/users/2736496/aliteralmind)! That's useful. I tried to search for that kind of thing but couldn't find it. I'll look at it and check whether there is anything I can use to solve the problem. — Richard Wong, Apr 10 '14 at 19:48
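
Following the parameter-order idea raised in the comments, one possible approach is to compare parsed query parameters instead of the raw URL string, so the check does not depend on the order in which the parameters appear. The sketch below uses Python 3's `urllib.parse` (in Python 2 the same functions live in the `urlparse` module). The function name `is_locality_link` is hypothetical, and the rule it applies (`t=localities` plus a `vc` parameter) is only an assumption drawn from the example URLs above.

```python
from urllib.parse import urlsplit, parse_qs


def is_locality_link(url):
    """Return True if the URL looks like a child page, whatever the parameter order.

    Assumption: child pages use /naptan.aspx with t=localities and a vc value.
    """
    parts = urlsplit(url)
    # keep_blank_values keeps empty parameters such as 'x=' in the result.
    params = parse_qs(parts.query, keep_blank_values=True)
    return (
        parts.path == "/naptan.aspx"
        and params.get("t") == ["localities"]
        and "vc" in params
    )


main_url = (
    "http://www.oxontime.com/naptan.aspx"
    "?t=districts&dc=146&ac=96&x=&y=&format=xhtml"
)
child_url = (
    "http://www.oxontime.com/naptan.aspx"
    "?ac=96&dc=146&format=xhtml&t=localities&vc=N0076640&x=451740&y=204570"
)
# Same parameters as child_url, in a different order.
reordered_child_url = (
    "http://www.oxontime.com/naptan.aspx"
    "?t=localities&vc=N0076640&dc=146&ac=96&x=451740&y=204570&format=xhtml"
)

print(is_locality_link(main_url))             # False
print(is_locality_link(child_url))            # True
print(is_locality_link(reordered_child_url))  # True
```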
