
I'm new to Python and Scrapy. I'm trying to scrape a list of pages that are linked from a main site.

The main site has the format http://www.example.com/something.aspx, and the child pages that I would like SgmlLinkExtractor to extract have the format http://www.example.com/something.aspx?ac=N123&dc=123, where the values after ac= and dc= change from link to link. As a regex, I therefore write them as http://www\.example\.com/something\.aspx\?ac=[A-Za-z\d]+&dc=\d+
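As a sanity check on the pattern itself, here is a minimal sketch that uses only Python's standard `re` module (not Scrapy) to test the regex against a URL of the form described above. The variable names are illustrative, and the sample URL is the example.com placeholder from the question, not a real page.

```python
import re

# Pattern from the question, written as a raw string so the
# backslashes reach the regex engine unchanged.
PATTERN = re.compile(
    r"http://www\.example\.com/something\.aspx\?ac=[A-Za-z\d]+&dc=\d+"
)

# Placeholder URL in the format described in the question.
sample = "http://www.example.com/something.aspx?ac=N123&dc=123"

# search() returns a match object, or None if nothing matches.
matched = PATTERN.search(sample) is not None
print(matched)
```

If this matches, the pattern is fine for URLs that really have this shape, which suggests the links being tested may look different from the assumed format.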

I tried SgmlLinkExtractor in the Scrapy shell. First,

    >>> link = SgmlLinkExtractor()
    >>> link.extract_links(response)

That way, I get all the links on the page.

If I use

    >>> link = SgmlLinkExtractor(allow=("http://www\.example\.com/something\.aspx\?"))
    >>> link.extract_links(response)

I still get all the links starting with http://www.example.com/something.aspx?. However, if I try

    >>> link = SgmlLinkExtractor(allow=("http://www\.example\.com/something\.aspx\?ac=[A-Za-z\d]+&dc=\d+"))
    >>> link.extract_links(response)

or even

    >>> link = SgmlLinkExtractor(allow=("http://www\.example\.com/something\.aspx\?ac"))
    >>> link.extract_links(response)

I get nothing: the result is an empty list, `[]`. Any idea how to solve this?

EDIT

I tried again by using

    >>> link = SgmlLinkExtractor(allow=(r"ac"))
    >>> link.extract_links(response)

This works, but

    >>> link = SgmlLinkExtractor(allow=(r"ac=[A-Za-z\d]+&dc=\d+"))
    >>> link.extract_links(response)

still does not work. I think the problem is probably due to the ? and & in the URL. Do I need an escape character or anything else to make the link extractor work properly when the allow parameter contains ? and &? Essentially, I need &dc... to be included.
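On the escaping question, here is a small sketch of how `?` and `&` behave in Python's standard `re` module. It assumes the `allow` pattern is treated as an ordinary Python regular expression. The `query` string is a made-up fragment in the format from the question.

```python
import re

# Hypothetical URL fragment in the format from the question.
query = "something.aspx?ac=N123&dc=123"

# '&' has no special meaning in a regex, so it matches itself unescaped.
amp_ok = re.search(r"ac=[A-Za-z\d]+&dc=\d+", query) is not None

# '?' is a quantifier: unescaped, 'aspx?ac' means
# "asp, an optional x, then ac", which is not present here.
unescaped = re.search(r"aspx?ac", query) is not None

# Escaped, '\?' matches a literal question mark.
escaped = re.search(r"aspx\?ac", query) is not None

# re.escape() adds whatever backslashes a literal string needs.
auto_escaped = re.search(re.escape("aspx?ac=N123&dc=123"), query) is not None

print(amp_ok, unescaped, escaped, auto_escaped)
```

So only `?` needs a backslash, and the patterns in the question already escape it. By this reasoning, escaping alone is unlikely to explain the empty result.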

One workaround is to use restrict_xpaths, but I am hoping that it is actually possible to include ? and & in the allow parameter.

Richard Wong
  • OK, this probably has something to do with the order of the URL parameters. What if you try (just to test): `"http://www\.example\.com/something\.aspx\?dc"` (dc instead of ac)? – alecxe Apr 10 '14 at 14:50
  • No, it's not working. I tried a few more things and have edited the question. – Richard Wong Apr 10 '14 at 15:01
  • Okay, what about `ac=.*?&dc=.*`? – alecxe Apr 10 '14 at 15:06
  • Same thing. Not working either... – Richard Wong Apr 10 '14 at 15:09
  • ok, what if you just look for `\?(ac|dc)`? – alecxe Apr 10 '14 at 15:16
  • if you can share the original url that would be great – akhter wahab Apr 10 '14 at 15:21
  • the original url might be a bit confusing, but I think it might be more helpful for people to help on this question. The main site is `http://www.oxontime.com/naptan.aspx?t=districts&dc=146&ac=96&x=&y=&format=xhtml`, where the child-sites are `http://www.oxontime.com/naptan.aspx?ac=96&dc=146&format=xhtml&t=localities&vc=N0076640&x=451740&y=204570`, `http://www.oxontime.com/naptan.aspx?ac=96&dc=146&format=xhtml&t=localities&vc=N0076642&x=450800&y=209100`, `http://www.oxontime.com/naptan.aspx?ac=96&dc=146&format=xhtml&t=localities&vc=E0020737&x=455570&y=207910` – Richard Wong Apr 10 '14 at 15:28
  • I was trying to simplify the question by using example.com, so in the original URLs the variables are `vc`, `x` and `y` – Richard Wong Apr 10 '14 at 15:30
  • FYI: There is some more information on [using regular expressions to match urls](http://stackoverflow.com/a/190405/2736496) in the [Stack Overflow Regular Expression FAQ](http://stackoverflow.com/a/22944075/2736496), listed under "Common Validation Tasks > Internet". – aliteralmind Apr 10 '14 at 15:56
  • Thanks, [aliteralmind](http://stackoverflow.com/users/2736496/aliteralmind)! That's useful, tried to search for those kind of things but couldn't find it. I'll look at it and try to check if there is anything I can use to solve the problem. – Richard Wong Apr 10 '14 at 19:48
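
The first comment raises the possibility that parameter order matters, and the main-site URL quoted above does list `dc` before `ac`. The following is a minimal sketch, using only Python's `re` module, of how an order-dependent pattern compares with an order-independent one built from lookaheads. Whether the link extractor actually sees the child links with the parameters in a different order is an assumption, not something confirmed in this thread. The names `ORDERED` and `ANY_ORDER` are illustrative.

```python
import re

# Order-dependent pattern from the question: 'ac' must come right before 'dc'.
ORDERED = re.compile(r"ac=[A-Za-z\d]+&dc=\d+")

# Order-independent sketch: after the '?', two lookaheads each require
# one parameter, either first in the query string or right after an '&'.
ANY_ORDER = re.compile(
    r"\?(?=(?:.*&)?ac=[A-Za-z\d]+)(?=(?:.*&)?dc=\d+)"
)

# URLs quoted in the comments above.
main_url = ("http://www.oxontime.com/naptan.aspx"
            "?t=districts&dc=146&ac=96&x=&y=&format=xhtml")
child_url = ("http://www.oxontime.com/naptan.aspx"
             "?ac=96&dc=146&format=xhtml&t=localities"
             "&vc=N0076640&x=451740&y=204570")

ordered_main = ORDERED.search(main_url) is not None      # dc precedes ac here
ordered_child = ORDERED.search(child_url) is not None
any_main = ANY_ORDER.search(main_url) is not None
any_child = ANY_ORDER.search(child_url) is not None

print(ordered_main, ordered_child, any_main, any_child)
```

If the raw links on the page turn out to list `dc` before `ac`, a pattern along the lines of `ANY_ORDER` could be tried in `allow`. Printing the unfiltered links from the first experiment would show which order they actually use.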
