I'm new to Python and Scrapy. I'm trying to scrape a list of websites linked from a main site.
The main site is in the format http://www.example.com/something.aspx, and the sub-sites/children sites that I would like SgmlLinkExtractor to extract are in the format http://www.example.com/something.aspx?ac=N123&dc=123, where the values after ac= and dc= change between links. In regex, I therefore write the pattern as http://www\.example\.com/something\.aspx\?ac=[A-Za-z\d]+&dc=\d+
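As a sanity check outside Scrapy, the pattern itself can be tested with plain Python's re module (the URL below is just a made-up example in the format described):

```python
import re

# Raw string so the backslashes reach the regex engine unchanged
pattern = r"http://www\.example\.com/something\.aspx\?ac=[A-Za-z\d]+&dc=\d+"

# Hypothetical URL of the shape described above
url = "http://www.example.com/something.aspx?ac=N123&dc=123"

print(bool(re.search(pattern, url)))  # True, so the regex itself is fine
```

So the regex matches the target URL format on its own; the problem seems to be somewhere in how the extractor applies it.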
I tried to use SgmlLinkExtractor from the Scrapy shell. First,
>>> link = SgmlLinkExtractor()
>>> link.extract_links(response)
That way, I manage to get all the links on the page.
If I use
>>> link = SgmlLinkExtractor(allow=("http://www\.example\.com/something\.aspx\?"))
>>> link.extract_links(response)
I can still get all links starting with http://www.example.com/something.aspx?. However, if I try
>>> link = SgmlLinkExtractor(allow=("http://www\.example\.com/something\.aspx\?ac=[A-Za-z\d]+&dc=\d+"))
>>> link.extract_links(response)
or even
>>> link = SgmlLinkExtractor(allow=("http://www\.example\.com/something\.aspx\?ac"))
>>> link.extract_links(response)
I get nothing; the shell just returns []. Any idea how to solve this?
EDIT
I tried again with
>>> link = SgmlLinkExtractor(allow=(r"ac"))
>>> link.extract_links(response)
This works, but
>>> link = SgmlLinkExtractor(allow=(r"ac=[A-Za-z\d]+&dc=\d+"))
>>> link.extract_links(response)
still does not. I suspect the problem is caused by the ? and & in the URL. Do I need an escape character or something else to make the link extractor work when the allow pattern contains ? and &? Essentially, I need the &dc... part to be included.
One workaround is to use restrict_xpaths, but I am hoping it is actually possible to include ? and & in the allow pattern.
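For reference, one guess I have not ruled out (this is only an assumption about the cause, not something I have confirmed in Scrapy's source): in raw HTML an & inside an href is normally encoded as &amp;, so any matching done against un-unescaped markup would never see a literal &. That effect is easy to demonstrate with plain re, independent of Scrapy:

```python
import re
from html import unescape

# Hypothetical snippet of page source; note the &amp; entity in the href
raw = '<a href="http://www.example.com/something.aspx?ac=N123&amp;dc=123">link</a>'

pattern = r"ac=[A-Za-z\d]+&dc=\d+"

# Against the raw markup, the literal & is actually &amp;, so there is no match
print(re.search(pattern, raw))            # None

# After unescaping the entities, the same pattern matches
print(bool(re.search(pattern, unescape(raw))))  # True
```

If the extractor matches against already-unescaped URLs, this would not be the explanation, so treat it as one possibility to check rather than the answer.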