I am trying to scrape the page http://apps.leg.wa.gov/wac/default.aspx?cite=296-17A&full=true to get output of the following form:

'6903-03' : u'Aerial spraying, seeding, crop dusting, or firefighting' ,
'6510-00' : u'Domestic servants/home care assistants employed in or about the private residence of a home owner' ,
'1407-00' : u'Bus companies' ,

I am using Scrapy for this, with the following XPath:

response.xpath('//*[@id="ctl00_ContentPlaceHolder1_dlSectionContent"]/span/div/span/text()').extract()

It works, but it also returns some unwanted lines, such as these:

u'which also provide farm kill operations away from the custom meat shop',
u'Farm kill operations',
u'only',
u'no farm kill',
u'only',
u'4302-16 Farm kill',
u'exclusively',
u'only; ',
u'only',
u'no farm kill',
u'including farm kill',

One approach I considered is to run a regex over each line and keep only the lines that match the code pattern, e.g. u'(?:\d{2}){2}-(?:\d{1}){2} [A-Za-z ]*'
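For reference, a minimal sketch of that regex filter in Python. The function name `parse_code_lines` and the simplified pattern `\d{4}-\d{2}` are illustrative assumptions, not part of the original spider. Note that a line such as '4302-16 Farm kill' from the unwanted list still matches, so the regex alone cannot separate every case.

```python
import re

# Assumed pattern: a code such as "6903-03", whitespace, then the description.
# DOTALL lets the description span a line break inside a single text node.
CODE_LINE = re.compile(r"^(\d{4}-\d{2})\s+(.+)$", re.DOTALL)


def parse_code_lines(lines):
    """Return {code: description} for the lines that start with a code."""
    result = {}
    for line in lines:
        match = CODE_LINE.match(line.strip())
        if match is None:
            continue  # fragments such as "only" or "no farm kill" are skipped
        code, description = match.groups()
        # Collapse internal whitespace and newlines into single spaces.
        result[code] = " ".join(description.split())
    return result


if __name__ == "__main__":
    extracted = [
        "6903-03 Aerial spraying, seeding, crop dusting, or firefighting",
        "only",
        "no farm kill",
        "1407-00 Bus companies",
    ]
    print(parse_code_lines(extracted))
```

In a spider, the list returned by `.extract()` would be passed in place of `extracted`.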

Is there a better or cleaner way to identify such spans?

PS: The spans don't have any classes, only a style attribute. I am not sure whether I can use the styles to identify the required spans.

srikavineehari
Ankuj

1 Answer

The XPath can be made more specific: anchor on the h3 tags and select the first div sibling that follows each one:

'//*[@id="ctl00_ContentPlaceHolder1_dlSectionContent"]/descendant::h3/following-sibling::div[1]/span/text()'

This can be tested on Linux/Cygwin with:

xmllint --recover --html --xpath '//*[@id="ctl00_ContentPlaceHolder1_dlSectionContent"]/descendant::h3/following-sibling::div[1]/span' ~/tmp/test.html| sed -re 's%<span style=[^>]+>([^<]+)</span>%\1\n%g' | less

Sample output:

0101-00 Land clearing: Highway, street and road construction, N.O.C.
0103-09 Drilling or blasting: N.O.C.
0104-12 Dredging, N.O.C.
Luis Muñoz