Thursday, 8 August 2013

How to set a rule using regex in Scrapy for extracting URLs?

I want to crawl pages related to Disney on the Bloomberg website. The URLs
follow this pattern:
"http://bloomberg.com/news/2013-07-08/disney-welcometohomepageofdisney"
So I have written the rule below for it:
rules = [
    Rule(SgmlLinkExtractor(allow=('/news/*/disney*',)), follow=True),
]
But the above rule doesn't work as I want: the crawled pages in the output
are not related to Disney. Please help me fix this rule.
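
One thing worth checking: the allow argument takes regular expressions, not shell-style globs, and each one is searched for in the full URL. In a regex, "/*" means "zero or more slashes" and "y*" means "zero or more y characters", so '/news/*/disney*' does not mean "anything between /news/ and /disney". The sketch below tests the patterns with Python's re module outside of Scrapy. The date-shaped pattern is an assumption based only on the single sample URL above, and OTHER_URL is a made-up non-Disney address used for comparison.

```python
import re

# Sample URL taken from the question.
SAMPLE_URL = "http://bloomberg.com/news/2013-07-08/disney-welcometohomepageofdisney"
# Hypothetical non-Disney URL, invented here only for comparison.
OTHER_URL = "http://bloomberg.com/news/2013-07-08/apple-earnings"

# The pattern from the question, written as if it were a glob.
GLOB_STYLE = r'/news/*/disney*'
# A regex alternative; assumes the path segment after /news/ is always
# a YYYY-MM-DD date, as in the sample URL.
REGEX_STYLE = r'/news/\d{4}-\d{2}-\d{2}/disney'


def allowed(pattern, url):
    """Return True if the pattern is found anywhere in the URL."""
    return re.search(pattern, url) is not None


print(allowed(GLOB_STYLE, SAMPLE_URL))   # False: "/*" only matches slashes
print(allowed(REGEX_STYLE, SAMPLE_URL))  # True
print(allowed(REGEX_STYLE, OTHER_URL))   # False

# If the pattern behaves as wanted, it could replace the one in the rule:
# Rule(SgmlLinkExtractor(allow=(REGEX_STYLE,)), follow=True)
```

If the segment after /news/ is not always a date, a looser pattern such as r'/news/[^/]+/disney' (any single path segment) may fit better.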
