
This might be one of those questions that are difficult to answer, but here goes:

I don't consider myself a programmer - but I would like to be :-) I've learned R because I was sick and tired of SPSS, and because a friend introduced me to the language - so I am not a complete stranger to programming logic.

Now I would like to learn python - primarily to do screen scraping and text analysis, but also for writing webapps with Pylons or Django.

So: how should I go about learning to screen scrape with Python? I started going through the Scrapy docs, but I feel too much "magic" is going on - after all, I am trying to learn, not just do.

On the other hand: there is no reason to reinvent the wheel, and if Scrapy is to screen scraping what Django is to webpages, then it might after all be worth jumping straight into Scrapy. What do you think?

Oh - BTW, about the kind of screen scraping: I want to scrape newspaper sites (i.e. fairly complex and big) for mentions of politicians etc. That means I will need to scrape daily, incrementally and recursively, and I need to log the results into a database of sorts. Which leads me to a bonus question: everybody is talking about NoSQL databases. Should I learn to use e.g. MongoDB right away (I don't think I need strong consistency), or is that foolish for what I want to do?

Thank you for any thoughts - and I apologize if this is too general to be considered a programming question.

Andreas
    One thing that helps when creating good scrapers is knowledge of HTTP and the web (cookies, redirections, ...) ;) – Oscar Mederos Dec 04 '10 at 09:27
  • Not a direct answer to your question, but you might want to check out this video: https://www.youtube.com/watch?v=52wxGESwQSA It covers a lot of the more advanced topics of screen scraping. It comes at them from the perspective of Python, but for the most part it deals in theory and is largely language-agnostic. – Ape-inago Aug 30 '14 at 13:28

6 Answers


I agree that the Scrapy docs can give off that impression. But I believe, as I found for myself, that if you are patient with Scrapy, go through the tutorials first, and then bury yourself in the rest of the documentation, you will not only start to understand the different parts of Scrapy better, but you will appreciate why it does what it does the way it does. It is a framework for writing spiders and screen scrapers in the real sense of the word. You will still have to learn XPath, but I find that it is best to learn it regardless. After all, you do intend to scrape websites, and an understanding of what XPath is and how it works is only going to make things easier for you.
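To give a taste of what XPath selection feels like, here is a tiny sketch using the limited XPath subset in Python's standard-library `xml.etree.ElementTree` (Scrapy's selectors support full XPath via lxml; the HTML snippet and class names here are made up for illustration):

```python
# A taste of XPath-style selection, using the limited XPath subset
# supported by the stdlib ElementTree (real pages need lxml/Scrapy).
import xml.etree.ElementTree as ET

html = """
<html><body>
  <div class="article">
    <h1>Politician A visits factory</h1>
    <p class="byline">By Reporter One</p>
  </div>
  <div class="article">
    <h1>Politician B announces reform</h1>
    <p class="byline">By Reporter Two</p>
  </div>
</body></html>
"""

root = ET.fromstring(html)
# .//div[@class='article']/h1 : every <h1> directly inside an article <div>
headlines = [h1.text for h1 in root.findall(".//div[@class='article']/h1")]
print(headlines)  # ['Politician A visits factory', 'Politician B announces reform']
```

The same expression works almost verbatim in a Scrapy selector, which is why learning XPath pays off regardless of which tool you end up with.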

Once you have, for example, understood the concept of pipelines in Scrapy, you will be able to appreciate how easy it is to do all sorts of things with scraped items, including storing them in a database.
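To illustrate the idea (not Scrapy's actual classes - the table and field names here are invented), a pipeline is essentially just a class with a `process_item()` hook that Scrapy calls for every item a spider yields, so persisting items is one small class:

```python
# A hedged sketch of the Scrapy pipeline concept as plain Python:
# a class with a process_item() hook, here writing items to SQLite.
import sqlite3

class SQLitePipeline:
    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS mentions (url TEXT, headline TEXT)"
        )

    def process_item(self, item, spider=None):
        # Scrapy calls this once for every item the spider yields.
        self.conn.execute(
            "INSERT INTO mentions VALUES (?, ?)",
            (item["url"], item["headline"]),
        )
        self.conn.commit()
        return item  # pass the item on to the next pipeline stage

pipeline = SQLitePipeline()
pipeline.process_item({"url": "http://example.com/a", "headline": "Politician A"})
rows = pipeline.conn.execute("SELECT headline FROM mentions").fetchall()
print(rows)  # [('Politician A',)]
```

In a real project you would register such a class in the project settings and let the framework drive it.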

BeautifulSoup is a wonderful Python library that can be used to scrape websites. But, in contrast to Scrapy, it is not a framework by any means. For smaller projects, where you don't need to invest time in writing a proper spider or deal with scraping large amounts of data, you can get by with BeautifulSoup. But for anything bigger, you will begin to appreciate the sorts of things Scrapy provides.
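For the very smallest jobs you can even skip third-party libraries entirely; this sketch uses only the stdlib `html.parser` to collect links (real-world pages are messier, which is where BeautifulSoup's forgiving parsing earns its keep):

```python
# Minimal link extraction with only the standard library, to show
# the scale of job where no framework is needed at all.
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

collector = LinkCollector()
collector.feed('<p>See <a href="/news">news</a> and <a href="/sport">sport</a>.</p>')
print(collector.links)  # ['/news', '/sport']
```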

ayaz
  • That's a good answer, ayaz. Thank you. I will visit the Scrapy docs again tomorrow (in Denmark it's past midnight already). – Andreas Dec 01 '10 at 23:50

It looks like Scrapy uses XPath for DOM traversal, which is a language in itself and may feel somewhat cryptic for a while. I think BeautifulSoup will give you a faster start. With lxml you'll have to invest more time learning, but it is generally considered (not only by me) a better alternative to BeautifulSoup.

For the database, I would suggest starting with SQLite and using it until you hit a wall and need something more scalable (which may never happen, depending on how far you want to go with it), at which point you'll know what kind of storage you need. MongoDB is definitely overkill at this point, but getting comfortable with SQL is a very useful skill.
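As a hedged sketch of that advice (the table and column names are invented for this example): the stdlib `sqlite3` module needs no server, and a primary key on the URL makes your daily incremental runs idempotent, since already-seen articles are simply skipped:

```python
# Starting with SQLite: a file-free demo of incremental, idempotent storage.
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for a real project
conn.execute("""
    CREATE TABLE articles (
        url      TEXT PRIMARY KEY,
        headline TEXT,
        fetched  TEXT
    )
""")

def store(url, headline, fetched):
    # INSERT OR IGNORE skips rows whose url is already stored.
    conn.execute(
        "INSERT OR IGNORE INTO articles VALUES (?, ?, ?)",
        (url, headline, fetched),
    )

store("http://example.com/1", "Politician A resigns", "2010-12-01")
store("http://example.com/1", "Politician A resigns", "2010-12-02")  # re-scrape
count = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
print(count)  # 1 - the duplicate URL was ignored
```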

Here is a five-line example I gave some time ago to illustrate how BeautifulSoup can be used: Which is the best programming language to write a web bot?

cababunga
  • That's a very cool five-line example. As per ayaz's answer, I think a framework solution might be the way forward for me - but for simple jobs on simple webpages, your example is simply sweet. Thank you. Also thank you for the DB advice. – Andreas Dec 01 '10 at 23:51
  • cababunga: I decided to accept ayaz's answer, but really it was you and ayaz together who made me go for Scrapy - you each gave different reasons. And I am very happy with your BeautifulSoup example. – Andreas Dec 02 '10 at 14:33
  • lxml considered a better alternative to BeautifulSoup? I've used lxml a bit and BeautifulSoup a lot and I find BeautifulSoup much more friendly to use. Granted, it doesn't have the compactness of XPaths, but it's marvellous to work with. And because you're *really* working in Python, some things not possible with XPaths become much simpler in BeautifulSoup than with lxml. – Chris Morgan Dec 03 '10 at 04:31

I really like BeautifulSoup. I'm fairly new to Python but found it fairly easy to start screen scraping. I wrote a brief tutorial on screen scraping with BeautifulSoup. I hope it helps.

Omer Khan
    Your tutorial is so brief that you should include it here as an answer. –  Sep 27 '12 at 16:02

As for the database part of the question: use the right tool for the job. Figure out what you want to do, how you want to organize your data, what kind of access you need, and so on. Then decide whether a no-sql solution works for your project.

I think no-sql solutions are here to stay for a variety of applications. On various projects I've worked on over the last 20 years we implemented them inside SQL databases without calling them no-sql, so the use cases certainly exist. It's worth at least getting some background on what they offer and which products are working well to date.

Design your project well and keep the persistence layer separate, and you should be able to change your database solution with only minor heartache if you decide that's necessary.
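A minimal sketch of what "keep the persistence layer separate" can mean in practice (all class and method names here are illustrative, not from any particular library): the scraper talks only to a small storage interface, so swapping SQLite for MongoDB later means writing one new store class.

```python
# The scraper depends only on the store's interface, not its backend.
class InMemoryStore:
    """Stand-in backend; a SQLiteStore or MongoStore would expose the same API."""
    def __init__(self):
        self._mentions = []

    def save_mention(self, politician, url):
        self._mentions.append((politician, url))

    def mentions_of(self, politician):
        return [url for name, url in self._mentions if name == politician]

def record_scrape(store, politician, url):
    # Scraping code calls the interface; it never issues SQL or Mongo queries.
    store.save_mention(politician, url)

store = InMemoryStore()
record_scrape(store, "Politician A", "http://example.com/a")
print(store.mentions_of("Politician A"))  # ['http://example.com/a']
```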

Marvo

I recommend starting lower level while learning, since Scrapy is a high-level framework. Read a good Python book like Dive Into Python, then look at lxml for parsing HTML.

hoju

Before diving into Scrapy, take Udacity's Introduction to Computer Science: https://www.udacity.com/course/cs101

That's a great way to familiarize yourself with Python, and you will learn Scrapy a lot faster once you have some basic knowledge of Python.

Jaakko