How To Build a Web Spider On Linux

Posted by kdawson on Wed Nov 15, 2006 02:13 AM
from the five-eyes dept.
IdaAshley writes, "Web spiders are software agents that traverse the Internet gathering, filtering, and potentially aggregating information for a user. This article shows you how to build spiders and scrapers for Linux to crawl a Web site and gather information, stock data, in this case. Using common scripting languages and their collection of Web modules, you can easily develop Web spiders."
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • Hmm... (Score:5, Funny)

    by joe_cot (1011355) on Wednesday November 15 2006, @02:15AM (#16849238) Homepage
    Yes, but does it run on ... damn.
    • ... the internet?
    • So that's what they're called. I've been building them for years, both for personal data collection and for research for professors I work for (I have a couple of acknowledgements to this effect). I've been calling them 'site scrapers' and 'data reapers'.

      And I generally write 'em in PHP. Makes 'em nice and lightweight to redistribute (php.exe and php5ts.dll are usually all that's needed. Sometimes php_http.dll as well.)
      • You must have tons of time on your hands for those crawlers ....

        A modern crawler has to overcome very annoying problems like nslookup delays and network lag caused by third parties. If you can write it in a threaded environment, good for you; if you can drop the "single scope" altogether and go for a select-based or, even better, epoll-based version that can crawl thousands of sites at a time, better still.
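A minimal Python sketch of the event-driven approach described above, so that slow lookups and slow servers overlap instead of adding up. The `fetch` coroutine is a hypothetical stand-in that only sleeps to simulate latency; a real crawler would put a non-blocking HTTP request there. The URLs and the concurrency limit are assumptions too.

```python
import asyncio

async def fetch(url: str, delay: float = 0.05) -> str:
    # Hypothetical placeholder for a non-blocking HTTP request.
    # The sleep simulates DNS and network latency.
    await asyncio.sleep(delay)
    return f"fetched {url}"

async def crawl(urls, limit: int = 100):
    # Cap the number of requests in flight at once.
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url):
        async with semaphore:
            return await fetch(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(u) for u in urls))

urls = [f"http://example.com/page{i}" for i in range(20)]
results = asyncio.run(crawl(urls))
```

All twenty simulated fetches wait concurrently on a single thread, so the run takes roughly one delay, not twenty.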

        For simple tasks even the ithreads of Perl would do. But I'd suggest a language that supports
        • You must have tons of time on your hands for those crawlers ....

          You can make any program hard by increasing the generality and performance requirements, but there's nothing inherently difficult about screen-scraping from a web site. I've written a few scripts to extract data from web sites, and they're quite simple if your aims are modest. The first crawler I wrote was also my first Perl project, my first time using HTTP, and my first time dealing with HTML. Given a date, it generated a URL,

      • Re: (Score:3, Interesting)

        PHP lightweight? Ha!

        The PHP interpreter is over 5 megabytes in size. And it isn't thread-safe. That's a lot of memory overhead for a program that's going to be blocking on I/O most of the time, seeing how you'll have to fork() a new process for each new "thread" you want.

        Also, languages like Perl and Python have binaries that are about 1 megabyte in size. Now, while they'll probably need to load in extra files for most practical applications, these extra files are typically small. Most importantly, Per
  • Crawling efficiently (Score:5, Informative)

    by BadAnalogyGuy (945258) <BadAnalogyGuy@gmail.com> on Wednesday November 15 2006, @02:21AM (#16849262)
    Their example of a web crawler uses a queue to hold links. Since a link may appear twice, they use a lookup to scan the queue to see if the link is already loaded, and discard it if so.

    Better to use an associative array to cache the links, since its lookup is O(1) on average. The queue's lookup time is O(n), so it grows with the queue, and because every discovered link is checked, the total cost over n links is O(n^2) in the worst case. A hash (associative array) performs the same n checks in O(n) total. /([\W_\-]@\W+)/gs
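A minimal Python sketch of that idea: keep the FIFO queue for crawl order and add a hash set beside it for the "already seen" check. The `SITE` dictionary is a hypothetical link graph standing in for fetched pages, and `crawl_order` and `links_of` are illustrative names, not anything from the article.

```python
from collections import deque

def crawl_order(start, links_of):
    """Return URLs in breadth-first order, visiting each one once."""
    queue = deque([start])  # FIFO frontier: O(1) append and popleft
    seen = {start}          # hash set: O(1) average membership test
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in links_of(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

# Hypothetical link graph; a real spider would fetch and parse each page.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/"],
    "/b": ["/c", "/a"],
    "/c": [],
}
order = crawl_order("/", lambda url: SITE.get(url, []))
```

The queue still decides the crawl order; the set only answers "have we seen this link", so the duplicate check no longer scans the queue.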
    • Re: (Score:2, Insightful)

      by Anonymous Coward
      Python has a builtin set type. Have no idea why they did not use it.
      • Re: (Score:3, Interesting)

        Maybe because they don't know the first thing about efficiency? You'd be surprised how much programmers don't know/care about efficiency. Once, incidentally also on a crawler (student project), I improved the function reading a tree of URLs from 1 hour(!) to 0.1 seconds! The guy tested it on an example with 10 URLs and it worked, but his implementation was O(n^2) and involved copying huge amounts of memory each step. Don't ask me how he thought this would be scalable.
        • Maybe because they don't know the first thing about efficiency? You'd be surprised how much programmers don't know/care about efficiency.


          If you're surprised about programmers not knowing/caring about efficiency, do you actually use a computer?
    • Tell me then; how do they make a O(1) FIFO queue out of the associative array?

      No, not really interested in the answer, as I'm just pointing out that the code suddenly becomes (unnecessarily) much more complicated.
    • I know. Also, I'm not exactly certain why they used Ruby.

      My favorite method is to use PHP as a backend for mshta; you can be guaranteed it'll run on any Windows machine, and you have the benefit that a linux machine will at least be able to run the back-end.
  • by dave562 (969951) on Wednesday November 15 2006, @02:21AM (#16849264) Journal
    They want their technology back.
      • Has there ever been a news story on Slashdot that doesn't have an "I, for one, welcome our new [Insert here] overlords" comment attached to it?

  • Why would anyone have a need to write a simple spider nowadays? In 2006, there has to be a better way than just following links. For example, it would be interesting to see something that crawled the various social bookmarking sites and correlated the various terms. For example, User A on Delicious and User B on Stumble Upon both bookmark a link about Pink Floyd and another one about Led Zep. If I'm searching for something about Floyd, the system could recommend some cool info about Led Zep too. (Email me if
    • Actually... (Score:4, Interesting)

      by SanityInAnarchy (655584) <ninja@slaphack.com> on Wednesday November 15 2006, @03:52AM (#16849602) Journal
      Some websites do not have good search functionality. Sometimes it's an area that Google doesn't crawl (robots.txt and such), and sometimes I'm looking for something very, very specific.

      Regardless, I do, in fact, build spiders. For instance, in an MMO I play, all users can have webpages, so it's very useful to have a spider as part of a clan/guild/whatever to crawl the webpages looking for users who have illegal items and such. In a more general way, there is a third-party site which collects vital statistics of everyone who puts those in their user page, so you can get lists of the most powerful people in the game, the richest people, etc.
        • Not in this game. The pages are not actually user-created, they are generated daily from the actual game data and hosted by the company who runs the game. The only thing you have control over is whether your inventory/bank/whatever appears on the page, and I can just as easily scan for people who refuse to list them.

          So, it's actually much more efficient to scan for a specific string that I know will be there for a particular item -- it's literally impossible for them to try to mask it with, say, leetspeak.
    • One might want to study social networks. What better way to do this than to make a graph (as in the nodes and edges type) of myspace or facebook and study that? How are you going to do this? Well, seems like making a spider would be a quite sensible way.
    • Why would anyone have a need to write a simple spider nowadays?

      You're right. Web 2.0 changes everything. Some people are just conservative, though. My parents are still using bookshelves even though maglev trains made bookshelves obsolete decades ago.

      In 2006, there has to be a better way than just following links. For example, it would be interesting to see something that crawled the various social bookmarking sites and correlated the various terms.

      You mean follow links and *gasp* do something with t

  • downloads (Score:5, Informative)

    by Bananatree3 (872975) on Wednesday November 15 2006, @02:30AM (#16849298)

    for those of us who don't have them, here are the basics:



    Wget: http://www.gnu.org/software/wget/ [gnu.org].

    Curl http://curl.haxx.se/ [curl.haxx.se]
  • by h_benderson (928114) on Wednesday November 15 2006, @02:57AM (#16849384)
    All my love for Linux aside, this has nothing to do with Linux, the kernel (or even GNU/Linux, the OS). It works just as well on any other Unix derivative, or even Windows.
  • some points (Score:5, Interesting)

    by cucucu (953756) on Wednesday November 15 2006, @02:59AM (#16849396)
    • Don't forget to check and respect robots.txt [robotstxt.org]. Python [python.org] has a module [python.org] that helps you parse that file
    • Samie [sourceforge.net] and its Python port Pamie [sourceforge.net] are your friends. You can automate IE so your script is treated as a human and not discriminated against as a robot.
    • I use such beasts to do one-click time reporting at work and one-click cartoon collecting in my favorite newspaper.
    • And once I even repeatedly voted on an online poll and changed the course of history.
    • Ah, yes, TFA was about building a spider on Linux. I didn't check if my one-click IE scripts work on IE/Wine/Linux.
    • If I write a one-click script for online shopping, does it infringe the infamous Amazon patent?
    • When will Firefox's automation capabilities match those of IE?
    • You don't want to automate IE. Aside from the fact that it's IE, you don't want to use any browser unless you have to. Mechanize is your friend, and you can always change the user agent string if you want to be a jackass.

      Firefox's automation capabilities don't need to match those of IE, for pretty much the same reason. The only thing Mechanize can't do is JavaScript, and there are vague plans about that.
    • Are you sure that automation still works in IE7?
    • Re:some points (Score:4, Informative)

      by killjoe (766577) on Wednesday November 15 2006, @04:09AM (#16849694)
      "When will Firefox's automation capabilities match those of IE?"

      It's always had it. Look up XUL some day. The entire browser is written in xul.
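On the robots.txt advice earlier in this thread: a minimal sketch using the parser that ships with Python (`urllib.robotparser` in Python 3). The robots.txt text, the user agent name "ExampleSpider", and the URLs are all made up for illustration; a real spider would fetch the site's own file, for instance with the parser's `set_url()` and `read()` methods.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; normally fetched from the target site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def build_parser(robots_txt: str) -> RobotFileParser:
    parser = RobotFileParser()
    # parse() takes the file as a list of lines.
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)
allowed = parser.can_fetch("ExampleSpider", "http://example.com/index.html")
blocked = parser.can_fetch("ExampleSpider", "http://example.com/private/data.html")
```

Calling `can_fetch()` before every request is cheap, so the check can sit right next to the code that pops the next URL off the queue.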
  • Dammit, I was hoping this article was about the evolution of Dr Weird's phone spiders, mechanical creatures that could be sent down your cable line to maul anyone sending you phishing emails and spam.
  • Oh sweet Jesus! (Score:3, Insightful)

    by msormune (808119) on Wednesday November 15 2006, @03:25AM (#16849474)
    Pull the article out. The last thing we need is more indexing bots.
  • As the two students who started a little web search company wrote, crawling the web is not trivial: http://infolab.stanford.edu/~backrub/google.html [stanford.edu]. An excerpt follows.

    Running a web crawler is a challenging task. There are tricky performance and reliability issues and even more importantly, there are social issues. Crawling is the most fragile application since it involves interacting with hundreds of thousands of web servers and various name servers which are all beyond the control of the system.

    In order to sca

    • Also, because of the huge amount of data involved, unexpected things will happen. For example, our system tried to crawl an online game. This resulted in lots of garbage messages in the middle of their game! It turns out this was an easy problem to fix.

      Unfortunately, many web developers still ignore the inevitable, leaving their sites vulnerable to the dreaded Googlebot "attack". While most of the spider developer manuals (TFA included) stress the importance of being polite (respect robots.txt & friend

  • I've never programmed in Ruby, but I think the comment in Listing 1 says it all:
    "Iterate through response hash"

    Why would somebody want to do that?
    A quick net search "reveals": A simple resp["server"] is all you need.
    Maybe the article was meant to be posted on thedailywtf.com?
  • by rduke15 (721841) <rduke15@gmai l . c om> on Wednesday November 15 2006, @03:48AM (#16849582)

    Basically, the article gives you ruby and python examples of how to get web pages, and (badly) parse them for information. The same thing everyone has been doing for at least a decade with Perl and the appropriate modules, or whatever other tools, except that most know how to do it correctly.

    The first script is merely ridiculous: 12 lines of code (not counting empty and comment lines) to do:

    HEAD slashdot.org | grep 'Server: '

    But it gets worse. To extract a quote from a page, the second script suggests this:

    stroffset = resp.body =~ /class="price">/
    subset = resp.body.slice(stroffset+14, 10)
    limit = subset.index('<')
    print ARGV[0] + " current stock price " + subset[0..limit-1] +
    " (from stockmoney.com)\n"

    You don't need to know ruby to see what it does: it looks for the first occurrence of 'class="price">' and just takes the 10 characters that follow. The author obviously never used that sort of thing for more than a couple of days, or he would know how quickly that will break and spit out rubbish.
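A hedged sketch of a less brittle alternative, in Python with only the standard library: let an HTML parser find the element whose class list contains "price" and take its text, with no fixed offsets or lengths. The class name comes from the quoted script; `PriceParser` and `extract_price` are illustrative names, and the page structure is an assumption.

```python
from html.parser import HTMLParser

class PriceParser(HTMLParser):
    """Collect the text of the first element whose class list contains 'price'."""

    def __init__(self):
        super().__init__()
        self._capturing = False
        self.price = None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.price is None and "price" in classes:
            self._capturing = True

    def handle_data(self, data):
        # Take the first non-blank text after the opening tag.
        if self._capturing and data.strip():
            self.price = data.strip()
            self._capturing = False

    def handle_endtag(self, tag):
        # Stop looking once any element closes without yielding text.
        self._capturing = False

def extract_price(html: str):
    parser = PriceParser()
    parser.feed(html)
    return parser.price  # None when no price element was found
```

This still breaks if the site renames the class, but it copes with extra attributes, whitespace, and values longer than ten characters, and it returns None where the original would fail on a missing match.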

    Finally, there is a Python script. At first glance, it looks slightly better. It uses what appears to be the Python equivalent of HTML::Parse to get links. But a closer look reveals that, to find links, it just gets the first attribute of any a tag and uses that as the link. Never mind if the 1st attribute doesn't happen to be "href".
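For comparison, a minimal sketch of looking the attribute up by name, using Python's standard `html.parser` and `urllib.parse.urljoin`. `LinkParser`, `extract_links`, and the sample markup are illustrative assumptions, not the article's code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # Look the attribute up by name, wherever it appears in the tag.
        href = dict(attrs).get("href")
        if href:
            self.links.append(urljoin(self.base_url, href))

def extract_links(html, base_url):
    parser = LinkParser(base_url)
    parser.feed(html)
    return parser.links

# Hypothetical markup: href is not the first attribute, and one <a> has none.
SAMPLE = (
    '<a class="nav" href="/about">About</a> '
    '<a name="top">anchor</a> '
    '<a href="http://example.org/x">X</a>'
)
links = extract_links(SAMPLE, "http://example.com/index.html")
```

Anchors without an href are skipped instead of contributing whatever attribute happens to come first.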

    I suppose the only point of that article was the IBM links at the end:

    Order the SEK for Linux, a two-DVD set containing the latest IBM trial software for Linux from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.

    And that is in a section for Linux developers on the IBM site? Maybe they did copy stuff from SCO after all?...

    • Okay kids... (Score:5, Informative)

      by Balinares (316703) on Wednesday November 15 2006, @05:30AM (#16849960)
      Just so people who may come across this know, if you're going to do some HTML or XHTML parsing in Python, you'd be insane not to use BeautifulSoup [crummy.com] or a similar tool.

      Example to find all links in a document:
      from BeautifulSoup import BeautifulSoup
      for tag in BeautifulSoup(html_document).findAll("a"):
        print tag["href"]
      Yes, it's that simple. For a URL opener that also handles proxies, cookies, HTTP auth, SSL and so on, look into the urllib2 module that ships natively with Python.
      • I couldn't agree more. The author also neglects to use useful standard functions like urllib.urlopen, instead building his own HTTP downloader function. He'd also do well to use urlparse.urljoin to turn a relative href attribute into an absolute URL, and urlparse.urlparse to check things like the protocol and host.

        For example:

        from BeautifulSoup import BeautifulSoup
        from urllib import urlopen
        from urlparse import urljoin, urlparse

        visited_urls = set()
        url_stack = []

        for tag in BeautifulSoup(urlop

      • I couldn't resist - in Ruby, using the beautiful (but much understated) hpricot [whytheluckystiff.net] library:

        doc = Hpricot(open(html_document))
        (doc/"a").each { |a| puts a.attributes['href'] }

        Check it out - I've been using it for a project, and it's really fast and really easy to use (supports both xpath and css for parsing links). For spidering you should check out the ruby mechanize [rubyforge.org] library (which is like perl's www-mechanize, but also uses hpricot, making parsing the returned document much easier).

    • Finally, there is a Python script. At first glance, it looks slightly better. It uses what appears to be the Python equivalent of HTML::Parse to get links. But a closer look reveals that, to find links, it just gets the first attribute of any a tag and uses that as the link. Never mind if the 1st attribute doesn't happen to be "href".

      What bugs me the most about this article is that the author keeps using the most generic libraries he can find instead of something written for this exact task. He should h

    • The first script is merely ridiculous: 12 lines of code (not counting empty and comment lines) to do:

      HEAD slashdot.org | grep 'Server: '

      This code won't catch 404s and other errors. Theirs will. Furthermore, assuming the Ruby library is conformant, their code can deal with multi-line headers, while yours would break.
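A rough Python sketch of what that buys you: check the status line first, then let a real header parser handle folded (multi-line) headers. It works on raw bytes so it runs without a network; `server_header` and the sample responses are hypothetical, and whitespace in folded values is collapsed here purely for display.

```python
import io
from http.client import parse_headers

def server_header(raw_head: bytes) -> str:
    """Return the Server header from a raw HTTP response head, or raise on errors."""
    stream = io.BytesIO(raw_head)
    status_line = stream.readline().decode("iso-8859-1").strip()
    status = int(status_line.split(" ", 2)[1])
    if not 200 <= status < 300:
        raise ValueError(f"request failed: {status_line}")
    # parse_headers understands continuation lines that start with whitespace.
    headers = parse_headers(stream)
    value = headers.get("Server", "")
    return " ".join(value.split())  # collapse the fold into single spaces

# Hypothetical response heads for illustration.
OK_HEAD = (
    b"HTTP/1.1 200 OK\r\n"
    b"Server: Apache/1.3.37\r\n"
    b" (Unix)\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
)
NOT_FOUND_HEAD = b"HTTP/1.1 404 Not Found\r\nServer: Apache/1.3.37\r\n\r\n"
server = server_header(OK_HEAD)
```

A grep for 'Server: ' would print only the first line of the folded header and would happily report the server of a 404 page.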

      Things like grep aren't suitable for parsing HTTP responses. You might get results for simple cases, but there are all kinds of corner cases out there that require a proper script

    • Indeed. For most of my simple spidering needs I've found Perl's WWW::Mechanize to be a dream. I say what I mean: go get this page, find a link labeled "Today's Story" and follow it, on the resulting page find the second form and fill in the username and password fields with $username and $password, click submit, return the resulting page. I've found it useful for scraping sites with regular updates that have unpredictable URLs but constant links. Perl.com's "Screen-scraping with WWW::Mechanize [perl.com]" is a good
      • by rduke15 (721841) <rduke15@gmai l . c om> on Wednesday November 15 2006, @05:42AM (#16849994)
        what exactly is HEAD slashdot.org

        It's a (perl) script which comes with libwww-perl [linpro.no] which either is now part of the standard Perl distribution, or is installed by default in any decent Linux distribution.

        If you don't have HEAD, you can type a bit more and get the server with LWP::Simple's head() method (then you don't need grep):

        $ perl -MLWP::Simple -e '$s=(head "http://slashdot.org/" )[4]; print $s'

        Either way is better than those useless 12 lines of ruby (I'm sure ruby can also do the same in a similarly simple way, but that author just doesn't have a clue)
        • Hah, I must have written at least a few dozen lib-www scripts, but I didn't know about HEAD.

          I always used lynx -source -head http://slashdot.org/ [slashdot.org] which is a lot more typing...

          Thanks,
          X.
  • Ah, I can see it clearly now!

    1. Post to Slashdot a decoy article (it includes Linux in the subject) with new spam tricks
    2. Watch if spam increases 30% over the next few days
    3. Bribe Cowboy Neal with 10G midget lesbian pr0n and get IP addresses of the article's readers
    4. Load shotgun and make the world a better place!
  • An app to find broken links on your web site.

    Checking links with LinkCheck
    http://world.std.com/~swmcd/steven/perl/pm/lc/linkcheck.html [std.com]

  • Should be: "How Not ..."

    I don't think I am alone in my thinking
  • Why not Nutch?

    http://lucene.apache.org/nutch/ [apache.org]