How To Build a Web Spider On Linux

Posted by kdawson on Wed Nov 15, 2006 02:13 AM
from the five-eyes dept.
IdaAshley writes, "Web spiders are software agents that traverse the Internet gathering, filtering, and potentially aggregating information for a user. This article shows you how to build spiders and scrapers for Linux to crawl a Web site and gather information, stock data, in this case. Using common scripting languages and their collection of Web modules, you can easily develop Web spiders."
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • Hmm... (Score:5, Funny)

    by joe_cot (1011355) on Wednesday November 15 2006, @02:15AM (#16849238)
    (http://www.joeterranova.net/)
    Yes, but does it run on ... damn.
    • Re:Hmm... by martin-boundary (Score:2) Wednesday November 15 2006, @06:05AM
    • Re:Hmm... by Fordiman (Score:2) Wednesday November 15 2006, @09:00AM
      • Re:Hmm... by moro_666 (Score:2) Wednesday November 15 2006, @11:18AM
        • Re:Hmm... by chromatic (Score:1) Wednesday November 15 2006, @02:43PM
          • Re:Hmm... by moro_666 (Score:2) Tuesday November 21 2006, @04:34AM
        • Re:Hmm... by try_anything (Score:2) Wednesday November 15 2006, @03:56PM
        • Re:Hmm... by Fordiman (Score:2) Thursday November 16 2006, @09:15PM
        • Re:Hmm... by RLatimer (Score:1) Saturday November 18 2006, @10:40PM
      • Re:Hmm... by strstrep (Score:3) Wednesday November 15 2006, @11:37AM
    • Re:Hmm... by lpcustom (Score:2) Wednesday November 15 2006, @08:07AM
  • Crawling efficiently (Score:5, Informative)

    by BadAnalogyGuy (945258) <BadAnalogyGuy@gmail.com> on Wednesday November 15 2006, @02:21AM (#16849262)
    Their example of a web crawler uses a queue to hold links. Since a link may appear twice, they use a lookup to scan the queue to see if the link is already loaded, and discard it if so.

    Better to use an associative array to record the links already seen, since a lookup is O(1) on average. Scanning the queue costs O(n) per lookup, and since you check every link, the total over n links is O(n^2) in the worst case. A hash (associative array) performs the same n checks in O(n) total.
    /([\W_\-]@\W+)/gs
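    A minimal sketch of that idea in Python: keep the queue for visiting order and a separate set for the "have I seen this link?" check. The get_links callback and the toy link graph are assumptions made up for illustration; they are not code from the article.

```python
from collections import deque


def crawl_order(start, get_links):
    """Breadth-first traversal that visits each link at most once.

    get_links is a hypothetical callback: it takes a URL and returns
    the list of URLs found on that page.
    """
    seen = {start}            # set membership is O(1) on average
    queue = deque([start])    # the queue only decides visiting order
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:   # no scan of the queue needed
                seen.add(link)
                queue.append(link)
    return order


if __name__ == "__main__":
    # Toy in-memory "web" so the sketch runs without network access.
    toy_web = {
        "a": ["b", "c", "b"],
        "b": ["a", "c"],
        "c": ["d", "a"],
        "d": [],
    }
    print(crawl_order("a", lambda url: toy_web.get(url, [])))
```

    In a real spider, get_links would fetch and parse the page; the dedupe logic stays the same.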
  • The 90s called (Score:5, Funny)

    by dave562 (969951) on Wednesday November 15 2006, @02:21AM (#16849264)
    They want their technology back.
    • Re:Obligatory by poormanjoe (Score:1) Wednesday November 15 2006, @02:50AM
      • Re:Obligatory by k33l0r (Score:2) Wednesday November 15 2006, @06:58AM
        • Re:Obligatory by Bloke down the pub (Score:1) Wednesday November 15 2006, @07:45AM
        • Re:Obligatory by manastungare (Score:1) Wednesday November 15 2006, @09:19AM
        • Re:Obligatory by tehcyder (Score:1) Thursday November 16 2006, @06:14AM
  • What's the point? (Score:2)

    by XorNand (517466) * on Wednesday November 15 2006, @02:23AM (#16849274)
    Why would anyone need to write a simple spider nowadays? In 2006, there has to be a better way than just following links. For example, it would be interesting to see something that crawled the various social bookmarking sites and correlated the various terms. Say User A on Delicious and User B on StumbleUpon both bookmark a link about Pink Floyd and another one about Led Zep. If I'm searching for something about Floyd, the system could recommend some cool info about Led Zep too. (Email me if you need to know where to send my royalty checks).
    • Re:What's the point? by Anonymous Coward (Score:1) Wednesday November 15 2006, @02:59AM
    • Actually... (Score:4, Interesting)

      by SanityInAnarchy (655584) <ninja@slaphack.com> on Wednesday November 15 2006, @03:52AM (#16849602)
      (Last Journal: Tuesday October 30, @10:59AM)
      Some websites do not have good search functionality. Sometimes it's an area that Google doesn't crawl (robots.txt and such), and sometimes I'm looking for something very, very specific.

      Regardless, I do, in fact, build spiders. For instance, in an MMO I play, all users can have webpages, so it's very useful to have a spider as part of a clan/guild/whatever to crawl the webpages looking for users who have illegal items and such. In a more general way, there is a third-party site which collects vital statistics of everyone who puts those in their user page, so you can get lists of the most powerful people in the game, the richest people, etc.
      • Re:Actually... by pan_piper (Score:1) Wednesday November 15 2006, @01:28PM
        • Re:Actually... by SanityInAnarchy (Score:2) Wednesday November 15 2006, @02:47PM
    • Re:What's the point? by The_Wilschon (Score:2) Wednesday November 15 2006, @08:49AM
    • Re:What's the point? by brunascle (Score:1) Wednesday November 15 2006, @10:47AM
    • Re:What's the point? by try_anything (Score:2) Wednesday November 15 2006, @04:25PM
  • downloads (Score:5, Informative)

    by Bananatree3 (872975) on Wednesday November 15 2006, @02:30AM (#16849298)

    For those of us who don't have them, here are the basics:

    Wget: http://www.gnu.org/software/wget/ [gnu.org]

    Curl: http://curl.haxx.se/ [curl.haxx.se]
  • Hardly linux-specific (Score:5, Insightful)

    by h_benderson (928114) on Wednesday November 15 2006, @02:57AM (#16849384)
    All my love for Linux aside, this has nothing to do with Linux the kernel (or even GNU/Linux, the OS). It works just as well on any other Unix derivative or even Windows.
  • some points (Score:5, Interesting)

    by cucucu (953756) on Wednesday November 15 2006, @02:59AM (#16849396)
    • Don't forget to check and respect robots.txt [robotstxt.org]. Python [python.org] has a module [python.org] that helps you parse that file.
    • Samie [sourceforge.net] and its Python port Pamie [sourceforge.net] are your friends. You can automate IE so your script is treated as a human and not discriminated against as a robot.
    • I use such beasts to do one-click time reporting at work and one-click cartoon collecting in my favorite newspaper.
    • And once I even repeatedly voted on an online poll and changed the course of history.
    • Ah, yes, TFA was about building a spider on Linux. I didn't check if my one-click IE scripts work on IE/Wine/Linux.
    • If I write a one-click script for online shopping, does it infringe the infamous Amazon patent?
    • When will Firefox's automation capabilities match those of IE?
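    On the robots.txt point in the list above: in current Python the parser lives in urllib.robotparser. A small sketch, assuming a made-up robots.txt body and a made-up user-agent name ("ExampleSpider"), both inlined so it runs offline:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_checker(robots_text, user_agent):
    """Return a function that says whether user_agent may fetch a URL."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return lambda url: parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    allowed = make_checker(ROBOTS_TXT, "ExampleSpider")
    print(allowed("http://example.com/index.html"))
    print(allowed("http://example.com/private/data.html"))
```

    Against a live site you would normally call set_url() with the site's robots.txt address and then read() instead of parsing an inline string.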
    • Re:some points by coaxial (Score:1) Wednesday November 15 2006, @03:41AM
    • Re:some points by SanityInAnarchy (Score:2) Wednesday November 15 2006, @03:49AM
    • Re:some points by VGPowerlord (Score:2) Wednesday November 15 2006, @03:53AM
      • Re:some points by cucucu (Score:1) Wednesday November 15 2006, @04:06AM
    • Re:some points (Score:4, Informative)

      by killjoe (766577) on Wednesday November 15 2006, @04:09AM (#16849694)
      "When will Firefox's automation capabilities match those of IE?"

      It's always had it. Look up XUL some day. The entire browser is written in XUL.
    • Re:some points by Gr8Apes (Score:1) Wednesday November 15 2006, @09:15AM
    • Re:some points by IchBinEinPenguin (Score:2) Wednesday November 15 2006, @05:25PM
    • Re:some points by jdigriz (Score:2) Thursday November 16 2006, @07:23PM
  • by Channard (693317) on Wednesday November 15 2006, @03:23AM (#16849466)
    Dammit, I was hoping this article was about the evolution of Dr Weird's phone spiders, mechanical creatures that could be sent down your cable line to maul anyone sending you phishing emails and spam.
  • Oh sweet Jesus! (Score:3, Insightful)

    by msormune (808119) on Wednesday November 15 2006, @03:25AM (#16849474)
    Pull the article out. The last thing we need is more indexing bots.
  • crawling is not so trivial (Score:2, Interesting)

    by cucucu (953756) on Wednesday November 15 2006, @03:33AM (#16849502)
    As the two students who started a little web search company found out, crawling the web is not trivial: http://infolab.stanford.edu/~backrub/google.html [stanford.edu]. An excerpt follows.


    Running a web crawler is a challenging task. There are tricky performance and reliability issues and even more importantly, there are social issues. Crawling is the most fragile application since it involves interacting with hundreds of thousands of web servers and various name servers which are all beyond the control of the system.

    In order to scale to hundreds of millions of web pages, Google has a fast distributed crawling system. A single URLserver serves lists of URLs to a number of crawlers (we typically ran about 3). Both the URLserver and the crawlers are implemented in Python. Each crawler keeps roughly 300 connections open at once. This is necessary to retrieve web pages at a fast enough pace. At peak speeds, the system can crawl over 100 web pages per second using four crawlers. This amounts to roughly 600K per second of data. A major performance stress is DNS lookup. Each crawler maintains its own DNS cache so it does not need to do a DNS lookup before crawling each document. Each of the hundreds of connections can be in a number of different states: looking up DNS, connecting to host, sending request, and receiving response. These factors make the crawler a complex component of the system. It uses asynchronous IO to manage events, and a number of queues to move page fetches from state to state.

    It turns out that running a crawler which connects to more than half a million servers, and generates tens of millions of log entries generates a fair amount of email and phone calls. Because of the vast number of people coming on line, there are always those who do not know what a crawler is, because this is the first one they have seen. Almost daily, we receive an email something like, "Wow, you looked at a lot of pages from my web site. How did you like it?" There are also some people who do not know about the robots exclusion protocol, and think their page should be protected from indexing by a statement like, "This page is copyrighted and should not be indexed", which needless to say is difficult for web crawlers to understand. Also, because of the huge amount of data involved, unexpected things will happen. For example, our system tried to crawl an online game. This resulted in lots of garbage messages in the middle of their game! It turns out this was an easy problem to fix. But this problem had not come up until we had downloaded tens of millions of pages. Because of the immense variation in web pages and servers, it is virtually impossible to test a crawler without running it on a large part of the Internet. Invariably, there are hundreds of obscure problems which may only occur on one page out of the whole web and cause the crawler to crash, or worse, cause unpredictable or incorrect behavior. Systems which access large parts of the Internet need to be designed to be very robust and carefully tested. Since large complex systems such as crawlers will invariably cause problems, there needs to be significant resources devoted to reading the email and solving these problems as they come up.
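    The per-crawler DNS cache mentioned in the excerpt can be pictured with a very small sketch. This is not Google's code, only an illustration; the DnsCache class and its injectable resolver argument are names invented here, and the sketch ignores TTL expiry, which a long-running crawler would need.

```python
import socket


class DnsCache:
    """Remember host -> address answers so each host is resolved once."""

    def __init__(self, resolver=socket.gethostbyname):
        self._resolver = resolver   # injectable, so tests need no network
        self._cache = {}

    def lookup(self, host):
        # Only the first request for a host reaches the resolver.
        if host not in self._cache:
            self._cache[host] = self._resolver(host)
        return self._cache[host]
```

    A crawler would call lookup() before opening each connection, so thousands of pages on one host cost a single DNS query.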
  • Quality of article? (Score:2, Insightful)

    by interp (815933) on Wednesday November 15 2006, @03:36AM (#16849514)
    I've never programmed in Ruby, but I think the comment in Listing 1 says it all:
    "Iterate through response hash"

    Why would somebody want to do that?
    A quick net search "reveals": A simple resp["server"] is all you need.
    Maybe the article was meant to be posted on thedailywtf.com?
  • Re-inventing a square wheel (Score:5, Insightful)

    by rduke15 (721841) <rduke15&gmail,com> on Wednesday November 15 2006, @03:48AM (#16849582)

    Basically, the article gives you ruby and python examples of how to get web pages, and (badly) parse them for information. The same thing everyone has been doing for at least a decade with Perl and the appropriate modules, or whatever other tools, except that most know how to do it correctly.

    The first script is merely ridiculous: 12 lines of code (not counting empty and comment lines) to do:

    HEAD slashdot.org | grep 'Server: '

    But it gets worse. To extract a quote from a page, the second script suggests this:

    stroffset = resp.body =~ /class="price">/
    subset = resp.body.slice(stroffset+14, 10)
    limit = subset.index('<')
    print ARGV[0] + " current stock price " + subset[0..limit-1] +
    " (from stockmoney.com)\n"

    You don't need to know ruby to see what it does: it looks for the first occurrence of 'class="price">' and just takes the 10 characters that follow, cut at the first '<'. The author obviously never used that sort of thing for more than a couple of days, or he would know how quickly that will break and spit out rubbish.
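    For contrast, a less brittle way to pull out that value is to let an HTML parser find the element and collect its text. The sketch below uses Python's standard html.parser; the PriceParser class, extract_price function and the sample markup are assumptions for illustration, and the only thing taken from the article is the class="price" marker.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PriceParser(HTMLParser):
    """Collect the text of the first element whose class list has 'price'."""

    def __init__(self):
        super().__init__()
        self._depth = 0       # > 0 while inside the price element
        self._chunks = []
        self.price = None     # stays None if no price element is found

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1
        elif self.price is None:
            classes = (dict(attrs).get("class") or "").split()
            if "price" in classes:
                self._depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            self.price = "".join(self._chunks).strip()

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def extract_price(html):
    parser = PriceParser()
    parser.feed(html)
    parser.close()
    return parser.price


if __name__ == "__main__":
    print(extract_price('<td class="price">42.17</td>'))
```

    It still depends on the site keeping that class name, but it no longer cares about attribute order, value length, or a missing match (it returns None instead of garbage).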

    Finally, there is a Python script. At first glance, it looks slightly better. It uses what appears to be the Python equivalent of HTML::Parse to get links. But a closer look reveals that, to find links, it just gets the first attribute of any a tag and uses that as the link. Never mind if the 1st attribute doesn't happen to be "href".
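    A sketch of the fix, using only the standard library: look the attribute up by name rather than by position. LinkParser, extract_links and the sample page are names and data invented for this example, not the article's code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect absolute URLs from the href attribute of every a tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # Look the attribute up by name instead of trusting its position.
        href = dict(attrs).get("href")
        if href:
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    parser = LinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    page = '<a class="nav" href="/about">About</a><a name="top">Top</a>'
    print(extract_links(page, "http://example.com/index.html"))
```

    Anchors whose first attribute is class or name, or which have no href at all, no longer produce bogus links.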

    I suppose the only point of that article was the IBM links at the end:

    Order the SEK for Linux, a two-DVD set containing the latest IBM trial software for Linux from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.

    And that is in a section for Linux developers on the IBM site? Maybe they did copy stuff from SCO after all?...

    • Re:Re-inventing a square wheel by biffta (Score:1) Wednesday November 15 2006, @04:30AM
    • Re:Re-inventing a square wheel by kayditty (Score:1) Wednesday November 15 2006, @05:11AM
      • Re:Re-inventing a square wheel (Score:5, Insightful)

        by rduke15 (721841) <rduke15&gmail,com> on Wednesday November 15 2006, @05:42AM (#16849994)
        what exactly is HEAD slashdot.org

        It's a (perl) script which comes with libwww-perl [linpro.no] which either is now part of the standard Perl distribution, or is installed by default in any decent Linux distribution.

        If you don't have HEAD, you can type a bit more and get the server with LWP::Simple's head() method (then you don't need grep):

        $ perl -MLWP::Simple -e '$s=(head "http://slashdot.org/" )[4]; print $s'

        Either way is better than those useless 12 lines of ruby. (I'm sure ruby can also do the same in a similarly simple way, but that author just doesn't have a clue.)
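        For comparison, roughly the same thing in Python with the standard http.client module. The server_header function name and its defaults are made up for this sketch; it speaks plain HTTP only and does not follow redirects.

```python
import http.client


def server_header(host, port=80, timeout=10):
    """Send a HEAD request for "/" and return the Server header (or None)."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("HEAD", "/")
        return conn.getresponse().getheader("Server")
    finally:
        conn.close()
```

        Calling server_header("slashdot.org") plays the role of the HEAD | grep pipeline; what it returns depends on whatever the server reports at the time.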
    • Okay kids... (Score:5, Informative)

      by Balinares (316703) on Wednesday November 15 2006, @05:30AM (#16849960)
      (http://slashdot.org/)
      Just so people who may come across this know, if you're going to do some HTML or XHTML parsing in Python, you'd be insane not to use BeautifulSoup [crummy.com] or a similar tool.

      Example to find all links in a document:
      from BeautifulSoup import BeautifulSoup
      # href=True skips anchors that have no href attribute
      for tag in BeautifulSoup(html_document).findAll("a", href=True):
        print tag["href"]
      Yes, it's that simple. For a URL opener that also handles proxies, cookies, HTTP auth, SSL and so on, look into the urllib2 module that ships natively with Python.
    • Re:Re-inventing a square wheel by matvei (Score:2) Wednesday November 15 2006, @06:07AM
    • Re:Re-inventing a square wheel by Bogtha (Score:2) Wednesday November 15 2006, @07:32AM
    • Re:Re-inventing a square wheel by DJDutcher (Score:1) Wednesday November 15 2006, @09:34AM
    • Re:Re-inventing a square wheel by ChaosDiscord (Score:2) Wednesday November 15 2006, @12:40PM
  • It's a trap! (Score:2, Funny)

    by radu.stanca (857153) <radu.stanca@gmail. c o m> on Wednesday November 15 2006, @04:11AM (#16849712)
    (http://www.starlog.ro/)
    Ah, I can see it clearly now!

    1. Post to Slashdot a decoy article (it includes Linux in the subject) with new spam tricks
    2. Watch if spam increases 30% over the next few days
    3. Bribe Cowboy Neal with 10G midget lesbian pr0n and get IP addresses of the article's readers
    4. Load shotgun and make the world a better place!
  • by xarak (458209) on Wednesday November 15 2006, @05:05AM (#16849886)

    I guess most male CS students will have coded something similar at least once to D/L pr0n.

    I did one in shell and one in TCL/TK.
  • User-Agent (Score:1, Troll)

    by Joebert (946227) on Wednesday November 15 2006, @05:36AM (#16849976)
    They forgot to set the User-Agent header to IE.
  • by swm (171547) <swmcd@world.std.com> on Wednesday November 15 2006, @07:37AM (#16850502)
    (http://world.std.com/~swmcd/steven/)
    An app to find broken links on your web site.

    Checking links with LinkCheck
    http://world.std.com/~swmcd/steven/perl/pm/lc/linkcheck.html [std.com]
  • Reinventing the wheel (Score:1, Interesting)

    by Anonymous Coward on Wednesday November 15 2006, @07:37AM (#16850504)
    I know, I know. Flame me. But I found Heritrix http://crawler.archive.org/ [archive.org] to be a very polished package. I used it for my Master's research and found that it is very extensible. Useful if you are doing real crawling, i.e. not concentrating on one site.
  • Incorrect Title (Score:2)

    by OneSmartFellow (716217) on Wednesday November 15 2006, @07:53AM (#16850604)

    Should be: "How Not ..."

    I don't think I am alone in my thinking
  • by praxis22 (681878) on Wednesday November 15 2006, @09:37AM (#16851742)
    (http://www.livejournal.com/users/praxis22)
    No Starch Press is releasing a book about this soon; they had a mockup on display at the Frankfurt book fair.
  • Nutch (Score:2)

    by Dante (3418) on Wednesday November 15 2006, @10:42AM (#16852672)
    (Last Journal: Wednesday March 23 2005, @05:01PM)
    Why not Nutch?

    http://lucene.apache.org/nutch/ [apache.org]
  • by johnpeb (940443) on Wednesday November 15 2006, @12:18PM (#16854476)
    (http://peberdy.ca/jp)
    Once I had to collect a lot of info from a website. I used Java and wget and some Java HTML parser library (possibly JTidy). Anyway, the code was very short and clean. I'd recommend DOM walking over other solutions when the data isn't trivial.
  • screen-scraper (Score:1)

    by toddcw (134666) on Wednesday November 15 2006, @01:25PM (#16855688)
    screen-scraper (http://www.screen-scraper.com/ [screen-scraper.com]) runs fabulously on Linux, and integrates well with most modern programming languages. It can save all kinds of time over writing Perl and Python scripts. There's a free (as in beer) version available, and a pro version if more features are wanted.
  • I did similar things in college with Perl. (shudders*) The programs were OS-neutral; I think I developed mine in Windows under Cygwin.

    *Yes, I know Slashdot is written in Perl.

  • Re:Just what the internet needs... (Score:3, Informative)

    by ComaVN (325750) on Wednesday November 15 2006, @03:28AM (#16849486)
    I think that's robots.txt, *not* spider.txt
  • by scdeimos (632778) on Wednesday November 15 2006, @03:46AM (#16849560)
    How does "spider.txt" get an Insightful when it's "robots.txt"? Sheesh, bump the Mods Roster.
  • Re:I hate (Score:1)

    by kfg (145172) on Wednesday November 15 2006, @04:23AM (#16849748)
    Just because you're paranoid doesn't mean people aren't crawling your site.

    KFG