  • Robots crawling (In: Members Lounge)
    Moderator(s): OAC, Dinkar

    forumcrawler
    Joined: Jun 04, 2002
    # Posts: 1


    Posted: 2002-Jun-04 16:08

    Some site operators are complaining that I'm crawling their sites with too high a rate of requests and slowing their sites down. Besides obeying robots.txt, what is the proper netiquette for crawling a site (i.e. the rate of crawling)?



    figment88
    Joined: Feb 14, 2002
    # Posts: 389


    Posted: 2002-Jun-04 17:50

    I'm not an expert in this area, but recently I've seen:

    1) I have an offsite search. One of its configuration options is a delay between requests.

    2) Googlebot seems to have a limit on the number of pages it grabs at once. I don't know whether this is based on total pages or not.

    3) Take notice of the time that you crawl. Most sites have their busiest traffic from 9-12am E.S.T. Try not to crawl during those hours, and try not to crawl during the morning PST either. If you can, limit crawling to weekends only.
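To make point 3 concrete, here is a minimal sketch of a "should I crawl now?" check. The function name `is_polite_time`, its parameters, and the default busy window (read here as 9:00 to 12:00 in the site's local time) are assumptions for illustration only; adjust them to whatever the site operator tells you.

```python
from datetime import datetime

def is_polite_time(when, busy_start=9, busy_end=12, weekends_only=False):
    """Return True if `when` falls outside the assumed busy window.

    `when` is a datetime expressed in the target site's local time.
    `busy_start` and `busy_end` are hours (0-23); the window is
    busy_start <= hour < busy_end. Both defaults are assumptions.
    """
    # Monday is 0 and Sunday is 6, so 5 and 6 are the weekend.
    if weekends_only and when.weekday() < 5:
        return False
    return not (busy_start <= when.hour < busy_end)

# Hypothetical usage: only start a crawl batch when the check passes.
if __name__ == "__main__":
    print(is_polite_time(datetime.now()))
```

A crawler could call this before each batch and simply sleep until the next quiet period when it returns False.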



    WEBStock
    Joined: Apr 25, 2002
    # Posts: 32


    Posted: 2002-Jun-05 06:43

    I believe Googlebot DOES set its speed based on the size of the site (or at least that's part of it). At the beginning of a crawl it sends a few hits from one or two IP addresses, at about 1 per minute. As she finds more food (there are close to a million pages at my site) she sends out more and more spiders. By the end of the crawl (usually only between 20K and 40K of the pages, depending on the planned "depth of the crawl", I suspect) I'm getting 200+ hits an hour from her. Then she reaches her max time allowed (or max pages allowed) limit and goes away, returning only to tag my front page every 3-5 days to see whether it's fresh.

    G.



    Fishi
    Joined: Dec 21, 1999
    # Posts: 253


    Posted: 2002-Jun-05 10:13

    There should be a delay of at least several seconds between two requests to the same site (host, IP, ...). The longer the better. Try round-robin crawling: fetch a page from site 1, then one from site 2, and so on.

    If your crawler crawls forums, as your nick suggests, then you should be especially careful, because forum sites are often driven by databases and every request can slow down the server.
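The round-robin idea with a per-site delay could be sketched roughly like this. The name `round_robin_schedule`, the 5-second default, and grouping by the URL's host name are assumptions made for the example, not anyone's actual crawler; the `clock` and `sleep` parameters exist only so the timing can be swapped out.

```python
import time
from collections import deque
from urllib.parse import urlparse

def round_robin_schedule(urls, min_delay=5.0, clock=time.monotonic, sleep=time.sleep):
    """Yield URLs one site at a time, pausing so the same host is never
    requested more often than once every `min_delay` seconds."""
    # Group the URLs into one FIFO queue per host.
    queues = {}
    for url in urls:
        queues.setdefault(urlparse(url).netloc, deque()).append(url)

    hosts = deque(queues)   # rotation order of hosts
    last_hit = {}           # host -> time of the last request

    while hosts:
        host = hosts.popleft()
        # Time still to wait before this host may be hit again.
        wait = last_hit.get(host, float("-inf")) + min_delay - clock()
        if wait > 0:
            sleep(wait)
        last_hit[host] = clock()
        yield queues[host].popleft()
        # Put the host back at the end of the rotation if it has more URLs.
        if queues[host]:
            hosts.append(host)
```

With many different sites in the list, the rotation alone usually spaces out requests to any single host, so the explicit sleep only kicks in when few hosts are left.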



    runarb
    Joined: Eons Ago
    # Posts: 1


    Posted: 2003-Jun-02 10:57

    The boitho crawler ( www.boitho.com ) uses a 30-second delay between requests to the same server. The server that distributes the URLs to the crawlers keeps a hash table of the most recently crawled servers. If a URL is on a server that has had a request recently, the URL is put in a queue.
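A rough sketch of that kind of distributor, for illustration only: this is not boitho's code, and the class name `Dispatcher`, its methods, and the way the queue is drained are all assumptions. The post only says that a hash table of recent servers is kept and that URLs for recently hit servers are queued.

```python
import time
from collections import deque
from urllib.parse import urlparse

class Dispatcher:
    """Hands out URLs, deferring those whose server was hit recently."""

    def __init__(self, delay=30.0, clock=time.monotonic):
        self.delay = delay        # seconds between requests to one server
        self.clock = clock
        self.last_request = {}    # hash table: host -> time of last hand-out
        self.deferred = deque()   # URLs waiting for their server to cool down

    def _ready(self, host):
        last = self.last_request.get(host)
        return last is None or self.clock() - last >= self.delay

    def submit(self, url):
        """Return `url` if it may be crawled now, else queue it and return None."""
        host = urlparse(url).netloc
        if self._ready(host):
            self.last_request[host] = self.clock()
            return url
        self.deferred.append(url)
        return None

    def release(self):
        """Return queued URLs whose server is ready again (assumed drain step)."""
        ready = []
        for _ in range(len(self.deferred)):
            url = self.deferred.popleft()
            host = urlparse(url).netloc
            if self._ready(host):
                self.last_request[host] = self.clock()
                ready.append(url)
            else:
                self.deferred.append(url)  # still too soon, keep waiting
        return ready
```

Calling `release()` periodically hands out at most one queued URL per server per delay period, so a burst of URLs for one site gets spread out over time.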


    © 1995  ·  iWeb, Inc  ·  DBA JimWorld Productions