| Member |
Message |
forumcrawler
Joined: Jun 04, 2002
# Posts: 1
|
Posted: 2002-Jun-04 16:08
Some site operators are complaining that I'm crawling their sites too fast and slowing them down. What is the proper netiquette, besides obeying robots.txt, for crawling a site (i.e. what request rate is acceptable)?
|
 |
figment88
Joined: Feb 14, 2002
# Posts: 389
|
Posted: 2002-Jun-04 17:50
I'm not an expert in this area, but recently I've seen:
1) I have an offsite search. One of the configuration options is a delay between requests.
2) Googlebot seems to have a limit on the number of pages it grabs at once. I don't know if this is based on total pages or not.
3) Take notice of the time that you crawl. Most sites have their busiest traffic from 9-12am E.S.T. Try not to crawl during those hours, and try not to crawl during the morning PST either. If you can, limit crawling to weekends only.
|
 |
WEBStock
Joined: Apr 25, 2002
# Posts: 32
|
Posted: 2002-Jun-05 06:43
I believe Googlebot DOES set its speed based on the size of the site (or at least that's a part of it). At the beginning of a crawl it sends a few hits from one or two IP addresses, at about 1 per minute. As she finds more food (there are close to a million pages at my site) she sends out more and more spiders. By the end of the crawl (usually only between 20K and 40K of the pages, depending upon the planned "depth of the crawl", I suspect) I'm getting 200+ hits an hour from her. Then she reaches her max time allowed (or max pages allowed) limit and goes away, only coming back to tag my front page every 3-5 days to see if it's fresh or not.
G.
|
 |
Fishi
Joined: Dec 21, 1999
# Posts: 253
|
Posted: 2002-Jun-05 10:13
There should be a delay of at least several seconds between two requests to the same site (host, IP, ...). The longer the better. Try to do round-robin crawling: fetch a page from site 1, then another one from site 2, and so on.
If your crawler crawls forums, as your nick suggests, then you should be especially careful, because forum sites are often driven by databases and every request can slow down the server.
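For what it's worth, the round-robin idea above could be sketched roughly like this in Python. This is only an illustration, not code from any real crawler: the names round_robin and seconds_to_wait are made up, and the 5-second default is just one reading of "several seconds".

```python
from collections import OrderedDict, deque
from urllib.parse import urlsplit


def round_robin(urls):
    """Yield URLs so that consecutive fetches rotate across hosts."""
    queues = OrderedDict()  # host -> deque of pending URLs, in first-seen order
    for url in urls:
        host = urlsplit(url).netloc.lower()
        queues.setdefault(host, deque()).append(url)
    while queues:
        # Take one URL from each host per pass; drop hosts that run empty.
        for host in list(queues):
            yield queues[host].popleft()
            if not queues[host]:
                del queues[host]


def seconds_to_wait(last_fetch, now, min_delay=5.0):
    """How long to sleep before hitting a host again (0 if enough time passed).

    min_delay=5.0 is an assumed value, not a recommendation from the thread.
    """
    if last_fetch is None:
        return 0.0
    return max(0.0, min_delay - (now - last_fetch))
```

A crawler loop would walk round_robin(urls), remember the time of the last fetch per host, and sleep for seconds_to_wait(...) before each request. Note that once only one host is left in the rotation, the per-host delay is the only thing slowing the crawler down, so the two pieces are meant to be used together.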
|
 |
runarb
Joined: Eons Ago
# Posts: 1
|
Posted: 2003-Jun-02 10:57
The boitho crawler ( www.boitho.com ) uses a 30-second delay between requests to the same server. The server that distributes the URLs to the crawlers keeps a hash table of the most recently crawled servers. If a URL is on a server that has had a request recently, the URL is put in a queue.
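A minimal sketch of that kind of distributor, in Python, might look like the following. This is not boitho's actual code: the class name UrlDistributor, its methods, and the injectable clock are all assumptions made for illustration. Only the 30-second delay, the hash table of recent servers, and the queue come from the description above.

```python
import time
from collections import deque
from urllib.parse import urlsplit


class UrlDistributor:
    """Hands out URLs, holding back those whose server was hit too recently."""

    def __init__(self, delay=30.0, clock=time.monotonic):
        self.delay = delay
        self.clock = clock           # injectable so the logic can be tested
        self.last_request = {}       # hash table: host -> time of last dispatch
        self.deferred = deque()      # URLs waiting for their host to cool down

    def _ready(self, host, now):
        last = self.last_request.get(host)
        return last is None or now - last >= self.delay

    def submit(self, url):
        """Return url if it may be crawled now, else queue it and return None."""
        host = urlsplit(url).netloc.lower()
        now = self.clock()
        if self._ready(host, now):
            self.last_request[host] = now
            return url
        self.deferred.append(url)
        return None

    def release(self):
        """Return queued URLs whose host is ready again; keep the rest queued."""
        now = self.clock()
        ready, waiting = [], deque()
        for url in self.deferred:
            host = urlsplit(url).netloc.lower()
            if self._ready(host, now):
                self.last_request[host] = now
                ready.append(url)
            else:
                waiting.append(url)
        self.deferred = waiting
        return ready
```

The distributor would call submit() for each newly discovered URL and call release() periodically to hand out queued URLs once their server's delay has elapsed. Because release() updates the hash table as it goes, two queued URLs for the same server are never released in the same pass.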
|
 |