Search Engine Programming?

Posted By: colt45 ()
Posted On: 2004-Jun-10 18:03

Hi All,

I need help with a problem I am having.

How do I get my linking program, which is written in PHP, to spider an entire site starting from the root, looking for my link? It should stop as soon as it finds the link, or give up after a set amount of time if the link is not found.

This is blowing my mind. Any help will be greatly appreciated.

Thanks,
Colt45


Posted By: Sinoed ()
Posted On: 2004-Jul-04 13:49

Well, what you're talking about is a script that would take a bit of logic and time on your part to create. Essentially, you need to tell the spider to start searching for URLs. It has to take in its input (raw HTML, if you're spidering the web) and find what it needs, which means ignoring everything but <a href=""> tags. Then you need to tell it what to do when it finds a link: either it does something, such as displaying a message, or it continues on.

If you were to create something along the lines of the following, you could start by spidering one page. To spider multiple pages, you'd have to save the URLs it finds in a file or array. After it finishes each page, it would read the next URL from the list, delete it from the list, and start the search again. I don't know whether PHP is capable of spawning multiple processes; I usually think of Java as a better choice for something like that.


Code:
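The code originally posted here has not survived. Purely as an illustration of the single-page step described above, here is a minimal sketch that pulls the href out of every <a> tag and checks whether one of them is the link you are looking for. It assumes PHP 7 or later with the DOM extension enabled; the function names extractLinks and pageHasLink are made up for this example and are not from the original post.

```php
<?php
// Hypothetical helper: return the href of every <a> tag in an HTML string.
function extractLinks(string $html): array
{
    if ($html === '') {
        return []; // loadHTML() rejects empty input on some PHP versions
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '') {
            $links[] = $href; // anchors without an href are skipped
        }
    }
    return $links;
}

// Hypothetical helper: does the page contain a link to $target?
// A trailing slash is ignored when comparing; nothing else is normalised.
function pageHasLink(string $html, string $target): bool
{
    foreach (extractLinks($html) as $href) {
        if (rtrim($href, '/') === rtrim($target, '/')) {
            return true;
        }
    }
    return false;
}
```

To try it on a live page, you could pass in the result of file_get_contents() for the URL, assuming allow_url_fopen is enabled. Note that this only sees plain <a href=""> links; relative URLs are returned as written and would still need to be resolved against the page address.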





Here is a function from www.php.net that calculates the elapsed time:


Code:
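The function originally posted here has not survived either. As a rough sketch only, elapsed time can be measured by comparing two microtime(true) readings, and that can be combined with the list of saved URLs described earlier so the spider gives up after a time limit. The name crawlForLink and its three callable parameters are assumptions made for this example: $fetch is expected to return a page's HTML (or false on failure), $getLinks to return absolute URLs found in that HTML, and $isMatch to say whether the page contains your link.

```php
<?php
// Hypothetical crawl loop: work through a list of URLs from the start page,
// stopping when a page matches or when $timeLimit seconds have elapsed.
// Returns the URL of the matching page, or null if none was found in time.
function crawlForLink(
    string $startUrl,
    callable $fetch,     // assumed: URL -> HTML string, or false on failure
    callable $getLinks,  // assumed: HTML -> array of absolute URLs
    callable $isMatch,   // assumed: HTML -> bool
    float $timeLimit     // seconds
): ?string {
    $start = microtime(true);                    // start of the clock
    $host  = parse_url($startUrl, PHP_URL_HOST); // used to stay on one site
    $queue = [$startUrl];                        // URLs still to visit
    $seen  = [$startUrl => true];                // URLs already queued

    while ($queue !== []) {
        // Elapsed time is simply "now" minus the start reading.
        if (microtime(true) - $start > $timeLimit) {
            return null; // out of time
        }

        $url  = array_shift($queue); // read a URL and delete it from the list
        $html = $fetch($url);
        if ($html === false) {
            continue; // could not fetch this page, move on
        }
        if ($isMatch($html)) {
            return $url; // found the link, stop here
        }

        foreach ($getLinks($html) as $next) {
            // Only queue unseen URLs on the same host as the start page.
            if (!isset($seen[$next]) && parse_url($next, PHP_URL_HOST) === $host) {
                $seen[$next] = true;
                $queue[] = $next;
            }
        }
    }
    return null; // whole site visited without a match
}
```

The time check happens once per page, so a single slow fetch can still overrun the limit; a timeout on the fetch itself would be needed to guard against that.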





It would help you to read a little bit about how a spider is written. There is a good article on it from Developer.com.

Anyways, hope that helps get your mind around it a little. :)


Posted By: mincklerstraat ()
Posted On: 2004-Sep-16 18:45

You could also try to hack an existing spider to do this work for you - phpdig comes to mind. However, checking for reciprocal links is a common enough need that hotscripts may already have something; see if they have a 'link checker' or 'reciprocal link checker' section. I'd hate to start writing something like this from scratch. What about all those DHTML and JavaScript links? What about redirects and all that? Yuk!