Need a Scraper script for a Job site

Posted By: joe_vimal ()
Posted On: 2005-Sep-30 08:32

Has anyone heard of a scraper script that extracts job ads from many sites and dumps them into a database?

I can see tons of other scripts in all the usual places, but not one along these lines. Writing a script from scratch seems daunting. Any help would be much appreciated.


Posted By: masidani ()
Posted On: 2005-Oct-24 08:24

Joe,

I think it would have to be custom-written, I'm afraid. The reason is that the script needs to know how the HTML code on each of the sites is written in order to know where to find the job data in the HTML.

If you visit each of the job sites yourself and look at the HTML source code, you'll see that each one is different. The "screen scraper" program will need to know where to look in each page to find things like job title, salary, location etc., which will be different in each case. Hence it will need to be custom-written.

That said, a Perl program using the LWP::UserAgent library and a few regular expressions will suffice, so long as there are no login/registration procedures etc. that need to be dealt with.
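
To make the "one pattern per site" point concrete, here is a minimal sketch of the idea. It is written in Python with the standard `re` module rather than Perl and LWP::UserAgent, and it skips the fetching step by working on HTML already held in a string. The site names, the markup, and the field names are all made up for illustration; a real job site will need its own pattern, worked out from its actual HTML source.

```python
import re

# Hypothetical per-site patterns: each site marks up the same three fields
# (title, salary, location) differently, so each gets its own expression.
SITE_PATTERNS = {
    "site_a": re.compile(
        r'<h2 class="job-title">(?P<title>.*?)</h2>\s*'
        r'<span class="salary">(?P<salary>.*?)</span>\s*'
        r'<span class="location">(?P<location>.*?)</span>',
        re.DOTALL,
    ),
    "site_b": re.compile(
        r'<td class="pos">(?P<title>.*?)</td>\s*'
        r'<td class="loc">(?P<location>.*?)</td>\s*'
        r'<td class="pay">(?P<salary>.*?)</td>',
        re.DOTALL,
    ),
}


def extract_jobs(site, html):
    """Return one dict per job ad found in html, using the pattern for site."""
    pattern = SITE_PATTERNS[site]
    return [
        {field: value.strip() for field, value in match.groupdict().items()}
        for match in pattern.finditer(html)
    ]


# Made-up sample pages standing in for what a fetch would return.
SAMPLE_A = """<div><h2 class="job-title">Perl Developer</h2>
<span class="salary">40,000</span>
<span class="location">London</span></div>"""

SAMPLE_B = (
    '<tr><td class="pos">Web Designer</td>'
    '<td class="loc">Leeds</td><td class="pay">30,000</td></tr>'
)

if __name__ == "__main__":
    print(extract_jobs("site_a", SAMPLE_A))
    print(extract_jobs("site_b", SAMPLE_B))
```

Both calls yield records with the same three keys, so whatever writes to the database does not need to know which site a record came from. Adding a site means adding one pattern, and a pattern will need updating whenever that site changes its markup.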

Simon


Posted By: joe_vimal ()
Posted On: 2005-Oct-26 16:02

Thanks, Simon. I was afraid I would have to start from the beginning. There are other issues too. Will I be infringing on some copyright law if the script scrapes a couple of lines from many sites?




Posted By: bhartzer (Staff)
Posted On: 2005-Oct-26 17:44

Will I be infringing on some copyright law

Yes.


Posted By: joe_vimal ()
Posted On: 2005-Oct-27 08:22

Thanks, bhartzer. I knew something like this would happen. OK. I have read somewhere that if you quote a couple of lines from another site on your own site and give appropriate credit, you will not be hauled up for violation of copyright. Is this true?

I am sorry I am asking this in a Perl forum.


Posted By: lizardz ()
Posted On: 2005-Oct-27 20:00

Use of a few lines is fair use, I believe; that's not copyright infringement.

That's why you can quote from somebody's article, for example, but not duplicate the whole article.


Posted By: excell (Staff)
Posted On: 2005-Oct-27 20:04

A scraper script is automation of the process of taking content... yes, I would be careful with what you create.


Posted By: joe_vimal ()
Posted On: 2005-Oct-28 06:43

No way, excell. I perfectly understand and abhor the stealing of content from others. But what I am interested in is this: we want to populate the database of a job site with enough job offers to make the site attractive to job seekers. Our client does not wish to infringe any laws and neither do we.

Scraping a line of content from other sites is perfectly acceptable if you don't overdo it. For example, for SEO purposes, many people scrape the results pages of search engines:

Results 1 - 100 of about 3,640,000 for 'keyword'

In the same way, we use snippets of information from weather sites too, usually with the express consent of the webmasters.

In our case, even a couple of lines might be frowned upon, as the snippet of information has some commercial value.

I am confused. We don't want to be associated with any route that could even remotely land us in trouble. Losing this client would be preferable to that. What is the consensus of the ladies and gentlemen here?