| Member |
Message |
Sinoed
Joined: Dec 11, 2000
# Posts: 5266
|
Posted: 2002-Feb-18 14:42
Just looking at some Google results which have piqued my curiosity. When you search for pretty much anything on Google, your results come back in a fraction of a second. Returning that information so quickly is quite an accomplishment; obviously there is some skill in getting everything to work so well together. Putting the actual hardware aside, how do you program a fast search engine? From a technical standpoint there must be certain features of the code that have to be avoided because they take up more time, like nested loops probably. Is there a particular language that is best for programming a search engine? What would your goals be when trying to max out your speed without crushing reliability?
|
 |
jcokos
Staff
Joined: Eons Ago
# Posts: 145
|
Posted: 2002-Feb-19 05:30
Remember, Google runs on a "Beowulf Cluster" of servers, with over 20,000 machines at last count, all working as a single "supercomputer". None of us could ever compete with that type of power.
A well-written Perl program, running under Apache with mod_perl, can return results in less than a second (e.g. www.hitsgalore.ws). Poorly written Perl programs can be hellaciously slow. Same goes for C... the general idea is that C is faster, but it really comes down to HOW the program was written, not necessarily the language. I can make Perl outperform C 3:1, and can just as easily do the opposite....
|
 |
jkcity
Joined: Mar 16, 2001
# Posts: 3230
|
Posted: 2002-Feb-19 05:36
"runs on a "Beowulf Cluster" of servers, with over 20,000 machines at last count, all working as a single "supercomputer"."
I would love one of those. I don't know what I would do with it, though, but it helps to explain why they are so fast.
|
 |
Sinoed
Joined: Dec 11, 2000
# Posts: 5266
|
Posted: 2002-Feb-18 23:43
Exactly. I couldn't put together a 20,000-unit supercomputer with a proprietary load-balancing system either. However, what I can do is program the software correctly so that results aren't 'hellaciously slow'. That said, what do you do? What do you avoid? (Other than figuring out when it's too late that you've created a mission-critical flaw which is crippling your system.)
|
 |
jcokos
Staff
Joined: Eons Ago
# Posts: 145
|
Posted: 2002-Feb-19 16:37
I try to look at things with "delusions of grandeur" .... in my test environments, I have all of ODP (2.x million links) loaded, and I simulate a load of 100,000 random searches per day (3,000,000 monthly). All on a single-CPU system with only 256 MB of RAM.
My goal is to ensure that a moderately configured system can handle a gigantic database, and a pretty good load, and still have an average return of < 1 second on any given search. How do you achieve this? It comes down to database design, number one, and indexing algorithms. You start with a database that's designed to hold a lot of data, but still be indexed and indexable for searching. A well-designed database can return 200 results from a 30 million keyword database in the same time it takes to return 1 result from a 2-record database. It all comes down to indexing, IMO. Starting with a solid DB plan, you can rely on the SQL engine to do its job efficiently, and your Perl or C++ merely has to make the query and spit out the results. John
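As a rough illustration of the indexing point above, here is a minimal Python sketch using the standard-library sqlite3 module. It is not how Hyperseek or any other product in this thread is built; the `postings` table, the index name, and the helper functions are all made up for the example. It shows the engine switching from a full table scan to an index lookup once the keyword column is indexed.

```python
import sqlite3


def build_db(rows, with_index=True):
    """Create an in-memory keyword table; rows are (keyword, link_id) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE postings (keyword TEXT NOT NULL, link_id INTEGER NOT NULL)"
    )
    conn.executemany("INSERT INTO postings VALUES (?, ?)", rows)
    if with_index:
        # Hypothetical index name. With it, the engine can seek straight to
        # a keyword instead of reading every row in the table.
        conn.execute("CREATE INDEX idx_postings_keyword ON postings (keyword)")
    conn.commit()
    return conn


def search(conn, keyword, limit=200):
    """Return up to `limit` link ids for an exact keyword match."""
    cur = conn.execute(
        "SELECT link_id FROM postings WHERE keyword = ? ORDER BY link_id LIMIT ?",
        (keyword, limit),
    )
    return [row[0] for row in cur]


def query_plan(conn, keyword):
    """Return SQLite's description of how it would run the keyword lookup."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT link_id FROM postings WHERE keyword = ?",
        (keyword,),
    ).fetchall()
    # The human-readable detail is the fourth column of each plan row.
    return " ".join(row[3] for row in rows)
```

Calling `query_plan` on a database built with and without the index is a quick way to check that a lookup is actually using it; the application code only issues the query, which matches the "let the SQL engine do its job" advice.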
|
 |
Sinoed
Joined: Dec 11, 2000
# Posts: 5266
|
Posted: 2002-Feb-20 00:52
How are you simulating a load of 100,000 searches? Did you write a program to do that too?
|
 |
jcokos
Staff
Joined: Eons Ago
# Posts: 145
|
Posted: 2002-Feb-20 05:31
I found a web torture tester program about 2 years ago, and modded it a little bit.
I can push it well beyond 100,000 if I want to, even simulate 100 or more simultaneous searches per second. Drop me an email, and I'll send you a copy. John
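The tool mentioned above is not described in the thread, so the following is only a sketch of the general idea in Python, not John's program. It fires randomly chosen queries at a search function from several threads and reports latencies. `run_load`, its parameters, and the `search_fn` callback are hypothetical names; in a real test `search_fn` would issue a request to the search engine being measured.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def run_load(search_fn, terms, total_requests, concurrency=10, seed=0):
    """Call search_fn with random terms from several threads and time each call."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # A seeded generator keeps the sequence of queries repeatable between runs.
    rng = random.Random(seed)
    queries = [rng.choice(terms) for _ in range(total_requests)]

    def timed(query):
        start = time.perf_counter()
        search_fn(query)
        return time.perf_counter() - start

    # The pool size caps how many searches are in flight at once.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed, queries))

    return {
        "requests": len(latencies),
        "mean_s": sum(latencies) / len(latencies),
        "max_s": latencies[-1],
    }
```

Comparing `mean_s` against a target such as the "< 1 second" goal mentioned earlier in the thread gives a simple pass/fail check for a given load level.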
|
 |
Sinoed
Joined: Dec 11, 2000
# Posts: 5266
|
Posted: 2002-Feb-21 02:56
Cool, that would be great. Have you got any recommendations (books or links) to share about writing efficient (a.k.a. fast) indexing algorithms?
|
 |
rally
Joined: Mar 01, 2002
# Posts: 8
|
Posted: 2002-Mar-01 21:37
Jcokos, is Python worth looking into for a search engine/spider? Also, do you think something like MySQL is capable of indexing, say, about 150 million pages, or would something like Oracle have to be used for that?
|
 |
jcokos
Staff
Joined: Eons Ago
# Posts: 145
|
Posted: 2002-Mar-02 16:54
I don't know a lot about Python, other than the fact that it's a completely "OOP" language. That in and of itself pushes the developer toward writing solid, stable, efficient code, which is the first requirement for handling tons of requests. The main reason that I love Perl so much is that there are numerous persistent pre-processors for it (SpeedyCGI, FastCGI, and mod_perl) that allow the programmer to plan for, and code, internal caching, database persistence, etc. I do hear great things about Python, but I don't know about its persistence options, which I think are key.
MySQL blows Oracle away. Period. I have a customer that runs both, and his MySQL version of our program (Jackhammer) can handle almost 3x the load. Searches are faster, connections are more stable. It's just a better system. The downside to MySQL is that it doesn't support all of the newer SQL things like procedures and triggers, but pound for pound, it kicks serious butt. We've got a few Jackhammer users that are serving upwards of 120 million searches per month, without a problem. One is using mod_perl, the other SpeedyCGI. The SpeedyCGI seems to be much better in terms of the load and memory usage. Hope that helps.
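To make the "internal caching, database persistence" idea concrete, here is a small Python sketch of what a long-lived process can do that a one-shot CGI script cannot. It is an assumption-laden illustration, not code from Jackhammer or Hyperseek: the `links` table, the example.com URLs, and `cached_search` are invented for the example, and sqlite3 stands in for whatever database is really used.

```python
import sqlite3
from functools import lru_cache

# Opened once when the process starts and reused by every request,
# instead of reconnecting on each hit.
_conn = sqlite3.connect(":memory:", check_same_thread=False)
_conn.execute("CREATE TABLE links (keyword TEXT, url TEXT)")
_conn.executemany(
    "INSERT INTO links VALUES (?, ?)",
    [
        ("perl", "http://example.com/perl"),
        ("python", "http://example.com/python"),
    ],
)
_conn.commit()


@lru_cache(maxsize=1024)
def cached_search(keyword):
    """Look up URLs for a keyword; repeat queries are answered from memory."""
    rows = _conn.execute(
        "SELECT url FROM links WHERE keyword = ? ORDER BY url", (keyword,)
    ).fetchall()
    # Return a tuple so the cached value cannot be mutated by callers.
    return tuple(row[0] for row in rows)
```

`cached_search.cache_info()` reports hits and misses, which helps judge whether the cache is earning its memory. A real deployment would also need to clear the cache when the underlying data changes.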
|
 |
rally
Joined: Mar 01, 2002
# Posts: 8
|
Posted: 2002-Mar-04 05:14
Thanks a bunch for that, John, really appreciated. As I am new to these forums I have been going through a lot of threads and posts in the last few days, and I came across this very interesting post of yours: http://searchengineforums.com/Forum27/HTML/000016.html Could you be kind enough to elaborate on Turbolinux and especially Zeus? These 2 are new to me, as I am used to Sun Solaris/Red Hat Linux. Do these 2 work in combination or just on their own? Also, what makes Zeus the 'Rolls-Royce' of server OSes? Thanks
|
 |
jcokos
Staff
Joined: Eons Ago
# Posts: 145
|
Posted: 2002-Mar-04 15:03
http://www.turbolinux.com/
http://www.zeus.com
The thing about Turbolinux that I love is that it's very tuned for speed, and very stripped down. Meaning, it's a TINY kernel, with no fluff, and nothing you don't need to run a website (like X, etc). Zeus is an Apache replacement that outperforms Apache by a 3:1 margin. It's costly, but if you're serious, very worth it. John
|
 |
rally
Joined: Mar 01, 2002
# Posts: 8
|
Posted: 2002-Mar-04 16:27
Wow, this Turbolinux and Zeus combination looks too good to be true. Turbolinux has apps that make Beowulf look antiquated. So where does a setup like this leave the likes of FastCGI and SpeedyCGI? Can these also be added to the Turbolinux/Zeus setup? Will Hyperseek and LinksSQL run on this setup? Thanks for the great info
|
 |
jcokos
Staff
Joined: Eons Ago
# Posts: 145
|
Posted: 2002-Mar-04 17:50
Zeus has FastCGI built into it, so that's handled for you. Zeus doesn't support mod_perl or SpeedyCGI, but with FastCGI built in, who cares?
Hyperseek is FastCGI ready (and mod_perl & SpeedyCGI ready), out of the box. Not sure about LinksSQL. I know that Alex loves mod_perl and LinksSQL runs under mod_perl quite well, but I don't know about the other speedup solutions; you'd have to ask him directly about them.
|
 |
rally
Joined: Mar 01, 2002
# Posts: 8
|
Posted: 2002-Mar-04 19:20
Aah, interesting that Zeus has FastCGI. Now, you say Zeus is 3:1 compared to Apache; how much of that is contributed by FastCGI? Is it really worth the extra investment for Zeus? Don't get me wrong, I would never skimp on the tools that you have mentioned so far. To be honest, in many of your posts you mention that the actual foundations should not be compromised, which I wholeheartedly agree with. Also, is Hyperseek OK with Turbolinux? Thanks again
|
 |
rally
Joined: Mar 01, 2002
# Posts: 8
|
Posted: 2002-Mar-04 19:29
Also, adding SpeedyCGI on top of Zeus/FastCGI would really be diminishing returns, correct?
|
 |
4dam W
Joined: Oct 11, 2001
# Posts: 727
|
Posted: 2002-Mar-04 22:13
jcokos (from your first post here), are you aware that the Perl interpreter is written in C? To suggest that Perl 'outperforms' C by 3:1 indicates you do not have efficient C code.
Perl is interpreted, C is compiled... and you can't beat compiled software. As for Google's speed, no doubt every piece of code was written by Google staff from the ground up, including their database software. I can't imagine Google using commercial (or open source) software there, especially with their unique use of hardware - namely an entire index stored in DRAM (rather than on hard drive) distributed amongst 10-20 thousand machines.
There's a lot of information out there on Google's setup thanks to interviews with Google's technical staff... it's worth a look.
|
 |
jcokos
Staff
Joined: Eons Ago
# Posts: 145
|
Posted: 2002-Mar-05 00:10
quote:
Perl is interpreted, C is compiled... and you can't beat compiled software.
Incorrect. Granted, Perl is interpreted, but when you write Perl programs for Apache with mod_perl/FastCGI/SpeedyCGI in mind, you can achieve 3:1 performance gains over C. Simply put, all of these technologies precompile the Perl program into memory, where it stays, persistently. (Each one does it a different way.) The benefit is that when someone from the browser hits one of these "sped up" programs, Apache doesn't need to load & compile the Perl as it would for a normal script, and unlike with a C CGI program, it doesn't even need to start the program at all. All of the startup overhead is gone, so your application is better than instant.....
Trust me, a simple "Hello World" written in Perl vs the same thing written in C runs 2x faster under mod_perl. If you make the application more complex, and throw in database connections, big queries, latency, etc, the gap widens, and Perl really kills C... Straight Perl vs C is no contest; C wins that hands down. I've done LOTS AND LOTS of benchmarking on this stuff....
As for Google, I think it was written in C, by a group of college kids as a class project, and then one thing led to another...
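The startup-overhead argument can be tried out with a small Python sketch. It does not reproduce the Perl vs C benchmarks mentioned above and makes no claim about their numbers; it only contrasts starting a fresh interpreter for every request (CGI-style) with calling code that is already loaded in a running process (the persistent style that mod_perl, FastCGI, and SpeedyCGI aim for). The function names are hypothetical.

```python
import subprocess
import sys
import time


def hello():
    return "Hello World"


def cold_request():
    """CGI-style: launch a brand-new interpreter process for one request."""
    result = subprocess.run(
        [sys.executable, "-c", "print('Hello World')"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def warm_request():
    """Persistent-style: the handler is already loaded in this process."""
    return hello()


def time_requests(handler, n):
    """Return the total wall-clock seconds taken by n sequential calls."""
    start = time.perf_counter()
    for _ in range(n):
        handler()
    return time.perf_counter() - start
```

Timing a handful of calls with `time_requests(cold_request, n)` and `time_requests(warm_request, n)` shows how much of a trivial request is pure process startup. The same reasoning applies to any language whose program stays resident between requests.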
|
 |
jcokos
Staff
Joined: Eons Ago
# Posts: 145
|
Posted: 2002-Mar-05 04:13
Rally,
Yes, Hyperseek runs perfectly with Turbolinux and Zeus (that's what I run on my development machines). Zeus claims a huge speed improvement over Apache for normal .html and images too. Not sure how they do it, or how to measure it at low levels, but it sure feels fast. John
|
 |
4dam W
Joined: Oct 11, 2001
# Posts: 727
|
Posted: 2002-Mar-05 04:43
jcokos, we are talking in two completely different contexts, which doesn't make for a real argument. (Mine was general, yours turned into something specific... nevertheless.)
A few months ago I read that Google was predominantly powered by software written in C++ ... ?
|
 |