The Finished Product: Web Crawler + Search Engine

After six long weeks of work, I finally wrapped up my CS 390 Web Crawler + Search Engine project. Although the course is only worth 1 credit hour, I must have spent at least 30 hours a week working on it. The best part: it was easy. After sitting through so much theory and so many mundane labs and projects, I could finally apply all the different things I’ve learned throughout my coursework.

Of the things I set out to accomplish, I achieved almost all of them, save a few database optimization features that I simply ran out of time to implement. My Web Crawler fetched and parsed roughly 800 URLs per minute, mainly thanks to the (painful) implementation of threading and breadth-first URL grabbing. The search engine itself used Java Servlets to perform the computations, with Ajax requests fetching the necessary information from them.
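The threaded, breadth-first crawl can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the class name `BfsCrawlerSketch`, the `fetchLinks` function (a stand-in for fetching a page and parsing out its links), and the `threads` and `maxPages` parameters are all assumptions made for the example. Each breadth-first level is fetched in parallel on a fixed thread pool, while a single thread owns the visited set.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class BfsCrawlerSketch {

    // Crawls breadth-first from seed and returns the URLs in discovery order.
    // fetchLinks is a hypothetical stand-in for "download the page and parse its links".
    static List<String> crawl(String seed, Function<String, List<String>> fetchLinks,
                              int threads, int maxPages) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Set<String> visited = new LinkedHashSet<>(); // only touched by the calling thread
        visited.add(seed);
        List<String> frontier = List.of(seed);
        try {
            while (!frontier.isEmpty()) {
                // Fetch every page of the current level in parallel.
                List<Future<List<String>>> pending = new ArrayList<>();
                for (String url : frontier) {
                    Callable<List<String>> task = () -> fetchLinks.apply(url);
                    pending.add(pool.submit(task));
                }
                // Collect results in submission order; unseen links form the next level.
                List<String> next = new ArrayList<>();
                for (Future<List<String>> result : pending) {
                    for (String link : result.get()) {
                        if (visited.size() < maxPages && visited.add(link)) {
                            next.add(link);
                        }
                    }
                }
                frontier = next;
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("crawl failed", e);
        } finally {
            pool.shutdown();
        }
        return new ArrayList<>(visited);
    }

    public static void main(String[] args) {
        // A tiny in-memory "web" so the sketch runs without network access.
        Map<String, List<String>> web = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("d"),
                "c", List.of("d", "a"),
                "d", List.of());
        System.out.println(crawl("a", url -> web.getOrDefault(url, List.of()), 4, 10));
    }
}
```

Processing one level at a time keeps the breadth-first order deterministic and avoids sharing the visited set between threads, at the cost of waiting for the slowest page in each level. A real crawler would more likely use a shared concurrent queue.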

The three servlets created were an autocomplete servlet for the user input, a URL list generator that ranks results for the user query with a simplified version of PageRank, and a special "person" servlet that returns information about professors. Ranking the results was probably the most difficult part of the project, as it took a few iterations to get it right. For example, in addition to incrementing a page's rank when another page pointed at it, I also checked whether the user query appeared in the title of the webpage, and artificially boosted the rank of faculty pages to ensure they were towards the top for a professor name search. Balancing these factors was very difficult, and some additional tweaks are probably needed. The search query suggestions and results were calculated in real time, the former using a Mealy machine (a finite-state automaton) to predict the user input, complete with state transitions and all. The result is a quick video I captured and uploaded to Instagram below. I will try to host this project on GitHub or a personal site, and will continue to work on improving the querying speed as well as handling more special cases.
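The ranking heuristics described above (inbound links, a title match, and a faculty boost) can be sketched as follows. This is only an illustration under stated assumptions, not the project's code: the `RankSketch` and `Page` names, the weights `TITLE_BONUS` and `FACULTY_BOOST`, and the choice to apply the faculty boost only when the title matches the query are all made up for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class RankSketch {
    static final double TITLE_BONUS = 5.0;    // assumed weight, not from the project
    static final double FACULTY_BOOST = 10.0; // assumed weight, not from the project

    // Hypothetical page record: URL, title, and whether it is a faculty page.
    static final class Page {
        final String url;
        final String title;
        final boolean faculty;

        Page(String url, String title, boolean faculty) {
            this.url = url;
            this.title = title;
            this.faculty = faculty;
        }
    }

    // Counts how many distinct other pages link to each URL.
    static Map<String, Integer> inboundCounts(Map<String, List<String>> links) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, List<String>> e : links.entrySet()) {
            for (String target : new HashSet<>(e.getValue())) {
                if (!target.equals(e.getKey())) {
                    counts.merge(target, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Base score is the inbound link count; a title match adds a bonus,
    // and a matching faculty page gets an extra boost (assumption).
    static double score(Page page, int inbound, String query) {
        double score = inbound;
        String q = query.trim().toLowerCase(Locale.ROOT);
        boolean titleMatch = !q.isEmpty()
                && page.title.toLowerCase(Locale.ROOT).contains(q);
        if (titleMatch) {
            score += TITLE_BONUS;
            if (page.faculty) {
                score += FACULTY_BOOST;
            }
        }
        return score;
    }

    // Returns the pages sorted by descending score; ties keep their input order.
    static List<Page> rank(List<Page> pages, Map<String, List<String>> links, String query) {
        Map<String, Integer> inbound = inboundCounts(links);
        Map<String, Double> scores = new HashMap<>();
        for (Page p : pages) {
            scores.put(p.url, score(p, inbound.getOrDefault(p.url, 0), query));
        }
        List<Page> ranked = new ArrayList<>(pages);
        ranked.sort((a, b) -> Double.compare(scores.get(b.url), scores.get(a.url)));
        return ranked;
    }
}
```

Counting inbound links once is much cruder than iterating PageRank to convergence, and the sketch only orders pages rather than filtering them by relevance. It is meant to show how the three signals combine into one score, which is where the balancing problem comes from.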

Nonetheless, I suggest that anyone at Purdue who wants to take a Computer Science course try to take one with Professor Rodriguez-Rivera, as his projects are tough but relevant in the real world, and they get you thinking and working at the fringes of your skillset. That goes especially for CS 390, which is offered in C++, Java, and Python for 8 weeks each.

