Finally, my years in the trenches of discrete math classes, data structures and algorithms, and general programming are being put to the test by a project I began working on this past week. Essentially, the project is to build a web crawler + search engine, with the only constraint being that it must be built in Java. That leaves a lot of freedom in both the algorithms used and how the system operates.
At the expense of my other course work, this project has more or less taken over my life, with my days consisting of a constant flurry of new ideas and tweaks to squeeze some extra performance out of different components. I’ve completed an initial version of the web crawler that is able to fetch, parse, and insert + rank URLs at a rate of ~340 per minute on a Mid-2011 MacBook Air (2 cores, 4 GB RAM).
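To make the fetch, parse, insert loop concrete, here is a minimal single-threaded sketch. It is not the project's actual code. The names `SimpleCrawler`, `extractLinks`, `crawl`, and `fetcher` are all made up for illustration. An in-memory `HashSet` stands in for the database existence check, a naive regex stands in for a real HTML parser, the fetcher is a plain function so the example runs without a network, and ranking is left out entirely.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SimpleCrawler {
    // Naive href matcher (assumption: double-quoted attributes only).
    private static final Pattern HREF =
            Pattern.compile("href=\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    // Parse step: pull every href value out of the document text.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Breadth-first crawl from seed. `fetcher` is a hypothetical stand-in for
    // an HTTP fetch: it maps a URL to its body, or null if the fetch failed.
    static List<String> crawl(String seed, Function<String, String> fetcher, int maxPages) {
        Set<String> seen = new HashSet<>();          // existence check
        Deque<String> frontier = new ArrayDeque<>(); // URLs waiting to be fetched
        List<String> visited = new ArrayList<>();    // successfully fetched, in order
        seen.add(seed);
        frontier.add(seed);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            String html = fetcher.apply(url);
            if (html == null) {
                continue; // fetch failed, skip this URL
            }
            visited.add(url);
            for (String link : extractLinks(html)) {
                if (seen.add(link)) { // add() returns true only for new URLs
                    frontier.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // A three-page fake web, so nothing touches the network.
        Map<String, String> fakeWeb = Map.of(
                "a", "<a href=\"b\">b</a> <a href=\"c\">c</a>",
                "b", "<a href=\"a\">a</a>",
                "c", "<a href=\"b\">b</a>");
        System.out.println(crawl("a", fakeWeb::get, 10)); // prints [a, b, c]
    }
}
```

In a loop like this each URL waits on the fetch, then the parse, then the existence check, one after the other, so throughput is bounded by the slowest stage.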
In the coming week, I hope to introduce threading to the crawler and see if I can gain some efficiency by overlapping the stages of fetching a URL, parsing the document for URLs and words, and the existence check / database insertion. The goal is to get to 1000 URLs per minute on this hardware before jumping into searching + JSP page creation. The ultimate goal is to implement a version of PageRank with some sort of live search (character by character, like Google does), but we’ll see how well the crawler goes on this hardware first. Time to brush up on automata and grammars!
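One way the threading idea could look, purely as a sketch under assumptions: a fixed thread pool runs one task per URL, a concurrent set handles the existence check, and a counter of in-flight tasks tells the caller when the crawl has drained. The class `ThreadedCrawler`, its methods, and the `fetcher` function are hypothetical names, and the in-memory set again stands in for the database.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ThreadedCrawler {
    // Naive href matcher (assumption: double-quoted attributes only).
    private static final Pattern HREF =
            Pattern.compile("href=\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    private final Set<String> seen = ConcurrentHashMap.newKeySet(); // thread-safe existence check
    private final AtomicInteger inFlight = new AtomicInteger();     // tasks submitted but not finished
    private final CountDownLatch done = new CountDownLatch(1);      // released when inFlight hits 0
    private final Function<String, String> fetcher;                 // hypothetical: URL -> body or null
    private final ExecutorService pool;

    ThreadedCrawler(Function<String, String> fetcher, int threads) {
        this.fetcher = fetcher;
        this.pool = Executors.newFixedThreadPool(threads);
    }

    // Crawls everything reachable from seed and returns all discovered URLs, sorted.
    Set<String> crawl(String seed) {
        submit(seed);
        try {
            done.await(); // block until every task has finished
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdown();
        }
        return new TreeSet<>(seen);
    }

    private void submit(String url) {
        if (!seen.add(url)) {
            return; // already discovered by some thread
        }
        inFlight.incrementAndGet(); // count the task before it can possibly finish
        pool.execute(() -> {
            try {
                String html = fetcher.apply(url);
                if (html == null) {
                    return; // fetch failed
                }
                Matcher m = HREF.matcher(html);
                while (m.find()) {
                    submit(m.group(1)); // children are counted before this task ends
                }
            } finally {
                if (inFlight.decrementAndGet() == 0) {
                    done.countDown();
                }
            }
        });
    }

    public static void main(String[] args) {
        Map<String, String> fakeWeb = Map.of(
                "a", "<a href=\"b\">b</a> <a href=\"c\">c</a>",
                "b", "<a href=\"a\">a</a>",
                "c", "<a href=\"b\">b</a>");
        System.out.println(new ThreadedCrawler(fakeWeb::get, 4).crawl("a")); // prints [a, b, c]
    }
}
```

Because fetching is mostly waiting on the network, a pool larger than the core count can still help. A real crawl would also need a page cap and per-host politeness delays, which this sketch omits.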
A side note about Java: I never took seriously a friend who used to harp on using Java for applets / programs such as this. But no matter how many newer solutions I’ve researched that were developed to replace Java, Java always seems to come through as the preferable choice, whether for security, scalability, or just plain efficiency (HR or complexity).