This is a two-part project that involved implementing the core functionality of a web crawler and building a search engine from the ground up, capable of handling thousands of documents and web pages.
This project began with the empty shell of a spacetime web crawler, to which I added the core scraping and multithreading functionality. The scraper function receives a URL and the corresponding web response, which is parsed so that information and any additional links leading to other pages can be extracted. By seeding the crawler with each new link found, we are able to scrape, store, and explore the web simultaneously. For the purposes of this project the crawler was limited to 5 domains within the UCI domain and found 10167 unique pages. My implementation of the crawler also made sure to:
Implementing the crawler taught me how to gather and store relevant data from the web so that it can be easily indexed and searched in the next part of this project. For the purposes of the crawler, the data was simply saved to a JSON file in the format shown below.
The search engine component of this project uses a smaller corpus of a couple thousand pages, compared to the 10167 unique pages that were previously crawled. The search engine contains two components: an indexer and the search engine itself. The indexer's job is to extract the relevant information from each page and build an inverted index for the corpus so that results can be quickly looked up and presented to the user. My indexer uses a sqlite3 database to index all the pages. The smaller corpus size and database-backed index keep query times fast (under 100 ms for queries with fewer than 10 tokens).
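A database-backed inverted index of this kind can be sketched as follows. This is a minimal illustration, not the project's actual schema: the table and column names (`postings`, `token`, `doc_id`, `tf`) are assumptions made for the example.

```python
# Sketch of an inverted index backed by sqlite3. Each row maps a token to a
# document that contains it, along with the token's frequency in that document.
import sqlite3

conn = sqlite3.connect(":memory:")  # the real index would live on disk
conn.execute("""
    CREATE TABLE IF NOT EXISTS postings (
        token  TEXT NOT NULL,
        doc_id INTEGER NOT NULL,
        tf     INTEGER NOT NULL,      -- term frequency in the document
        PRIMARY KEY (token, doc_id)
    )
""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_token ON postings(token)")

def index_document(doc_id, tokens):
    """Count token occurrences and upsert them into the postings table."""
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    conn.executemany(
        "INSERT OR REPLACE INTO postings (token, doc_id, tf) VALUES (?, ?, ?)",
        [(t, doc_id, tf) for t, tf in counts.items()],
    )
    conn.commit()

def lookup(token):
    """Return (doc_id, tf) pairs for a token, highest frequency first."""
    cur = conn.execute(
        "SELECT doc_id, tf FROM postings WHERE token = ? ORDER BY tf DESC",
        (token,),
    )
    return cur.fetchall()

index_document(1, ["search", "engine", "search"])
index_document(2, ["web", "crawler", "search"])
```

Because the primary key covers `(token, doc_id)` and a secondary index covers `token`, a single-token lookup is an indexed scan rather than a full-table scan, which is what keeps short queries fast even as the corpus grows.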
Below you can see how the indexer tokenizes each page using the NLTK and BeautifulSoup4 libraries. Alphanumeric sequences are filtered for meaningless stop words, and extra weight is given to important words appearing in bold or in headings. Porter stemming reduces words to their root/base form for better textual matching.
To improve the ranking performance and search quality of the search engine, I incorporated the following features:
To give the search engine a user interface, a Flask-based web server serves the front-end search functionality. When the user clicks search, a REST call is made to the running web server, and the response is displayed on screen. Watch a demo of the search engine running below.
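A Flask endpoint of this kind might look like the sketch below. The route name, the `q` query parameter, and the `search()` helper are assumptions for illustration; the project's actual endpoints and response shape may differ.

```python
# Minimal sketch of a Flask server behind a search UI. The front end would
# issue a REST call like GET /search?q=machine+learning and render the JSON
# response on screen.
from flask import Flask, jsonify, request

app = Flask(__name__)

def search(query):
    """Placeholder for the real index lookup; returns ranked results."""
    return [{"url": "http://example.com", "title": "Example", "query": query}]

@app.route("/search")
def handle_search():
    # Read the user's query from the request and return ranked results as JSON.
    query = request.args.get("q", "")
    return jsonify(results=search(query))

if __name__ == "__main__":
    app.run(port=5000)
```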