SearchEngine

This project is a search engine built with Java and Spring Boot that lets users retrieve relevant web pages by entering keywords. The first stage of the project collects the data; this phase uses multi-threading to improve collection efficiency and applies stemming, stopword removal, keyword-to-page mapping, and other techniques to save storage space. The second stage is the web search itself, consisting of a single-keyword search and a two-keyword search. The final page information is displayed via Spring Boot.

Quick Start

1. Download the code and run 'SearchEngineApplication'.
2. The first time you run the code, it needs some time to collect the data.
3. The data will be saved in data_table.ser.
4. After the data has been collected, go to localhost:8080.
5. Type a single word, or two words in one of the formats word1+word2, word1-word2, or word1!word2.

Some Details

Data Structures

(1). DataTable
The gathered data is stored in DataTable. It holds an index of type Map<String, Set<PageInfo>>; in other words, the index maps each keyword to the set of PageInfo objects for the pages containing it.
Using the index, we can look up the set of page metadata for a given keyword in O(1) time. It is also storage-efficient, as it stores only references to PageInfo objects rather than duplicated copies of the page data.
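
As a rough sketch (field and method names here are assumptions, not taken from the repository; PageInfo is described next), the index can be pictured like this:

```java
import java.io.Serializable;
import java.util.*;

// Illustrative sketch of the DataTable index; the real class likely differs in detail.
public class DataTable implements Serializable {
    // keyword -> set of pages containing that keyword
    private final Map<String, Set<PageInfo>> index = new HashMap<>();

    public void add(String keyword, PageInfo page) {
        // computeIfAbsent creates the set on first use; the sets hold references,
        // so a page shared by many keywords is stored only once
        index.computeIfAbsent(keyword, k -> new HashSet<>()).add(page);
    }

    public Set<PageInfo> lookup(String keyword) {
        // average O(1) hash lookup per keyword
        return index.getOrDefault(keyword, Collections.emptySet());
    }
}
```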

(2). PageInfo
PageInfo is an object that stores a website's information, namely its title and link. All keywords related to the website point to the same PageInfo object, which avoids storing the same website more than once and saves storage space.
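
A minimal sketch of such an object, again with assumed names; defining equality by link is one way to let the index's sets deduplicate pages, though the repository may do this differently:

```java
import java.io.Serializable;
import java.util.Objects;

// Illustrative PageInfo; field names are assumptions.
public class PageInfo implements Serializable {
    private final String title;
    private final String link;

    public PageInfo(String title, String link) {
        this.title = title;
        this.link = link;
    }

    public String getTitle() { return title; }
    public String getLink() { return link; }

    // Two PageInfo objects are considered the same page if they share a link,
    // so a page indexed under many keywords is never stored twice in a set.
    @Override
    public boolean equals(Object o) {
        return o instanceof PageInfo && ((PageInfo) o).link.equals(link);
    }

    @Override
    public int hashCode() { return Objects.hash(link); }
}
```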

(3). URL
A collection of URLs waiting to be processed. When a website's information is extracted, the links it contains are filtered and stored in the URL Pool for later processing. The URL Pool stores at most 10 URLs; its capacity is controlled by the variable U.

(4). PURL
A collection of already-processed URLs. Once a site's information has been extracted, the site is stored in the PURL Pool. The PURL Pool stores at most 100 URLs; its capacity is controlled by the variable V.
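
Because ten crawler threads share these pools, access to them must be synchronized. A bounded, thread-safe pool could look like the sketch below (structure and names are assumptions, not the repository's code):

```java
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative bounded URL pool shared by the crawler threads.
public class UrlPool {
    private final Set<String> urls = new LinkedHashSet<>(); // insertion order, no duplicates
    private final int capacity; // e.g. U = 10 for URL, V = 100 for PURL

    public UrlPool(int capacity) { this.capacity = capacity; }

    // Returns false if the pool is full or the URL is already present.
    public synchronized boolean offer(String url) {
        return urls.size() < capacity && urls.add(url);
    }

    // Removes and returns the next URL to process, or null if the pool is empty.
    public synchronized String poll() {
        Iterator<String> it = urls.iterator();
        if (!it.hasNext()) return null;
        String next = it.next();
        it.remove();
        return next;
    }

    public synchronized int size() { return urls.size(); }
}
```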

Gather Data

If the data file data_table.ser is not found, the program collects the web data and saves it to data_table.ser once the collection finishes. The specific steps of data collection are as follows:

  1. Before data collection begins, the program creates 10 threads that work together to improve collection efficiency. While the program runs, the working status and progress of the 10 threads are printed continuously until the work ends. The number of threads is controlled by the variable N_THREADS.

  2. Once the threads have been created, the first link, seedUrl, is defined and stored in URL.

  3. The program extracts various pieces of information from the web page at currUrl (the first link in URL) and stores them in different variables:
    (1). content: the HTML of the web page as a string.
    (2). title: the title of the web page.
    (3). text: all text content of the web page.
    (4). cleantext: all keywords of the web page.
    (5). newUrl: all links found in the web page.

  4. These variables are filtered and processed as they are extracted in step 3:
    (1). The variable webFilter lets the user filter out Chinese pages and garbled pages. When this feature is on, the page is discarded as soon as Chinese text or an excessively long title is detected in the content, and the program moves straight on to the next link in URL.
    (2). The title is cleaned up to remove unnecessary line breaks and excessive spaces.
    (3). The text is broken down into individual keywords and then filtered: it is stripped of all punctuation and split into separate words, the words are converted to lower case, and the variable stem controls whether each keyword is stemmed. A keyword is kept in cleantext only if it is alphabetic, not a duplicate, and not on the stopword blacklist (see the sketch after this list).

  5. After filtering, the variables are stored, provided there are no duplicates: the title, each keyword in cleantext, and currUrl are stored in DataTable as [keyword -> (title, currUrl)]; each link in newUrl is stored in URL; and currUrl is moved from URL to PURL.

  6. Steps 3-5 repeat until the number of links in PURL equals V (100).

  7. DataTable is saved to the data file data_table.ser.
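
A sketch of the keyword-cleaning pass from step 4(3); the stopword list and helper names here are illustrative assumptions, not the repository's own:

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class KeywordCleaner {
    // Hypothetical stopword blacklist; the project ships its own list.
    private static final Set<String> BLACKLIST = Set.of("the", "a", "an", "and", "of", "to");

    // Turns raw page text into the deduplicated keyword set ("cleantext").
    public static Set<String> clean(String text, boolean stem) {
        Set<String> cleantext = new LinkedHashSet<>(); // a set silently drops duplicates
        // strip punctuation, then split into separate words
        for (String word : text.replaceAll("\\p{Punct}", " ").split("\\s+")) {
            word = word.toLowerCase();
            if (!word.matches("[a-z]+")) continue;   // keep alphabetic words only
            if (stem) word = stemOf(word);           // optional stemming, controlled by `stem`
            if (!BLACKLIST.contains(word)) cleantext.add(word);
        }
        return cleantext;
    }

    // Placeholder for a real stemmer such as the Snowball English stemmer.
    private static String stemOf(String word) { return word; }
}
```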

Once the data has been collected, it can be loaded directly the next time the program is run. The search function then returns results to the user based on the collected keywords and web pages.
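
The load-or-crawl decision maps onto standard Java serialization; a minimal sketch, assuming a gather() method that runs the crawl described above (names are illustrative):

```java
import java.io.*;

public class DataLoader {
    private static final String DATA_FILE = "data_table.ser";

    // Loads the serialized DataTable if present; otherwise gathers and saves it.
    public static DataTable loadOrGather() {
        File file = new File(DATA_FILE);
        if (file.exists()) {
            try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
                return (DataTable) in.readObject();
            } catch (IOException | ClassNotFoundException e) {
                // corrupted file: fall through and re-gather
            }
        }
        DataTable table = gather(); // the multi-threaded crawl described above
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(table);
        } catch (IOException e) {
            e.printStackTrace();
        }
        return table;
    }

    // Hypothetical stand-in for steps 1-7 of Gather Data.
    private static DataTable gather() { return new DataTable(); }
}
```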

Search Website

The web server is implemented with the Spring Boot framework.

When the Controller bean is constructed, it tries to load the scraped data from the data file data_table.ser. If the data file is missing or corrupted, it re-runs the Gather Data step.

The main logic resides in the search method (endpoint /search). It first checks whether the input is a URL; if so, it simply redirects there. Otherwise, it performs the steps listed after the sketch below.
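
The URL check and redirect can be sketched in Spring MVC like this (the handler name, query parameter, and template name are assumptions, not the repository's code):

```java
import org.springframework.stereotype.Controller;
import org.springframework.ui.Model;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;

import java.util.List;

@Controller
public class SearchController {

    @GetMapping("/search")
    public String search(@RequestParam("q") String query, Model model) {
        // If the input is already a URL, redirect straight to it.
        if (query.startsWith("http://") || query.startsWith("https://")) {
            return "redirect:" + query;
        }
        // Otherwise run the keyword search (steps 1-4 below) and let the
        // Thymeleaf view engine render the results template.
        model.addAttribute("results", doSearch(query));
        return "results"; // hypothetical Thymeleaf template name
    }

    // Hypothetical stand-in for the keyword lookup against DataTable.
    private List<PageInfo> doSearch(String query) { return List.of(); }
}
```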

1. Determine the search keyword type, as specified in Section 1.2 of the Project Specification.

2. Perform keyword stemming. This improves the search results by matching all inflections of a keyword. The stemming function is provided by the third-party library Snowball Stemmer.

3. Retrieve the result set(s) from the data table index. For two-word searches, the set operations intersection, union, and difference implement the AND, OR, and exclusion matchings and produce the final result set (see the sketch after this list).

4. The result set is rendered with the HTML template by the Thymeleaf view engine. Each result consists of a title and a URL.
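
The set operations from step 3 map directly onto java.util.Set. A small sketch, leaving the exact mapping of the +, -, and ! operators to the repository:

```java
import java.util.HashSet;
import java.util.Set;

public class SetOps {
    // intersection: pages containing both words (AND matching)
    public static Set<PageInfo> intersection(Set<PageInfo> a, Set<PageInfo> b) {
        Set<PageInfo> result = new HashSet<>(a);
        result.retainAll(b);
        return result;
    }

    // union: pages containing either word (OR matching)
    public static Set<PageInfo> union(Set<PageInfo> a, Set<PageInfo> b) {
        Set<PageInfo> result = new HashSet<>(a);
        result.addAll(b);
        return result;
    }

    // difference: pages with the first word but not the second (exclusion matching)
    public static Set<PageInfo> except(Set<PageInfo> a, Set<PageInfo> b) {
        Set<PageInfo> result = new HashSet<>(a);
        result.removeAll(b);
        return result;
    }
}
```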

In addition, an API /raw-data is provided to fetch all of the gathered data.
When you open the website, it first displays search tips; you can then follow them to enter a search query.
