Page highlighting #109

Open
wants to merge 17 commits into
base: main
Binary file added CS410TeamWXYZ-ProgressReport.pdf
Binary file not shown.
Binary file added CS410TeamWXYZ-ProjectProposalSubmission.pdf
Binary file not shown.
Binary file added ProjectDocumentationAndUsage.pdf
Binary file not shown.
116 changes: 114 additions & 2 deletions README.md
@@ -1,3 +1,115 @@
# CourseProject
# CS410-Final-Project
A Chrome extension that indexes the current page so that users can search it using common retrieval functions.

Please fork this repository and paste the github link of your fork on Microsoft CMT. Detailed instructions are on Coursera under Week 1: Course Project Overview/Week 9 Activities.
[Software Documentation](#implementation)

[Usage](#usage)

[Demo Video](https://www.youtube.com/watch?v=agyJ-4IclAc "Demo Video")

# Overview

The objective of this project is to develop a Chrome extension for intelligent browsing. The extension indexes the current page so that users can search it with the OkapiBM25 ranking function from the metapy package. A frontend UI accepts a search string and displays the top 5 results to the user.
Text preprocessing, cleaning, indexing, and ranking are handled in the backend Flask API using the metapy package.

## Implementation
The software is implemented in two components:

### (1) Frontend – Chrome extension

The core UI is implemented in popout.html and popout.css, which contain the code for the search box and the results display. The extension is located in the chrome_intelli_search folder.
The backend.js file contains the event listeners that handle a search submission: on click, it extracts the text from the currently active tab. We use chrome.scripting.executeScript to run the text scraper in the active tab, POST the scraped text and search string to the backend Flask API with the JavaScript ‘fetch’ method, and dynamically render the returned JSON results. The frontend-backend communication and UI rendering rely only on basic HTML, CSS, JavaScript, and the default Chrome extension APIs.

### (2) Backend - Flask API server

The backend server is built with Flask-API, and the text indexing and ranking functionality uses the MetaPy package. All packages required to run the server are listed in the requirements.txt file. For the initial implementation, the server runs locally, one instance per user. The Flask-API server is located in the flaskapi folder.

The Flask-API implements a /search route that expects at least two parameters: the raw text from the HTML page (corpus) and the search string (search). Optionally, a ranker option can be passed to try other available rankers; for this project, OkapiBM25 is the default ranker.
The search engine is a three-stage pipeline: preprocessing, tokenization/stemming, and indexing/ranking.
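
As a rough illustration of the /search request contract described above, the sketch below builds and validates a request body on the client side. The helper name `buildSearchPayload` and its validation rules are assumptions made for this example; only the field names (corpus, search, ranker) come from the sample request in this README.

```javascript
// Hypothetical helper (not part of the project): builds the JSON body for POST /search.
// Field names follow the sample curl request; the validation rules are assumptions.
function buildSearchPayload(corpus, search, ranker) {
  if (typeof corpus !== "string" || corpus.trim() === "") {
    throw new Error("corpus (raw page text) is required");
  }
  if (typeof search !== "string" || search.trim() === "") {
    throw new Error("search (query string) is required");
  }
  const payload = { corpus, search };
  // The ranker is optional; when omitted, the server is expected to fall back to its default.
  if (ranker) {
    payload.ranker = ranker;
  }
  return JSON.stringify(payload);
}

const body = buildSearchPayload("Hello Inteli Searcher", "searcher");
console.log(body);
```

Validating before sending keeps obviously empty requests away from the server; whether the real server rejects such requests is not stated here.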

### Preprocessing Pipeline:
1. Split the text into lines
2. Discard lines with one word or fewer
3. Create a dataset with one sentence per line
4. Write the dataset to the dataset file named in the configuration file
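
The steps above can be sketched in a few lines. The project performs this stage in Python on the server; the JavaScript below is only an illustration, and the function names `preprocessCorpus` and `toDatasetText` are hypothetical.

```javascript
// Sketch of the preprocessing stage (illustrative only; the project does this in Python).
function preprocessCorpus(rawText) {
  return rawText
    .split(/\r?\n/) // 1. split the text into lines
    .map((line) => line.trim())
    // 2. keep only lines that contain more than one word
    .filter((line) => line.split(/\s+/).filter(Boolean).length > 1);
}

// 3-4. One document per line, joined into the text that would be written to the dataset file.
function toDatasetText(lines) {
  return lines.join("\n") + "\n";
}

const lines = preprocessCorpus("Menu\nHello Inteli Searcher\n\n  Home \nTwo words");
console.log(lines);
```

Dropping one-word lines removes most navigation labels and button text from the scraped page before indexing.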

### Tokenization/Stop Word removal/Stemming/N-Grams
We use MetaPy’s API to further preprocess the documents as follows:
1. ICUTokenizer to tokenize the raw text and provide an iterable over the individual tokens.
2. Lowercase to convert all tokens to lowercase.
3. Porter2 Stemmer to stem the tokens, reducing inflected words to a common root.
4. Ngram-word-1 to produce a “bag of words” (unigram word count) representation.
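
To make the output of this chain concrete, here is a minimal sketch of tokenization, lowercasing, and unigram counting. It is not MetaPy: a simple regular-expression tokenizer stands in for ICUTokenizer, the Porter2 stemming step is deliberately left out, and the function names are hypothetical.

```javascript
// Sketch only: a regex tokenizer stands in for ICUTokenizer, and stemming is omitted.
function tokenize(text) {
  // Split on any run of characters that are not letters or digits (Unicode-aware).
  return text.split(/[^\p{L}\p{N}]+/u).filter(Boolean);
}

// Build a "bag of words": a map from lowercase term to its count in the text.
function unigramCounts(text) {
  const counts = new Map();
  for (const token of tokenize(text)) {
    const term = token.toLowerCase();
    counts.set(term, (counts.get(term) || 0) + 1);
  }
  return counts;
}

const counts = unigramCounts("Search the page. The page is indexed.");
console.log(Object.fromEntries(counts));
```

With stemming added, terms such as "indexed" and "indexing" would collapse to the same entry, which is what lets a query match inflected forms.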

### Indexing and Ranking

An inverted index is created using metapy’s make_inverted_index function, and a BM25 ranker is instantiated with the parameters k1=2.4, b=0.75, and k3=400. A query document is created from the search string, and the ranker scores the indexed documents and returns the top 5 results.
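
For readers unfamiliar with BM25, the sketch below scores pre-tokenized documents against a query using a common textbook form of Okapi BM25 with the same parameter values. It is an assumption-laden illustration: metapy's exact formula and index structures may differ, and the function name `bm25Rank` is hypothetical.

```javascript
// Sketch of Okapi BM25 ranking over pre-tokenized documents (arrays of terms).
// A common textbook variant; metapy's exact formula may differ.
function bm25Rank(docs, queryTerms, { k1 = 2.4, b = 0.75, k3 = 400, topK = 5 } = {}) {
  const N = docs.length;
  const avgLen = docs.reduce((sum, d) => sum + d.length, 0) / N;

  // Document frequency: number of documents containing each term.
  const df = new Map();
  for (const doc of docs) {
    for (const term of new Set(doc)) {
      df.set(term, (df.get(term) || 0) + 1);
    }
  }

  // Query term frequencies.
  const qtf = new Map();
  for (const term of queryTerms) {
    qtf.set(term, (qtf.get(term) || 0) + 1);
  }

  const scored = docs.map((doc, index) => {
    let score = 0;
    for (const [term, qf] of qtf) {
      const tf = doc.filter((t) => t === term).length;
      if (tf === 0) continue;
      const n = df.get(term);
      const idf = Math.log(1 + (N - n + 0.5) / (n + 0.5));
      // Term-frequency saturation (k1) with document-length normalization (b).
      const docPart = ((k1 + 1) * tf) / (k1 * (1 - b + b * (doc.length / avgLen)) + tf);
      // Query-term-frequency saturation (k3).
      const queryPart = ((k3 + 1) * qf) / (k3 + qf);
      score += idf * docPart * queryPart;
    }
    return { index, score };
  });

  // Keep only matching documents, best score first, limited to the top K.
  return scored
    .filter((r) => r.score > 0)
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

const docs = [
  ["hello", "inteli", "searcher"],
  ["another", "line", "here"],
  ["searcher", "searcher", "tips"],
];
console.log(bm25Rank(docs, ["searcher"]).map((r) => r.index));
```

In this example the third document ranks first because it contains the query term twice at the same length, and the second document is dropped because it does not match at all.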

# Project Environments:
1. Python 3.7, Metapy, Flask-API
2. Chrome


### Testing the Backend:
#### Sample Server Request/Response
```
$:~/code/admin/ill/CS410Text/CS410-CourseProject-Team-WXYZ/flaskapi
$ curl -X POST -H "Content-Type: application/json" -d '{"corpus":"Hello Inteli Searcher", "search": "searcher", "ranker": "test"}' -i http://127.0.0.1:5000/search
HTTP/1.0 200 OK
Content-Type: application/json
Content-Length: 45
Access-Control-Allow-Origin: *
Access-Control-Allow-Headers: Content-Type,Authorization
Access-Control-Allow-Methods: GET,PUT,POST,DELETE,OPTIONS
Access-Control-Allow-Credentials: true
Server: Werkzeug/1.0.0 Python/3.7.6
Date: Fri, 09 Dec 2022 04:30:45 GMT
```

```
{"search_results":["Hello Inteli Searcher"]}
```

# Usage:
#### Requirements: Python Version 3.7

### 1. Run Flask-API server
```
$ cd flaskapi

# Optional: set up a Python 3.7 environment in conda.

$ conda create -n testenv python=3.7
$ conda activate testenv

# This project was tested on Python 3.7. Please use a Python 3.7 environment.

$ pip install -r requirements.txt
$ python app.py
```

![Server](./images/app_server.png "")

Testing the API:

```
$ curl -X POST -H "Content-Type: application/json" -d '{"corpus":"Hello Inteli Searcher", "search": "searcher", "ranker": "test"}' -i http://127.0.0.1:5000/search
```
![API Image](./images/api_test.png "")


### 2. Install Chrome Extension


![Chrome](./images/install_extension.png "")


### 3. Browse, Search & Results


![Browse](./images/browse.png "")
Clicking a search result highlights the matching items on the page. Highlights are not removed until the page is refreshed.


### Contribution of Team:

All team members participated in all of the backend and frontend work. Many hours were spent learning the technology and getting up to speed on tasks such as developing Chrome extensions and developing with Flask-API.
90 changes: 90 additions & 0 deletions chrome_intelli_search/backend.js
@@ -0,0 +1,90 @@
// Get all the UI objects for manipulation
let searchButton = document.getElementById("search");
let inputSearch = document.getElementById("search-box-input");
let resultsContainer = document.getElementById("searchResultsContainer");
let ranker = document.getElementById("documentTypeDropdown");
let resultsList = document.getElementById("search-results");
let lastItem = '';

// Add click event listener for the search button
searchButton.addEventListener("click", async () => {
  let [tab] = await chrome.tabs.query({ active: true, currentWindow: true });

  // Use the Chrome extension scripting API to scrape the page text, then send
  // the text and the search string to the backend API for searching.
  // executeScript runs searchText in the context of the active tab and returns
  // the text of the HTML document. On success, fetchData retrieves the ranked results.
  chrome.scripting.executeScript({
    target: { tabId: tab.id },
    func: searchText,
  },
  (response) => {
    fetchData(response[0].result, inputSearch.value, ranker.value).then((data) => {
      searchButton.innerText = 'Search';
      resultsList.innerHTML = '';
      // Construct the results list
      for (const item of data.search_results) {
        const li = document.createElement("li");
        li.innerText = item;
        // Clicking a result highlights the matching elements in the page
        li.addEventListener("click", async () => {
          chrome.scripting.executeScript({
            target: { tabId: tab.id },
            func: elemsContainingText,
            args: [item]
          });
        });
        resultsList.appendChild(li);
      }
      let x = JSON.stringify(data.search_results);
      console.log(x);
    });
  });
});

// Fetch the ranked results from the backend.
// Currently the server runs on the local machine.
async function fetchData(corpus, search, ranker) {
  searchButton.innerHTML = '<i class="fa fa-refresh fa-spin"></i>Searching...';
  const url = "http://127.0.0.1:5000/search";
  // Declared with const so it does not leak as an implicit global
  const data = {
    "corpus": corpus,
    "search": search,
    "ranker": ranker
  };

  let results = await fetch(url, {
    method: 'POST',
    cache: 'no-cache',
    //mode: 'no-cors',
    headers: {
      'Content-Type': 'application/json'
    },
    body: JSON.stringify(data)
  }).then(response => response.json())
    .then(json => {
      console.log(JSON.stringify(json));
      return json;
    });

  return results;
}

// The body of this function is executed as a content script inside the
// current page to extract the text of the current tab for searching.
function searchText() {
  return document.body.innerText;
}

// Executed inside the current page: highlights every candidate element
// whose text contains the selected search result.
function elemsContainingText(item) {
  console.log('item', item);
  let elementList = [...document.querySelectorAll("p,h1,h2,h3,h4,h5,h6,li,span")];
  console.log(elementList);
  for (let el of elementList) {
    if (el.innerText.includes(item)) {
      console.log(el);
      el.style.backgroundColor = "yellow";
    }
  }
  return;
}
6 changes: 6 additions & 0 deletions chrome_intelli_search/background.js
@@ -0,0 +1,6 @@
let color = '#3aa757';

chrome.runtime.onInstalled.addListener(() => {
  chrome.storage.sync.set({ color });
  console.log('Default background color set to %cgreen', `color: ${color}`);
});
Binary file added chrome_intelli_search/images/get_started128.png
Binary file added chrome_intelli_search/images/get_started16.png
Binary file added chrome_intelli_search/images/get_started32.png
Binary file added chrome_intelli_search/images/get_started48.png
Empty file.
26 changes: 26 additions & 0 deletions chrome_intelli_search/manifest.json
@@ -0,0 +1,26 @@
{
  "name": "CS410 Final Project Extension",
  "description": "This extension does some thingies",
  "version": "1.0",
  "manifest_version": 3,
  "background": {
    "service_worker": "background.js"
  },
  "permissions": ["storage", "activeTab", "scripting", "tabs"],
  "action": {
    "default_popup": "popup.html",
    "default_icon": {
      "16": "/images/get_started16.png",
      "32": "/images/get_started32.png",
      "48": "/images/get_started48.png",
      "128": "/images/get_started128.png"
    }
  },
  "icons": {
    "16": "/images/get_started16.png",
    "32": "/images/get_started32.png",
    "48": "/images/get_started48.png",
    "128": "/images/get_started128.png"
  },
  "options_page": "options.html"
}
38 changes: 38 additions & 0 deletions chrome_intelli_search/node_modules/.package-lock.json


21 changes: 21 additions & 0 deletions chrome_intelli_search/node_modules/@types/chrome/LICENSE


16 changes: 16 additions & 0 deletions chrome_intelli_search/node_modules/@types/chrome/README.md
