CS 212 Software Development

CS 212-01, CS 212-02 • Fall 2020

Project 4 Search Engine


For this project, you will extend your previous project to include a search engine web interface, using embedded Jetty and servlets to search the inverted index built by your web crawler.

This writeup is for the search engine functionality only. See the general Project 4 Writeup for more details.

Functionality

The functionality for this project is broken into 2 parts: core functionality and extra features. You must complete the core functionality before attempting extra functionality.

Core Functionality

In addition to maintaining the functionality of the previous project, your code must support the following core features using embedded Jetty and servlets for a total of 40 points:

  • Search (20 points): Display a webpage with a text box where users may enter a multi-word search query and click a button that submits that query to a servlet in your search engine.

  • Results (20 points): Upon receiving a multi-word search query, the servlet in your search engine should perform a partial search from an inverted index generated by your web crawler and return a dynamically generated HTML page with sorted and clickable links to the search results.
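
As a rough illustration of the Results feature, the sketch below turns an already-sorted list of result URLs into an HTML fragment with clickable links. The class name ResultPage and its methods are hypothetical, not part of any provided code, and the sketch assumes your search code has already produced the sorted list. It uses only the Java standard library so it can be tried outside of Jetty.

```java
import java.util.List;

/** Hypothetical helper that builds the HTML for a list of search results. */
class ResultPage {
    /** Escapes the characters that are significant in HTML text and attribute values. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&#39;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    /** Builds an ordered list of links. Assumes urls is already sorted by rank. */
    static String render(String query, List<String> urls) {
        StringBuilder html = new StringBuilder();
        // The query comes from the user, so it is escaped before being echoed back.
        html.append("<p>Results for: ").append(escape(query)).append("</p>\n");
        if (urls.isEmpty()) {
            html.append("<p>No results found.</p>\n");
            return html.toString();
        }
        html.append("<ol>\n");
        for (String url : urls) {
            String safe = escape(url);
            html.append("  <li><a href=\"").append(safe).append("\">")
                .append(safe).append("</a></li>\n");
        }
        html.append("</ol>\n");
        return html.toString();
    }
}
```

Inside a servlet, a fragment like this would typically be written to the writer returned by HttpServletResponse.getWriter() after setting the content type to text/html. How you structure the rest of the page is up to you.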

You should not begin working on extra features until the core functionality is working properly.

Additional Functionality

Once the core functionality is complete, you must implement 60 points worth of additional features. These features are broken into three categories: user tracking, database support, and other functionality. You may choose any combination of features from these categories.

  • User Tracking: The following features require you to use session tracking and/or cookies to store per-user information. Features that store data in memory instead of using session tracking will NOT receive full credit (15% deduction per feature).

    • Search History (10 points): Store a history of all search queries entered by a user. Allow the user to view and clear that history.

    • Visited Results (10 points): Store a history of all search results visited by a user. Allow the user to view and clear that history.

      Hint: Modify the search result links to direct back to your search engine, so that you may first store that the link was visited and then redirect the user to the link selected.

    • Favorite Results (10 points): Allow a user to save favorite search results, and allow the user to view and clear those favorites.

      Hint: Add a special link to each result that saves it as a favorite, but consider how to do this in the least disruptive way for the user.

    • Time Stamps (5 points): Add timestamps to each item stored per user. For example, add timestamps to the user’s search history. You are expected to implement this for all related features to earn full credit. For example, if you implement search history and visited results, timestamps should be added to BOTH features for full credit.

    • Private Search (5 points): Allow users to set an option that turns off all tracking of per-user data. You are expected to implement this for all related features to earn full credit. For example, if you implement search history and visited results, tracking should be turned off for BOTH features for full credit.

    • Last Login Time (5 points): Track and display the last time the user visited your search engine.

  • Database Support: The following features require you to connect your search engine to a database over JDBC. Features that store data in memory instead of in a database will NOT receive full credit (25% deduction).

    • Page Snippets (15 points): During a crawl, store short snippets of every webpage found in a database and display these snippets whenever that page is returned as a result.

    • Last Crawled (5 points): This feature requires you implement the “Page Snippets” feature. When crawling pages and storing page snippets, also store a timestamp of when that page was crawled in the database. Whenever that page and snippet is returned as a result, display the crawled date as well.

    • Popular Queries, Database Edition (15-20 points): Every time a search is conducted, parse/clean/optimize that query and store the number of times that query has been searched for in a database. Allow users to see the top 5 most popular queries on your search page.

      The base functionality of this feature is worth 15 points. You can earn an additional 5 points if you make those queries clickable such that when clicked, the results for that query are displayed (i.e. clicking a popular query conducts a search for that query).

      Note: You cannot implement both this and the non-database “Suggested Queries” feature.

    • Most Visited and/or Favorited Results (10-15 points): This feature requires you implement at least one of the “Visited Results” or “Favorite Results” features. When storing the visit or favorite in the user session, also increment the number of times that page has been visited or favorited in a database. When displaying the user’s visit or favorite history, also show the top 5 visited or favorited pages overall from the database.

      If you implement this for just one feature (visited -or- favorited results, but not both) it is worth 10 points. If you implement this for both the visited and favorited results features, it is worth 15 points.

    • Reset Database (5 points): This feature requires you implement at least one database feature. Allow users with an administrator password to clear all the tables in the database associated with your search engine.

  • Other Functionality: The following features allow you to customize the functionality of your search engine.

    • New Crawl (5 to 15 points): Allow a user to add new URLs to the inverted index. Specifically:

      • New Seed (5 points): Allow a user to enter a new seed URL that should be added to the existing inverted index. If the URL has already been crawled, skip crawling that URL and output a warning to the user.

      • Max Support (10 points): In addition to entering a new seed URL, allow the user to also specify a maximum number of pages to crawl. This will represent the maximum number of new pages to crawl in addition to the pages already crawled. URLs that are already included in the inverted index should be skipped and should not contribute to this maximum count.

    • Suggested Queries (10 points): Provide users five suggested queries based on either the latest queries made by other users -or- the most popular queries made by other users.

    • Graceful Shutdown (10 points): Allow an administrator to trigger a graceful shutdown of your search engine without calling System.exit(). You will need to create a special servlet for this feature.

    • Index Browser (5 points): Allow users to browse your inverted index as an HTML page with clickable links to all of the indexed URLs.

    • Location Browser (5 points): Allow users to browse all of the locations and their word counts stored by your inverted index as an HTML page with clickable links to all of the indexed URLs.

    • I’m Feeling Lucky Button (5 points): Add a new button to your search page (in addition to the normal search button) that automatically redirects the user to the first search result instead of listing all of the search results. This is similar to the “I’m Feeling Lucky” button that Google Search used to include on its page. You have to consider what to do if there are no search results!

    • Page Statistics (5 points): In addition to providing a clickable link for each search result (i.e. web page), display the page title (via the <title> tag in HTML), word count, and content length (via the Content-Length HTTP header). The title and content length can be stored in memory by your web crawler (no database connectivity required); the word count is already stored by your inverted index.

    • Search Statistics (5 points): Display the total number of results along with the time it took to calculate and fetch those results, and display the score and number of matches per search result listed.

    • Reverse Sort Order (5 points): Allow the user to select an option to reverse the sort order of the search results.

    • Partial Search Toggle (5 points): Allow the user to toggle on/off partial versus exact search.

    • Web Framework (5 points): Design a search engine using any popular CSS/style framework to create a consistent style for all the web pages. For example, consider using Bulma, Bootstrap (Twitter), Pure.css, Material (Google), Semantic UI, and many more.

    • Search Brand (5 points): Design a search engine with a distinct brand, logo, and tagline. This includes creating a logo and tagline, and including it on all of the web pages. Do not use unlicensed unattributed media on your website.
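
The hint for the Visited Results feature suggests routing result links back through your own search engine. One possible way to build such links is sketched below. The /visit path, the url parameter name, and the VisitLinks class are all assumptions made for illustration; choose whatever names fit your design.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for links that pass through a tracking servlet. */
class VisitLinks {
    /** Builds a link to an assumed /visit servlet that carries the real target URL. */
    static String trackingLink(String target) {
        // Encoding keeps characters like & and ? in the target from
        // being misread as part of the tracking link's own query string.
        return "/visit?url=" + URLEncoder.encode(target, StandardCharsets.UTF_8);
    }

    /** Decodes a raw, still-encoded parameter value back into the target URL. */
    static String decodeTarget(String encoded) {
        return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
    }
}
```

In a servlet, HttpServletRequest.getParameter() already returns the decoded value, so decodeTarget is only needed when working with the raw query string. The servlet would then record the visit in the user's session and call HttpServletResponse.sendRedirect() with the target. Consider redirecting only to URLs that actually appear in your index, so the servlet cannot be used to redirect users to arbitrary sites.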

Have a feature idea? You can propose an extra feature in a public post on Piazza. If approved, the instructor will post the number of points that feature will be worth on the final project.

Potential Deductions

There will be a brief review of your code for this project. If your code has any of the following issues, points may be deducted from your project grade:

  • Cross-Site Scripting (XSS) Vulnerabilities (-5 points): Your grade will lose up to 5 points if your code does not protect against cross-site scripting (XSS) attacks. Your code must escape or sanitize any data it uses from a user (either via the HTTP request or a database) prior to using it on an HTML page.

  • SQL Injection Vulnerabilities (-5 points): Your grade will lose up to 5 points if your code does not protect against SQL injection attacks. (This only applies to search engines using a database.) Specifically, your code must use prepared statements anytime it stores data in a database.

  • Poor Multithreading/Multiuser Support (-5 points): Your grade will lose up to 5 points if your code does not fully support multithreading and multiple users. For example, suppose two users conduct a search simultaneously. The second user should not see the search results for the first user.

  • Poor Encapsulation (-5 points): Your grade will lose up to 5 points if your code breaks encapsulation.

  • Poor Code Style (-5 points): Your grade will lose up to 5 points if your code is not professionally styled and documented. This includes the formatting, variable names, Javadoc, warnings, and exception handling.

While there are many ways to lose points, the total possible deduction is capped such that no more than 10 points total will be removed from your project grade due to the above issues.
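
Regarding the multithreading/multiuser deduction: a servlet instance is usually shared by many request threads, so per-request data (such as one user's search results) is safest in local variables, while data that really is shared across users needs a thread-safe structure. The sketch below shows one possible approach for shared query counts, such as those behind a Suggested Queries feature. The QueryCounter class is hypothetical and is only one of several reasonable designs.

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

/** Hypothetical thread-safe counter for queries shared across all users. */
class QueryCounter {
    private final ConcurrentHashMap<String, Integer> counts = new ConcurrentHashMap<>();

    /** Atomically increments the count for a query. */
    void record(String query) {
        counts.merge(query, 1, Integer::sum);
    }

    /** Returns how many times the query has been recorded so far. */
    int count(String query) {
        return counts.getOrDefault(query, 0);
    }

    /** Returns up to limit queries, most frequent first, ties broken alphabetically. */
    List<String> top(int limit) {
        return counts.entrySet().stream()
            .sorted((a, b) -> {
                int byCount = Integer.compare(b.getValue(), a.getValue());
                return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
            })
            .limit(limit)
            .map(entry -> entry.getKey())
            .collect(Collectors.toList());
    }
}
```

Using merge avoids the lost updates that a separate get followed by put could cause when two requests record the same query at the same time.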

Extra Credit Functionality

You may complete extra functionality to earn extra credit in the project category. See the primary Project 4 writeup for details.

Input

Your main method must be placed in a class named Driver. The Driver class should accept the following additional command-line arguments:

  • -server port where -server indicates a search engine web server should be launched and the next argument port is the port the web server should use to accept socket connections. Use 8080 as the default value if it is not provided.

    If the -server flag is provided, your code should enable multithreading with the default number of worker threads even if the -threads flag is not provided.

The command-line flag/value pairs may be provided in any order, and that order does not determine the order in which you should perform the operations (i.e. always build the index before performing search, even if the flags are provided in the other order).
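
A minimal sketch of handling the -server flag is shown below. The ServerArgs class is hypothetical, and falling back to 8080 when the value is not a valid integer is an assumption of this sketch; the requirement above only specifies the default for a missing value. If you already have an argument parser from earlier projects, reusing it is likely the better choice.

```java
/** Hypothetical helper for reading the -server flag from the command line. */
class ServerArgs {
    static final int DEFAULT_PORT = 8080;

    /**
     * Returns -1 if -server is absent. Otherwise returns the port that follows
     * the flag, or DEFAULT_PORT if the value is missing or not an integer.
     */
    static int parsePort(String[] args) {
        for (int i = 0; i < args.length; i++) {
            if (args[i].equals("-server")) {
                // A following argument that starts with "-" is treated as the next flag.
                if (i + 1 < args.length && !args[i + 1].startsWith("-")) {
                    try {
                        return Integer.parseInt(args[i + 1]);
                    } catch (NumberFormatException e) {
                        return DEFAULT_PORT; // assumption: fall back on invalid input
                    }
                }
                return DEFAULT_PORT;
            }
        }
        return -1;
    }
}
```

Because the method scans the whole array, the position of -server relative to the other flags does not matter. Driver can parse every flag first and then run the operations in the required order.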

Your code should support all of the command-line arguments from the previous project as well.

Output

The majority of the output for this project will be in the form of HTTP responses to a browser. Only output the inverted index or search results to a file if the necessary flags are provided.

Testing

No tests will be provided for this project. Instead, you will demonstrate your search engine functionality to the instructor during your final code review appointment during finals week.