Project: PDF Summarizer

Objective

Your manager has asked you to build an application that summarizes PDF documents. Because the budget is tight, you must keep the operational cost of the application low, even if that means the performance is not the best.

PDF document sizes are not fixed and can be up to a few MB. The LLM your organization uses has a context window size of only 32 KB, which cannot fit the entire content of a large PDF document in a single prompt. There are two options that can address this challenge:

  1. Use an LLM with a larger context window
  2. Build the summary incrementally (multiple calls to the LLM)

Since budget is an issue, you have decided to go with option 2.

Check out the application (requires a HuggingFace API token)

acloudfan/pdf-summarizer

Use this as a sample PDF

Application flow

The illustration below describes the flow.

  1. User will provide the link to the PDF
  2. Application will load the PDF
  3. PDF pages will be combined to form chunks (size < context window size)
  4. LLM will be called repeatedly to create the summary incrementally

project-summarizer-flow
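
Step 3 of the flow (combining pages into chunks) can be sketched as follows. This is a minimal illustration, not the code from the application: the function name `build_chunks` and the `max_chars` parameter are hypothetical, and character count is used as a simple stand-in for the size limit imposed by the context window.

```python
from typing import List


def build_chunks(pages: List[str], max_chars: int) -> List[str]:
    """Greedily pack consecutive pages into chunks of at most max_chars characters.

    Assumption: max_chars is chosen to be smaller than the context window so
    that the prompt text and the partial summary also fit.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")

    chunks: List[str] = []
    current = ""
    for page in pages:
        if not page.strip():
            continue  # skip pages with no extractable text
        # A page longer than the limit is split into fixed-size pieces.
        for start in range(0, len(page), max_chars):
            piece = page[start:start + max_chars]
            candidate = piece if not current else current + "\n" + piece
            if len(candidate) <= max_chars:
                current = candidate      # the piece still fits in the open chunk
            else:
                chunks.append(current)   # close the open chunk
                current = piece          # start a new chunk with this piece
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk holds one or more whole pages where possible, and no chunk exceeds `max_chars`.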

Incremental summarization

A typical PDF can range from hundreds of KB to a few MB, so a complete PDF cannot be accommodated in the input to an LLM. For example, a 100 KB PDF document cannot be sent in its entirety for summarization to the Mistral-7B-Instruct-v0.2 LLM, as its context window size is 32 KB.
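
To get a rough feel for how many LLM calls a document needs, the arithmetic can be sketched as below. The function name and the 4 KB reserved for the prompt and the partial summary are assumptions made for illustration; the sketch also treats the document size as the size of its text, which is a simplification.

```python
import math

KB = 1024


def estimate_llm_calls(document_bytes: int, window_bytes: int, reserved_bytes: int) -> int:
    """Estimate the number of chunks (and therefore LLM calls) for a document.

    reserved_bytes is the part of the context window assumed to be taken up
    by the instructions and the partial summary (hypothetical value).
    """
    usable = window_bytes - reserved_bytes
    if usable <= 0:
        raise ValueError("reserved_bytes must be smaller than window_bytes")
    return math.ceil(document_bytes / usable)


# 100 KB document, 32 KB window, 4 KB reserved: 28 KB usable per call
calls = estimate_llm_calls(100 * KB, 32 * KB, 4 * KB)
print(calls)  # 4
```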

To address the context window size challenge, you will use an incremental summarization process. This process involves:

  1. Loading the PDF as a set of pages
  2. Creating chunks, each composed of 1 or more pages. The number of pages in a chunk is determined by the context window size of the LLM
  3. Creating the summary from the first chunk
  4. Iteratively updating the summary using the partial summary and one chunk at a time

pdf-summarizer-chunking
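
Steps 3 and 4 above can be sketched as a simple loop. This is an illustrative outline under stated assumptions rather than the notebook's code: `summarize_incrementally` is a hypothetical name, the prompt wording is made up, and `llm` is any callable that takes a prompt string and returns text (in the application this would wrap the call to the hosted model).

```python
from typing import Callable, List, Optional


def summarize_incrementally(chunks: List[str], llm: Callable[[str], str]) -> str:
    """Build a summary by feeding the LLM one chunk at a time."""
    summary: Optional[str] = None
    for chunk in chunks:
        if summary is None:
            # First chunk: there is no partial summary yet.
            prompt = "Summarize the following text:\n\n" + chunk
        else:
            # Later chunks: refine the partial summary with the new text.
            prompt = (
                "Here is the summary so far:\n\n" + summary
                + "\n\nUpdate the summary using this additional text:\n\n" + chunk
            )
        summary = llm(prompt)
    return summary or ""


# Stand-in for a real model, used only to show the call pattern.
prompts_seen: List[str] = []


def fake_llm(prompt: str) -> str:
    prompts_seen.append(prompt)
    return "summary-" + str(len(prompts_seen))


result = summarize_incrementally(["chunk one", "chunk two", "chunk three"], fake_llm)
print(result)             # summary-3
print(len(prompts_seen))  # 3
```

The loop makes one LLM call per chunk, so the cost grows linearly with the number of chunks, and each prompt must leave room for both the partial summary and the next chunk.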

For ease of understanding, the logic for the steps above is available in a notebook. Check it out before proceeding. You can use it as-is or modify it for use in the application.

Notebook

Solution

The project solution is available under the streamlit folder.

Solution