Your manager has asked you to build an application that can summarize PDF documents. Because the budget is tight, you must keep the operational cost of the application low, even if that means the performance is not the best.
PDF document sizes are not fixed and can be up to a few MB. The LLM your organization uses has a context window size of 32 KB, which cannot fit the entire content of such a PDF document in a single prompt. There are 2 options that can address this challenge:
Since budget is an issue, you have decided to go with option #2.
The illustration below describes the flow.
A typical PDF can range from hundreds of KB to a few MB. A complete PDF of that size cannot be accommodated in the input to an LLM. For example, a 100 KB PDF document cannot be sent in its entirety for summarization to the Mistral-7B-Instruct-v0.2 LLM, as its context window size is 32 KB.
To address the context window size challenge, you will use an incremental summarization process. This process involves:
For ease of understanding, the logic for doing the above is available in a notebook. Before proceeding, check it out. Use it as-is or modify it for use in the application.
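Separately from the notebook, the general idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the notebook's code: it assumes incremental summarization means folding the document into a running summary one chunk at a time, it measures chunks in characters as a simplification, and the names `split_into_chunks`, `incremental_summary`, and the `summarize` callable are hypothetical.

```python
from typing import Callable, List


def split_into_chunks(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def incremental_summary(
    text: str,
    summarize: Callable[[str], str],  # hypothetical wrapper around the LLM call
    chunk_size: int,
) -> str:
    """Fold the document into a running summary, one chunk at a time."""
    summary = ""
    for chunk in split_into_chunks(text, chunk_size):
        # Each prompt carries only the summary so far plus one new chunk,
        # so it stays small as long as the summary itself stays short.
        prompt = (
            "Summary so far:\n" + summary
            + "\n\nNew text:\n" + chunk
            + "\n\nUpdated summary:"
        )
        summary = summarize(prompt)
    return summary
```

In the application, `summarize` would wrap the call to the LLM, and `chunk_size` would be chosen so that the prompt template, the running summary, and one chunk together fit in the context window. The trade-off of this approach is one LLM call per chunk, made sequentially.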
The project solution is available under the streamlit folder.