Ricardo León
Ever wondered how long it would take to read the entire internet? It’s not just a daunting task—it’s impossible. With billions of web pages growing by the second, the internet is an ever-expanding universe of information. If it stopped growing today, it would still take one person over 3,800 years to read just the accessible, text-based content (forget about sleeping and eating!). This staggering fact highlights the immense scale of digital information we are exposed to.
Like the internet, enterprise data has grown exponentially, overwhelming human analysis and comprehension capacities. Companies and their customers generate vast amounts of information daily, making it challenging to extract timely insights.
Enter Large Language Models (LLMs) for text summarization, benefiting a wide range of industries by enhancing information processing and decision-making efficiency. For example:
- In healthcare, LLM summarization can condense patient records and medical research, aiding faster diagnosis and treatment planning.
- In finance, it can quickly summarize market reports and financial statements, helping professionals make informed investment decisions swiftly.
- In the legal field, it can streamline the review of lengthy documents and case files, reducing the time required for legal research and preparation.
This blog post will focus on the last one: legal cases.
Use Case: Legal Document Summarization
Pureinsights has helped a legal firm improve its process for summarizing personal injury files by leveraging Large Language Models (LLMs). This task, previously a manual and time-consuming effort requiring lawyers’ attention, has been automated through an innovative LLM-based process. This has boosted the firm’s productivity by freeing up valuable time for their attorneys, enabling them to focus on other critical aspects of the business.
This blog post explores how this was achieved.
To Fine-tune or not to fine-tune?
Fine-tuning was considered as a possibility for summarizing legal cases.
Fine-tuning can offer significant benefits, including domain specificity, enhanced accuracy, task optimization, and more. However, it entails time-consuming and resource-intensive iterations that demand high-quality curated input data to refine the model. Fine-tuning is typically reserved for specialized scenarios with unique requirements that cannot be met using existing general-purpose models.
Moreover, the artificial intelligence landscape is evolving rapidly. New, faster, smarter, and more cost-effective models are regularly introduced, capable of tackling increasingly complex tasks, potentially reducing the necessity for fine-tuning and making already fine-tuned models obsolete.
Given these factors, Pureinsights decided to experiment with existing models from different providers before considering fine-tuning, focusing on prompt engineering and/or prompt chaining to achieve high-quality results. Results presented here demonstrate that it was a good decision for this use case.
What model to use?
With fine-tuning set aside, the next natural step is to decide which model to use. Many LLMs are available from various providers:
- OpenAI: gpt-4o, gpt-4-turbo
- Anthropic: Claude 3 Sonnet, Claude 3.5 Sonnet, Claude 3 Opus
- Meta: Llama 3
- Google: Gemini 1.5 Pro
Just to name a few. But which one should we use? There are a few variables that need to be considered when selecting a model:
- Potential Uses: Has the model been designed with a purpose in mind? (e.g., chatbots, long-running complex tasks)
- A balance between accuracy and response time is advisable: Obtain quality summaries in a reasonable time.
- Context Window: How much input information can I provide per single call to the LLM?
  - Depending on the length of the text to summarize, several invocations may be required.
- Cost: Accessing LLMs can get expensive quickly. Both input and output tokens count towards cost calculations. Newer (and more accurate and robust) models tend to be more expensive.
- Quotas: LLM providers tend to limit usage (number of operations or number of processed tokens) during a specific amount of time. This is typically tied to usage tiers. Quotas vary widely between providers and models.
  - Depending on the business case, a model might meet the other criteria but still fail to cope with the expected load because of quota limitations.
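To make the context-window and cost criteria more concrete, the sketch below estimates how many calls a text might need and what a workload might cost. The function names, the reserved-token allowance, and all token counts and prices are hypothetical placeholders, not figures from any provider; real token counts depend on each model's tokenizer.

```python
import math


def invocations_needed(text_tokens: int, context_window: int, reserved_tokens: int = 0) -> int:
    """Estimate how many LLM calls are needed to pass a text through a context window.

    reserved_tokens is a hypothetical allowance for the prompt and the model's reply.
    """
    usable = context_window - reserved_tokens
    if usable <= 0:
        raise ValueError("reserved_tokens must be smaller than context_window")
    return max(1, math.ceil(text_tokens / usable))


def estimated_cost(input_tokens: int, output_tokens: int,
                   input_price: float, output_price: float) -> float:
    """Cost of a workload; prices are per one million tokens (placeholder values)."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Illustrative numbers only: a 250,000-token file and a 128,000-token window.
calls = invocations_needed(250_000, context_window=128_000, reserved_tokens=8_000)
cost = estimated_cost(250_000, 3_000, input_price=5.0, output_price=15.0)
print(calls, round(cost, 4))
```

An estimate like this is only a first filter; quotas and response times still need to be checked against the provider's current documentation.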
Considering the requirements of the summarization task and the above-mentioned factors, the following models serve as a solid starting point for experimentation:
- Anthropic Claude 3 Sonnet
- OpenAI gpt-4o
Summarization
Summarization requires two key elements:
- A tuned prompt: Contains a set of instructions passed to the LLM, telling it to generate a summary that follows a set of guidelines or a given format.
- A summarization strategy: Defines how to summarize large texts that don’t fit in the LLM’s context window.
These two elements are intertwined, as the prompt is tweaked according to the strategy.
Strategies
The following sections explain three strategies for summarizing content with LLMs.
Stuff
The complete text to be summarized (in this case, the complete JSON object) is stuffed into a single LLM call alongside a prompt. The LLM produces and returns a summary of the whole content. A single prompt is required.
Pros:
- Straightforward approach.
- Easy to implement.
- Complete context is presented to the LLM in one API call.
Cons:
- Not suitable for large documents that don’t fit into the model’s context window.
- While effective for small text, when dealing with larger text, even if it fits within the model’s context window, it may generate a simplistic summary that overlooks crucial elements.
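A minimal sketch of the Stuff strategy follows. It assumes a hypothetical `llm` callable that takes a prompt string and returns the model's reply; the prompt wording and the function name are illustrative and are not the prompt used in the project.

```python
from typing import Callable

# Illustrative prompt template; the real prompt would carry the firm's guidelines.
STUFF_PROMPT = "Summarize the following document:\n\n{document}"


def summarize_stuff(document: str, llm: Callable[[str], str]) -> str:
    """Send the whole document to the model in a single call."""
    prompt = STUFF_PROMPT.format(document=document)
    return llm(prompt)
```

Because everything goes into one call, this only works when the prompt plus the document fit inside the model's context window.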
Map-Reduce
Text is split into smaller chunks, which are summarized individually, and then a final summary of the partial summaries is produced. Two prompts are required: one for summarizing each chunk, and one to produce the final summary from the partial summaries.
Pros:
- More specific details from each chunk can be captured.
- Texts that don’t fit in a model’s context window can be summarized.
- Different chunk granularities can be configured to obtain different results.
- Individual summaries can be executed in parallel.
Cons:
- More requests to the LLM (one per chunk plus one for the final summary), which increases cost.
- Text cohesion might be lost at chunking time, which could lead to context continuity loss.
- If a large text is split into too many chunks, the input for the final summarization stage might be too large to fit into the context window.
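The Map-Reduce flow described above could be sketched as follows. The `llm` callable, the two prompt templates, and the `max_workers` default are assumptions made for illustration; the thread pool shows one possible way to run the per-chunk summaries in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Illustrative prompts: one for each chunk, one for the final summary.
MAP_PROMPT = "Summarize this section of a larger document:\n\n{chunk}"
REDUCE_PROMPT = "Combine these partial summaries into one final summary:\n\n{summaries}"


def summarize_map_reduce(chunks: List[str], llm: Callable[[str], str],
                         max_workers: int = 4) -> str:
    """Summarize each chunk independently, then summarize the partial summaries."""
    if not chunks:
        return ""
    # Map step: chunks do not depend on each other, so they can run in parallel.
    # pool.map returns results in the same order as the input chunks.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(lambda c: llm(MAP_PROMPT.format(chunk=c)), chunks))
    # Reduce step: one more call that merges the partial summaries.
    return llm(REDUCE_PROMPT.format(summaries="\n\n".join(partials)))
```

If the joined partial summaries could exceed the context window, the reduce step would itself need to be chunked, which this sketch does not handle.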
Refine
The text is divided into smaller chunks. A summary is created for the first chunk, generating an initial output. For each subsequent chunk, the summary so far is combined with the new chunk to create a refined summary, resulting in a rolling summary. One prompt is required. For subsequent calls, the prompt is extended to include the partial summary.
Pros:
- Same benefits as Map-Reduce, apart from parallel execution.
- Continuity of context between chunks is preserved, which potentially yields a better summary.
Cons:
- Input context is larger as it contains rolling summary + current chunk.
- LLM costs are higher.
- Chunks might be smaller compared to Map/Reduce, as the rolling summary consumes part of the context window.
- Cannot be parallelized.
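The rolling summary can be sketched as a simple loop. As before, `llm` is a hypothetical callable, and the base prompt and its extension are illustrative wording rather than the project's actual prompt.

```python
from typing import Callable, List

# One base prompt, plus an extension that carries the rolling summary.
BASE_PROMPT = "Summarize the following text:\n\n{chunk}"
REFINE_EXTENSION = (
    "\n\nExisting summary of the earlier text, to be refined with the text above:"
    "\n\n{summary}"
)


def summarize_refine(chunks: List[str], llm: Callable[[str], str]) -> str:
    """Build a rolling summary, one chunk at a time."""
    summary = ""
    for index, chunk in enumerate(chunks):
        prompt = BASE_PROMPT.format(chunk=chunk)
        if index > 0:
            # From the second chunk on, include the summary produced so far.
            prompt += REFINE_EXTENSION.format(summary=summary)
        summary = llm(prompt)
    return summary
```

Each call depends on the previous call's output, which is why this loop is sequential.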
Experimentation
Python code was written to experiment with the following setup:
- Summarization Techniques: Map-Reduce and Refine
- Models: OpenAI’s gpt-4o and Anthropic’s Claude 3 Sonnet.
- Different chunk sizes, defined per model
- Simple chunking strategy
  - Text is split based on number of tokens.
NOTE: The Stuff strategy was deliberately excluded. For texts that fit within a single chunk, either of the strategies above with a large chunk size has the same effect.
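A token-based split can be sketched as below. To stay self-contained, this version treats whitespace-separated words as tokens, which is an assumption for illustration only; a real run would count tokens with the tokenizer of the chosen model.

```python
from typing import List


def chunk_by_tokens(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive chunks of at most chunk_size tokens.

    Tokens are approximated here by whitespace-separated words.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    tokens = text.split()
    return [
        " ".join(tokens[start:start + chunk_size])
        for start in range(0, len(tokens), chunk_size)
    ]
```

When `chunk_size` is at least as large as the text, the function returns a single chunk, which is the situation the note above describes.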
A total of eight different summarization scenarios were executed using various combinations of models, strategies, and chunk sizes:
Results
Execution time
Execution time has been measured while creating summaries for each one of the scenarios mentioned above. The objective is to grasp initial performance metrics. However, these metrics are indicative and do not accurately represent actual performance or allow for extrapolation to production.
From this exercise, it was concluded that:
- As the chunk size is reduced, performance is slower, as more requests to the LLM are made.
- The ‘Refine’ strategy shows slightly slower performance compared to ‘Map-Reduce’, likely because the rolling summary is passed to each subsequent request.
- Parallelizing Map-Reduce speeds up execution time.
- Summarization can take 15-30 seconds to execute, depending on the size of the input text, so it is advisable to run it as an asynchronous process.
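One possible way to run a slow summarization without blocking the caller is sketched below with `asyncio`. The function names and the `slow_summarize` stand-in are hypothetical; a production setup might use a job queue instead.

```python
import asyncio
import time
from typing import Callable


async def summarize_async(text: str, summarize: Callable[[str], str]) -> str:
    """Run a blocking summarization function in a worker thread (Python 3.9+)."""
    return await asyncio.to_thread(summarize, text)


async def handle_request(text: str, summarize: Callable[[str], str]) -> str:
    # Start the summary without blocking the event loop.
    task = asyncio.create_task(summarize_async(text, summarize))
    # Other work (for example, acknowledging the request) could happen here.
    await asyncio.sleep(0)
    # Collect the result once it is ready.
    return await task


def slow_summarize(text: str) -> str:
    """Stand-in for a real LLM call; sleeps briefly instead of 15-30 seconds."""
    time.sleep(0.1)
    return text[:20]


if __name__ == "__main__":
    print(asyncio.run(handle_request("A long personal injury file ...", slow_summarize)))
```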
Initial Observations
- GPT-4o tends to produce longer summaries than Claude 3 Sonnet.
- This doesn’t necessarily mean that summaries are better, but it might capture more details.
- The same prompt was used for both models. The prompt could perhaps be modified to lengthen Claude’s responses.
- Using smaller chunk sizes results in slightly more detailed outcomes, which can be especially beneficial for extensive use cases with rich details.
Qualitative analysis of results is inherently subjective and is typically best conducted by subject matter experts (SMEs). In this experiment, however, LLMs were used to judge the quality of the produced summaries.
Evaluation using LLMs
Two approaches were followed to evaluate summaries using LLMs:
- Summary to summary comparison
- Evaluation against reference summary (i.e., expected summary)
For both approaches, Claude 3.5 Sonnet has been used as evaluator to compare summarization scenarios for each legal case and give each summary a score from 1 to 10.
The same evaluation criteria were used for both cases. The main difference in the second evaluation scenario was the inclusion of a reference summary for comparison in the prompt.
Several evaluation criteria were considered, including Structure, Clarity, Relevance, Detail Granularity, Coherence, Tone, and other business-specific criteria.
Moreover, a scoring guide was provided in the prompt:
- 1-3: Poor quality; many criteria missing or poorly addressed.
- 4-6: Average quality; some criteria are well-addressed, but there are significant gaps.
- 7-8: Good quality; most criteria are well-addressed with minor gaps.
- 9-10: Excellent quality; all criteria are comprehensively and accurately addressed.
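A rough sketch of how such an evaluation prompt might be assembled and its reply parsed is shown below. The instruction wording, the `Score: N` reply format, and all function names are assumptions for illustration; only the scoring bands come from the guide above. The sketch scores one summary at a time, with or without a reference summary, and does not show a prompt that compares several summaries at once.

```python
import re
from typing import Optional

SCORING_GUIDE = (
    "1-3: Poor quality; many criteria missing or poorly addressed.\n"
    "4-6: Average quality; some criteria are well-addressed, but there are significant gaps.\n"
    "7-8: Good quality; most criteria are well-addressed with minor gaps.\n"
    "9-10: Excellent quality; all criteria are comprehensively and accurately addressed."
)


def build_evaluation_prompt(summary: str, reference: Optional[str] = None) -> str:
    """Assemble a judge prompt; the reference summary is optional."""
    parts = [
        "Evaluate the summary below and reply with 'Score: N', "
        "where N is an integer from 1 to 10.",
        SCORING_GUIDE,
    ]
    if reference is not None:
        parts.append("Reference summary:\n" + reference)
    parts.append("Summary to evaluate:\n" + summary)
    return "\n\n".join(parts)


def parse_score(reply: str) -> int:
    """Extract the score from a reply in the assumed 'Score: N' format."""
    match = re.search(r"Score:\s*(\d+)", reply)
    if match is None:
        raise ValueError("no score found in reply")
    score = int(match.group(1))
    if not 1 <= score <= 10:
        raise ValueError("score out of range")
    return score


def quality_band(score: int) -> str:
    """Map a 1-10 score to the band named in the scoring guide."""
    if score <= 3:
        return "Poor"
    if score <= 6:
        return "Average"
    if score <= 8:
        return "Good"
    return "Excellent"
```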
Results
The following table depicts the average score for each processed case, using the two evaluation approaches.
From these results it can be concluded that:
- OpenAI’s GPT-4o provides better quality results than Claude 3 Sonnet.
- The ‘Refine’ strategy produces summaries with higher scores than ‘Map-Reduce’.
- Larger chunk sizes capture more details, and that influences quality.
Conclusions
Summarizing legal cases using out-of-the-box LLMs is an innovative solution that has significantly helped one of our clients by speeding up manual processes and freeing attorneys to focus on other relevant aspects of the business.
We can help you to leverage LLMs to optimize your business use case with state-of-the-art AI technology. CONTACT US today for a free 1:1 consultation to get started.