How LlamaIndex’s SubQuestionQueryEngine simplifies complex queries
Our last article focused on querying both structured and unstructured data using SQLAutoVectorQueryEngine. In this article, let’s explore LlamaIndex’s SubQuestionQueryEngine, which handles complex queries.
First, let’s look at our use case: analyzing the financial reports of the United States government for the most recent fiscal years, 2021 and 2022. The findings are based on facts from the most recent data available and are apolitical.
The financial reports of the United States government are freely downloadable from the website of the Bureau of the Fiscal Service. The financial report provides the President, Congress, and the American people with a comprehensive view of the federal government’s finances.
For our demo app, we will download the executive summaries of the U.S. government’s financial reports for fiscal years 2021 and 2022. We will use these two summary reports as our data sources and ask questions about the content of these two reports.
I selected the executive summaries of those financial reports (about ten pages each) as our data sources, partly to save on token costs, as the original full reports are about 200–300 pages long. Feel free to download the full reports if you don’t mind spending a little more on tokens.
SubQuestionQueryEngine is designed by LlamaIndex to break down a complex query (e.g., a compare-and-contrast question) into multiple sub-questions, each routed to its target query engine for execution. After all sub-questions have been executed, their responses are gathered and sent to a response synthesizer to produce the final response.
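To make the flow concrete, here is a rough, illustrative sketch of the kind of decomposition SubQuestionQueryEngine performs for a comparative question. The sub-question wording is generated by the LLM at query time and will vary; the tool names match the query engine tools we define later in this article.

# Illustrative only: a comparative query decomposed into per-report sub-questions
complex_query = "Compare and contrast the DoD costs between 2021 and 2022"
# hypothetical decomposition; each sub-question is routed to one query engine tool
sub_questions = [
    {"sub_question": "What are the DoD costs in 2021?", "tool": "executive_summary_2021"},
    {"sub_question": "What are the DoD costs in 2022?", "tool": "executive_summary_2022"},
]
# each sub-question is answered by its target engine; the answers are then
# passed to the response synthesizer to produce the final answer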
Given our use case, we come up with the following high-level architectural diagram of how SubQuestionQueryEngine tackles complex questions related to the U.S. government’s financial reports.
Let’s dive into the detailed implementation. The steps are very similar to how we implemented SQLAutoVectorQueryEngine in our previous article.
Define LLM and service_context
Let’s first define LLMPredictor by passing in ChatOpenAI with the model gpt-3.5-turbo. We then pass in llm_predictor to construct our service_context. We also set the global service_context object so as to avoid passing service_context when building the indices.
from langchain.chat_models import ChatOpenAI
from llama_index import LLMPredictor, ServiceContext, set_global_service_context

llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo", max_tokens=-1))
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)
# set the global service context object, avoiding passing service_context when building the indices
set_global_service_context(service_context)
Load data
For this demo app, we have already downloaded the summary reports of U.S. government financial reports for 2021 and 2022. These reports are in PDF format, and they are placed in a folder named reports in our project structure. Let’s call SimpleDirectoryReader to load these PDF docs and print out the number of pages for each report. See the code snippet below:
from llama_index import SimpleDirectoryReader

report_2021_docs = SimpleDirectoryReader(input_files=["reports/executive-summary-2021.pdf"]).load_data()
print(f"loaded executive summary 2021 with {len(report_2021_docs)} pages")
report_2022_docs = SimpleDirectoryReader(input_files=["reports/executive-summary-2022.pdf"]).load_data()
print(f"loaded executive summary 2022 with {len(report_2022_docs)} pages")
Build indices
Now, let’s build the indices for these two docs by calling GPTVectorStoreIndex, and print out the number of nodes for each report for debugging purposes.
from llama_index import GPTVectorStoreIndex

report_2021_index = GPTVectorStoreIndex.from_documents(report_2021_docs)
print(f"built index for executive summary 2021 with {len(report_2021_index.docstore.docs)} nodes")
report_2022_index = GPTVectorStoreIndex.from_documents(report_2022_docs)
print(f"built index for executive summary 2022 with {len(report_2022_index.docstore.docs)} nodes")
Build query engines
Building the query engines for each index is really straightforward. See the code snippet below. The similarity_top_k parameter specifies how many of the most similar nodes (chunks) to retrieve for each query. In this case, with similarity_top_k=3, the top 3 most similar nodes are returned for each query made against either report_2021_engine or report_2022_engine.
report_2021_engine = report_2021_index.as_query_engine(similarity_top_k=3)
report_2022_engine = report_2022_index.as_query_engine(similarity_top_k=3)
Build query engine tools
Once the query engines are built, we need to wrap each of them in a QueryEngineTool by passing in the specific query_engine and its metadata, which includes the name and description of that query engine.
from llama_index.tools import QueryEngineTool, ToolMetadata

query_engine_tools = [
    QueryEngineTool(query_engine=report_2021_engine, metadata=ToolMetadata(name='executive_summary_2021', description='Provides information on US government financial report executive summary 2021')),
    QueryEngineTool(query_engine=report_2022_engine, metadata=ToolMetadata(name='executive_summary_2022', description='Provides information on US government financial report executive summary 2022')),
]
Define SubQuestionQueryEngine
Notice that we constructed query_engine_tools as a list above. This is because SubQuestionQueryEngine expects a sequence of QueryEngineTool objects when it is constructed. See the code snippet below:
from llama_index.query_engine import SubQuestionQueryEngine

sub_question_engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=query_engine_tools)
Now that our code is in place, let’s launch our app and test it by asking a few complex questions related to comparing and contrasting certain categories in the summary reports.
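For reference, here is roughly how a question is submitted to the engine; in the demo app this call is wired to the app’s UI layer, which is not shown here.

# submit a complex question to the sub-question engine
response = sub_question_engine.query(
    "Compare and contrast the DoD costs between 2021 and 2022"
)
print(response)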
Question #1: “Compare and contrast the DoD costs between 2021 and 2022”
Response:
In 2021, the DoD net costs increased by $144.8 billion due to changes in assumptions and slight increases in net costs across major programs. In contrast, in 2022, the DoD net costs increased by a much larger amount of $568.4 billion primarily due to changes in assumptions related to long-term benefits liabilities. However, both years saw increases in net costs across major programs such as military operations, readiness, support, procurement, personnel, and R&D.
Let’s drill into the console output to understand what exactly happened under the hood:
A few key points:
- Two sub-questions were generated, as indicated on the first line of the console output above.
- Each question/answer pair is highlighted in different colors.
- The first sub-question, “What are the DoD costs in 2021”, returned the response: “The DoD net costs increased by $144.8 billion in 2021 due to a $100.2 billion loss increase from changes in assumptions referenced above, as well as slight increases in net costs across DoD’s major programs, including military operations, readiness, support, procurement, personnel, and R&D.”
- The second sub-question, “What are the DoD costs in 2022”, returned the response: “The DoD net costs increased by $568.4 billion in 2022 primarily due to a $444.2 billion loss increase from changes in assumptions related to actuarial projections of long-term benefits liabilities. However, the majority of DoD’s net costs included military operations, readiness and support, procurement, personnel, and R&D, which also collectively increased.”
- The two sub-responses were gathered and synthesized into the final answer. This step is not reflected in the console output, but it is what SubQuestionQueryEngine does after both query engines have returned their responses to their respective sub-questions; the sketch below shows one way to peek at it.
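Depending on the LlamaIndex version you are using, the intermediate sub-question answers are attached to the final response as source nodes, so you can inspect what was fed into the response synthesizer. A minimal sketch, assuming that behavior:

# inspect the sub-question answers that were passed to the response synthesizer
# (attribute availability may vary across LlamaIndex versions)
for node_with_score in response.source_nodes:
    print(node_with_score.node.get_text())
    print("---")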
Question #2: “Compare and contrast the total liabilities between 2021 and 2022”