Build an Index and Query

We run this experiment on localhost, since it does not require intensive computation. Elasticsearch (ES) and Milvus must be installed first, as specified here.

In the index-and-query use case, users provide a collection of documents, such as text files or webpages, to build a retriever. Users can then ask questions to obtain relevant results from the provided documents. The code for this use case is in index_and_query_from_docs.py. To run it, go to the denser-retriever repo and run the following command:

poetry run python experiments/index_and_query_from_docs.py

If the run succeeds, the output should look similar to the following.

2024-05-27 12:00:55 INFO: ES ingesting passages.jsonl record 96
2024-05-27 12:00:55 INFO: Done building ES index
2024-05-27 12:00:55 INFO: Remove existing Milvus index state_of_the_union
2024-05-27 12:00:59 INFO: Milvus vector DB ingesting passages.jsonl record 96
2024-05-27 12:01:03 INFO: Done building Vector DB index
[{'source': 'tests/test_data/state_of_the_union.txt',
'text': 'One of the most serious constitutional responsibilities...',
'title': '', 'pid': 73,
'score': -1.6985594034194946}]

Build and query a retriever from a text file

Overview

The index and query use case consists of two steps:

  1. Build a denser retriever from a text file or a webpage.
  2. Query a retriever to obtain relevant results.

To support additional types of files such as PDF, users can refer to LangChain file loaders.

The following diagram illustrates a denser retriever, which consists of three components:

image

  • Keyword search relies on traditional search techniques that use exact keyword matching. Denser retriever uses Elasticsearch for this component.
  • Vector search uses neural network models to encode both the query and the documents into dense vector representations in a high-dimensional space. We use Milvus with the snowflake-arctic-embed-m model; the snowflake-arctic-embed family is reported to achieve state-of-the-art performance on the MTEB/BEIR leaderboard for each of its size variants.
  • An ML cross-encoder reranker can further boost accuracy over the two retrieval approaches above. We use cross-encoder/ms-marco-MiniLM-L-6-v2, which offers a good balance between accuracy and inference latency.

In the following section, we will explain the underlying processes and mechanisms involved.

Prepare the config file

We configure the three components above in the following yaml file (available at repo). Most of the parameters are self-explanatory. The keyword, vector, and rerank sections configure Elasticsearch, Milvus, and the reranker, respectively.

We use combine: model to combine Elasticsearch, Milvus, and the reranker via an XGBoost model, experiments/models/msmarco_xgb_es+vs+rr_n.json, which was trained on the MTEB MSMARCO dataset (see the training recipe for how to train such a model). Besides the model combination, we can also set combine to linear or rank. Experiments on MTEB datasets suggest that the model combination can lead to significantly higher accuracy than the linear or rank methods.

Some parameters, such as es_ingest_passage_bs, are used only when training an XGBoost model (i.e., they are not needed at the query stage).

version: "0.1"

# linear, rank or model
combine: model
keyword_weight: 0.5
vector_weight: 0.5
rerank_weight: 0.5
model: ./experiments/models/msmarco_xgb_es+vs+rr_n.json
model_features: es+vs+rr_n

keyword:
  es_user: elastic
  es_passwd: YOUR_ES_PASSWORD
  es_host: http://localhost:9200
  es_ingest_passage_bs: 5000
  topk: 100

vector:
  milvus_host: localhost
  milvus_port: 19530
  milvus_user: root
  milvus_passwd: Milvus
  emb_model: Snowflake/snowflake-arctic-embed-m
  emb_dims: 768
  one_model: false
  vector_ingest_passage_bs: 2000 # 1000
  topk: 100

rerank:
  rerank_model: cross-encoder/ms-marco-MiniLM-L-6-v2
  rerank_bs: 100
  topk: 100

output_prefix: ./denser_output_retriever/

max_doc_size: 0
max_query_size: 10000
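
To make the linear and rank options of combine more concrete, here is a minimal sketch of two common score-fusion techniques: a weighted sum of min-max normalized scores, and reciprocal rank fusion. It is an illustration under stated assumptions, not denser retriever's actual implementation; the function names, the normalization step, the constant k=60, and the sample scores are all hypothetical. Only two retrievers are fused, for brevity.

```python
from typing import Dict


def min_max_normalize(scores: Dict[int, float]) -> Dict[int, float]:
    """Rescale scores to [0, 1] so that different retrievers are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {pid: 1.0 for pid in scores}
    return {pid: (s - lo) / (hi - lo) for pid, s in scores.items()}


def combine_linear(keyword: Dict[int, float], vector: Dict[int, float],
                   keyword_weight: float = 0.5,
                   vector_weight: float = 0.5) -> Dict[int, float]:
    """Weighted sum of normalized scores; a missing passage contributes 0."""
    kw, vec = min_max_normalize(keyword), min_max_normalize(vector)
    return {
        pid: keyword_weight * kw.get(pid, 0.0) + vector_weight * vec.get(pid, 0.0)
        for pid in set(kw) | set(vec)
    }


def combine_rank(keyword: Dict[int, float], vector: Dict[int, float],
                 k: int = 60) -> Dict[int, float]:
    """Reciprocal rank fusion: only positions matter, not raw score values."""
    fused: Dict[int, float] = {}
    for scores in (keyword, vector):
        ranked = sorted(scores, key=scores.get, reverse=True)
        for rank, pid in enumerate(ranked, start=1):
            fused[pid] = fused.get(pid, 0.0) + 1.0 / (k + rank)
    return fused


# Hypothetical scores keyed by passage id (pid), made up for illustration.
keyword_scores = {73: 12.4, 5: 9.1, 40: 3.3}
vector_scores = {73: 0.82, 40: 0.79, 12: 0.40}

linear = combine_linear(keyword_scores, vector_scores)
rank = combine_rank(keyword_scores, vector_scores)
best_linear = max(linear, key=linear.get)
best_rank = max(rank, key=rank.get)
print(best_linear, best_rank)
```

The linear method depends on how scores are scaled, while the rank method ignores score magnitudes entirely. A learned model, as selected by combine: model, can instead weigh the individual scores as features.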

Generate passages

We now describe how to build a retriever from a given text file, state_of_the_union.txt. The following code reads the text file, splits it into text chunks, and saves them to a jsonl file, passages.jsonl.

from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from denser_retriever.utils import save_HF_docs_as_denser_passages
from denser_retriever.retriever_general import RetrieverGeneral

# Generate text chunks
documents = TextLoader("tests/test_data/state_of_the_union.txt").load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
texts = text_splitter.split_documents(documents)
passage_file = "passages.jsonl"
save_HF_docs_as_denser_passages(texts, passage_file, 0)
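
To build intuition for the chunk_size=500 and chunk_overlap=100 settings above, the following sketch splits a string with a plain sliding window. This is a deliberate simplification: RecursiveCharacterTextSplitter prefers to break on separators such as paragraph and line breaks, so its chunks will generally differ. The function name sliding_window_chunks is hypothetical.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int = 500,
                          chunk_overlap: int = 100) -> List[str]:
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# A synthetic 1200-character string stands in for a real document.
sample = "".join(str(i % 10) for i in range(1200))
chunks = sliding_window_chunks(sample)
print([len(c) for c in chunks])
```

The overlap means that a sentence cut at a chunk boundary still appears in full in at least one neighboring chunk, at the cost of indexing some text twice.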

Each line in passages.jsonl is a passage, which contains the fields source, title, text, and pid (passage id).

{"source": "tests/test_data/state_of_the_union.txt",
"title": "",
"text": "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny.",
"pid": 0}
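
Before ingesting, it can be useful to sanity-check the passage file. The sketch below reads a jsonl file one line at a time and verifies that every passage carries the four fields described above. The helper load_passages and the demo records are hypothetical and not part of denser retriever; the demo writes its own temporary file so that it runs on its own.

```python
import json
import os
import tempfile

REQUIRED_FIELDS = ("source", "title", "text", "pid")


def load_passages(path):
    """Read a jsonl passage file and check that each record has all fields."""
    passages = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            passage = json.loads(line)
            missing = [k for k in REQUIRED_FIELDS if k not in passage]
            if missing:
                raise ValueError(f"line {line_no} is missing fields: {missing}")
            passages.append(passage)
    return passages


# Hypothetical demo records written to a temporary file.
demo = [
    {"source": "demo.txt", "title": "", "text": "First chunk.", "pid": 0},
    {"source": "demo.txt", "title": "", "text": "Second chunk.", "pid": 1},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "passages.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        for p in demo:
            f.write(json.dumps(p) + "\n")
    loaded = load_passages(path)

print(len(loaded), [p["pid"] for p in loaded])
```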

Build a Denser retriever

We can build a Denser retriever with the given passages.jsonl and experiments/config_local.yaml config file.

# Build denser index
retriever_denser = RetrieverGeneral("state_of_the_union", "experiments/config_local.yaml")
retriever_denser.ingest(passage_file)

Query a Denser retriever

We can simply use the following code to query a retriever to obtain relevant passages.

# Query
query = "What did the president say about Ketanji Brown Jackson"
passages, docs = retriever_denser.retrieve(query, {})
print(passages)

Each returned passage receives a confidence score to indicate how relevant it is to the given query. We get something similar to the following.

[{'source': 'tests/test_data/state_of_the_union.txt',
'text': 'One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.',
'title': '',
'pid': 73,
'score': -0.6116511225700378}]
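
The returned list can be post-processed like any list of dictionaries. As a minimal sketch, assuming that a higher score means a more relevant passage, the helper below keeps the k best passages and optionally drops those under a threshold. The function top_passages, the threshold, and the sample results are hypothetical.

```python
def top_passages(passages, k=3, min_score=None):
    """Return up to k passages sorted by descending score."""
    ranked = sorted(passages, key=lambda p: p["score"], reverse=True)
    if min_score is not None:
        ranked = [p for p in ranked if p["score"] >= min_score]
    return ranked[:k]


# Hypothetical results shaped like the retriever output shown above.
results = [
    {"pid": 12, "text": "passage b", "score": -3.2},
    {"pid": 73, "text": "passage a", "score": -0.61},
    {"pid": 40, "text": "passage c", "score": -1.9},
]

best = top_passages(results, k=2)
filtered = top_passages(results, k=3, min_score=-2.0)
print([p["pid"] for p in best], [p["pid"] for p in filtered])
```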

Put everything together

Putting all the code together gives the following script, which is also available at repo.

from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from denser_retriever.utils import save_HF_docs_as_denser_passages
from denser_retriever.retriever_general import RetrieverGeneral

# Generate text chunks
documents = TextLoader("tests/test_data/state_of_the_union.txt").load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
texts = text_splitter.split_documents(documents)
passage_file = "passages.jsonl"
save_HF_docs_as_denser_passages(texts, passage_file, 0)

# Build denser index
retriever_denser = RetrieverGeneral("state_of_the_union", "experiments/config_local.yaml")
retriever_denser.ingest(passage_file)

# Query
query = "What did the president say about Ketanji Brown Jackson"
passages, docs = retriever_denser.retrieve(query, {})
print(passages)

Build and query a retriever from a webpage

Building a retriever from a webpage is similar to the above, except for how the passage corpus is generated. The index_and_query_from_webpage.py source code can be found here.
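
To illustrate what changes in passage generation, the sketch below turns an HTML string into passages with the same source, title, text, and pid fields, using only the Python standard library. It is not the loader used by index_and_query_from_webpage.py; the class VisibleTextParser, the function html_to_passages, the sample HTML, and the example URL are all hypothetical, and a real page would first be downloaded.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping the contents of script and style tags."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_passages(html, source_url, chunk_size=500):
    """Extract visible text and cut it into fixed-size passages."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts)
    return [
        {"source": source_url, "title": "", "text": text[i:i + chunk_size], "pid": pid}
        for pid, i in enumerate(range(0, len(text), chunk_size))
    ]


# Made-up HTML standing in for a downloaded page.
html = (
    "<html><head><style>p {color: red}</style></head><body>"
    "<h1>Agents</h1><p>An agent plans and uses tools.</p>"
    "<script>var x = 1;</script></body></html>"
)
web_passages = html_to_passages(html, "https://example.com/post")
print(web_passages)
```

Once the passages are written to a jsonl file, indexing and querying proceed exactly as in the text-file case.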

To run this use case, go to denser-retriever repo and run:

poetry run python experiments/index_and_query_from_webpage.py

If the run succeeds, the output should look similar to the following.

2024-05-27 12:10:47 INFO: ES ingesting passages.jsonl record 66
2024-05-27 12:10:47 INFO: Done building ES index
2024-05-27 12:10:52 INFO: Milvus vector DB ingesting passages.jsonl record 66
2024-05-27 12:10:56 INFO: Done building Vector DB index
[{'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/',
'text': 'Fig. 1. Overview of a LLM-powered autonomous agent system...',
'title': '',
'pid': 2,
'score': -1.6985594034194946}]
