Observations on Building RAG Systems for Technical Documents
Abstract
Retrieval augmented generation (RAG) for technical documents poses challenges, as embeddings often fail to capture domain-specific information. We review prior art on the key factors affecting RAG and perform experiments to highlight best practices and potential pitfalls in building RAG systems for technical documents.
1 Introduction
Long-form Question Answering (QA) involves generating paragraph-sized responses from Large Language Models (LLMs). RAG for technical documents poses several challenges (Xu et al., 2023; Toro et al., 2023). Factors affecting retrieval performance, including in-context documents, LLMs and evaluation metrics, have been studied by Chen et al. (2023a). Building on this work, we conduct experiments on technical documents containing telecom and battery terminology to examine the influence of chunk length, keyword-based search, and the rank (order) of retrieved results in the RAG pipeline.
2 Experimental Setup
Our experiments are based on the IEEE Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) specifications IEEE (2021) and the IEEE Standard Glossary of Stationary Battery Terminology 1881-2016 (2016). We process the glossary of definitions and the full document separately, as many expected questions are based on the definitions. We source questions based on domain knowledge and report experimental results on 42 representative queries across the documents. While multiple embedding models are available (Reimers & Gurevych, 2019), we use MPNet (Song et al., 2020) for the entire document, excluding tables and captions. For the glossary, we split the defined term and its definition and generate separate embeddings for each, as well as for the full paragraph containing both. Soman & HG (2023) have reviewed other LLMs for the telecom domain, but we chose the llama2-7b-chat model (Touvron et al., 2023) as it is free and has a commercial-friendly license. We evaluate on multiple questions and report on selected questions to substantiate our observations. For reference, the prompts used for the LLM are provided in Appendix A.
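For illustration, the following is a minimal sketch of how these glossary embeddings could be generated; the sentence-transformers library, the all-mpnet-base-v2 checkpoint, and the toy glossary entry are assumptions for illustration, not our exact pipeline code.

```python
# Minimal sketch (assumed: sentence-transformers with an MPNet checkpoint).
# We embed the defined term, the definition, and the full "term: definition"
# paragraph separately, so each can be used for retrieval on its own.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")  # assumed MPNet checkpoint

glossary = [
    {"term": "float charge",
     "definition": "a charge applied to keep a battery at or near full capacity"},
    # one entry per glossary definition
]

term_embs = model.encode([g["term"] for g in glossary])
def_embs = model.encode([g["definition"] for g in glossary])
para_embs = model.encode([g["term"] + ": " + g["definition"] for g in glossary])
```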
3 Observations
We first observe that sentence embeddings become unreliable with increasing chunk size. Appendix B, Fig. 1 shows the Kernel Density Estimate (KDE) plot of cosine similarity scores for various sentence lengths. We take 10,970 sentences and compute pairwise similarity across all of them. High similarity is observed when both sentences are relatively long. The concentration of higher similarities at larger lengths indicates spurious matches, which we manually validate for a few samples. We find that when both the query and the queried document are over 200 words, the similarity distribution is bimodal. When either of them is over 200 words, there is a small but less perceptible lift at higher similarities.
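A rough sketch of how such a pairwise-similarity analysis could be reproduced is shown below; the model checkpoint, the toy sentence list, and the printed summary statistics are illustrative assumptions (in our experiments, the full set of 10,970 sentences is used and the distributions are visualized as a KDE plot).

```python
# Sketch: pairwise cosine similarities grouped by sentence length.
# Assumption: `sentences` would hold all document sentences (10,970 in our case);
# the tiny list below is only to keep the example runnable.
import numpy as np
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")  # assumed MPNet checkpoint
sentences = [
    "The MAC sublayer provides access to the wireless medium.",
    "A float charge maintains a battery at full capacity.",
    "word " * 250,  # stand-in for a long chunk (> 200 words)
    "term " * 250,  # a second long chunk
]

emb = model.encode(sentences, convert_to_tensor=True)
sim = util.cos_sim(emb, emb).cpu().numpy()
lengths = np.array([len(s.split()) for s in sentences])

# Upper-triangular pairs, split by whether both members exceed 200 words.
i, j = np.triu_indices(len(sentences), k=1)
both_long = (lengths[i] > 200) & (lengths[j] > 200)
print("mean similarity, both > 200 words:", sim[i, j][both_long].mean())
print("mean similarity, otherwise:       ", sim[i, j][~both_long].mean())
```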
Hyp | Hypothesis | Observation | Support (Samples)
---|---|---|---
H1 | Splitting the defined term and its definition helps in queries | For definitions, using the defined term and the definition separately for retrieval gives better performance | 22 of 30 queries (ID 2, 3)
H2 | Similarity scores should not be used to compare retrieved results | Similarity scores between different approaches are not comparable, and absolute values are often very small for correct answers | 24 of 30 queries (ID 2, 3)
H3 | Position of keywords matters | Keywords near the beginning of a sentence are retrieved with high accuracy; keywords occurring later in the sentence are difficult to retrieve | 25 of 30 queries (ID 1, 4, 5, 6)
H4 | Sentence-based similarity gives a better retriever | Similarity computed over sentences, with distinct parent paragraphs retrieved, gives more detailed context to the generator | 8 of 10 queries (Table 2 - ID F1)
H5 | Sentence-based similarity gives a better generator | Answers generated using sentence-based similarity with paragraph-based retrieval are better | 8 of 10 queries (App. Table 3 - ID F1)
H6 | Definitions with acronyms, or containing acronyms, do not perform well | Generated answers often merely expand or restate abbreviations, which is not helpful | 15 of 16 queries (App. Table 3 - ID F2, F3)
H7 | Order of retrieved paragraphs affects generator results | The order of retrieved paragraphs does not affect generator results in our experiments | NA

Table 1: Summary of hypotheses, observations, and supporting sample queries.
Table 1 summarizes our hypotheses and key observations; the corresponding sample queries and their results are provided in Appendix C. We form hypotheses on splitting definitions and defined terms to improve results (H1), the use of similarity scores as a comparison measure (H2), the influence of keyword position (H3), sentence-based similarity yielding a better retriever (H4) and generator (H5), answers for definitions involving acronyms (H6), and the effect of the order of retrieved results on generator performance (H7). Of these, H2 results from our experiments with the similarity-score distributions referred to earlier, and H7 is based on Chen et al. (2023a); the others are derived from our experiments to improve results. For each hypothesis, the last column reports the number of supporting experiments out of those applicable, along with sample query IDs.
We find that retrieval by thresholding on similarity scores is not helpful. For queries 1, 2 and 5, where the query phrase is present in the term or definition, the top retrieved score is higher. For query 3, the correct result is retrieved at the second position using the definition embedding; in the other cases, the result is not retrieved and the similarity scores are close. For queries 4 and 6, we are unable to retrieve the correct result, though the scores suggest otherwise. Thus, thresholding retriever results on similarity scores can lead to sub-optimal generator augmentation. We evaluate generator performance on our queries based on the retrieved results, using the top retrieved (a) definitions, and (b) terms and definitions. Better context yields better generated responses. For acronyms and their expansions, the generator does not add any value beyond the retrieved text.
For retrieval on the full document, we explore similarity search by sentence and by paragraph separately. In the former, we retrieve the paragraph to which each matching sentence belongs and take the top-k distinct paragraphs from the top similar sentences. We observe that sentence-based similarity search, with the corresponding paragraphs passed to the generator, provides better retriever and generator performance. Chen et al. (2023a) note that the order of presented information is important, but we did not observe different results on permuting the retrieved paragraphs. We observe generator responses to sometimes fail due to incorrect retrieval, hallucinated facts or incorrect synthesis, as highlighted in Chen et al. (2023a). We recommend such approaches for definition QA and long-form QA.
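The sketch below illustrates the sentence-level search with paragraph-level context described above; the toy paragraphs, the sentence splitting, and the parameter names are illustrative assumptions rather than our exact implementation.

```python
# Sketch: retrieve by sentence similarity, then return the top-k *distinct*
# paragraphs containing the best-matching sentences (illustrative only).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")  # assumed MPNet checkpoint

paragraphs = [
    "A beacon frame is a management frame. It is transmitted periodically by an access point.",
    "The float voltage of a stationary battery is specified by the manufacturer. It maintains full charge.",
]

# Flatten to sentences while remembering each sentence's parent paragraph.
sentences, sent_to_para = [], []
for p_idx, para in enumerate(paragraphs):
    for sent in para.split(". "):
        sentences.append(sent)
        sent_to_para.append(p_idx)

sent_embs = model.encode(sentences, convert_to_tensor=True)

def retrieve(query: str, k: int = 3) -> list[str]:
    """Return up to k distinct paragraphs, ordered by their best sentence match."""
    q_emb = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, sent_embs)[0]
    seen, out = set(), []
    for s_idx in scores.argsort(descending=True).tolist():
        p_idx = sent_to_para[s_idx]
        if p_idx not in seen:
            seen.add(p_idx)
            out.append(paragraphs[p_idx])
        if len(out) == k:
            break
    return out

context = retrieve("What is a beacon frame?")
```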
4 Conclusions and Future Work
We show that chunk length affects retriever embeddings, and that generator augmentation by thresholding retriever results on similarity scores can be unreliable. The prevalence of abbreviations and the large number of related paragraphs per topic make our observations particularly relevant for long-form QA on technical documents. As future work, we would like to use RAG metrics (Es et al., 2023; Chen et al., 2023b) to choose retrieval strategies. Methods and evaluation metrics for answering follow-up questions would also be of interest.
URM Statement
The authors acknowledge that at least one key author of this work meets the URM criteria of ICLR 2024 Tiny Papers Track.
References
- 1881-2016 (2016) IEEE 1881-2016. IEEE standard glossary of stationary battery terminology. IEEE Std 1881-2016, pp. 1–42, 2016. doi: 10.1109/IEEESTD.2016.7552407.
- Chen et al. (2023a) Hung-Ting Chen, Fangyuan Xu, Shane A Arora, and Eunsol Choi. Understanding retrieval augmentation for long-form question answering. arXiv preprint arXiv:2310.12150, 2023a.
- Chen et al. (2023b) Jiawei Chen, Hongyu Lin, Xianpei Han, and Le Sun. Benchmarking large language models in retrieval-augmented generation. arXiv preprint arXiv:2309.01431, 2023b.
- Es et al. (2023) Shahul Es, Jithin James, Luis Espinosa-Anke, and Steven Schockaert. Ragas: Automated evaluation of retrieval augmented generation. arXiv preprint arXiv:2309.15217, 2023.
- IEEE (2021) IEEE. IEEE standard for information technology–telecommunications and information exchange between systems - local and Metropolitan Area Networks–specific requirements - part 11: Wireless LAN medium access control (MAC) and physical layer (PHY) specifications. IEEE Std 802.11-2020 (Revision of IEEE Std 802.11-2016), pp. 1–4379, 2021. doi: 10.1109/IEEESTD.2021.9363693.
- Reimers & Gurevych (2019) Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, 2019.
- Soman & HG (2023) Sumit Soman and Ranjani HG. Observations on LLMs for telecom domain: Capabilities and limitations (To appear in the proceedings of The Third International Conference on Artificial Intelligence and Machine Learning Systems). arXiv preprint arXiv:2305.13102, 2023.
- Song et al. (2020) Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. MPNet: Masked and permuted pre-training for language understanding. Advances in Neural Information Processing Systems, 33:16857–16867, 2020.
- Toro et al. (2023) Sabrina Toro, Anna V Anagnostopoulos, Sue Bello, Kai Blumberg, Rhiannon Cameron, Leigh Carmody, Alexander D Diehl, Damion Dooley, William Duncan, Petra Fey, et al. Dynamic retrieval augmented generation of ontologies using artificial intelligence (DRAGON-AI). arXiv preprint arXiv:2312.10904, 2023.
- Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- Xu et al. (2023) Benfeng Xu, Chunxu Zhao, Wenbin Jiang, Pengfei Zhu, Songtai Dai, Chao Pang, Zhuo Sun, Shuohuan Wang, and Yu Sun. Retrieval-augmented domain adaptation of language models. In Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023), pp. 54–64, 2023.
Appendix A
The prompts used for the LLM in our experiments are as follows:
- System Prompt: Answer the questions based on the paragraphs provided here. DO NOT use any other information except that in the paragraphs. Keep the answers as short as possible. JUST GIVE THE ANSWER. NO PREAMBLE REQUIRED.
- User Prompt: "PARAGRAPHS : " + context + "QUESTIONS: " + query
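A minimal sketch of how these prompts could be assembled for the llama2-7b-chat model is shown below; the [INST]/<<SYS>> wrapping follows the standard Llama 2 chat format and is an assumption for illustration, not a detail reported above.

```python
# Sketch: combine the system and user prompts into a single Llama 2 chat-style
# input string (the [INST]/<<SYS>> wrapping is the standard Llama 2 chat format).
SYSTEM_PROMPT = (
    "Answer the questions based on the paragraphs provided here. "
    "DO NOT use any other information except that in the paragraphs. "
    "Keep the answers as short as possible. JUST GIVE THE ANSWER. "
    "NO PREAMBLE REQUIRED."
)

def build_prompt(context: str, query: str) -> str:
    user_prompt = "PARAGRAPHS : " + context + " QUESTIONS: " + query
    return f"<s>[INST] <<SYS>>\n{SYSTEM_PROMPT}\n<</SYS>>\n\n{user_prompt} [/INST]"

prompt = build_prompt(
    context="A beacon frame is a management frame transmitted periodically by an access point.",
    query="What is a beacon frame?",
)
```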
Appendix B
Figure 1: Kernel Density Estimate (KDE) plot of pairwise cosine similarity scores for various sentence lengths.
Appendix C Supplementary Material
We provide an anonymized Git repository which contains:
- Anonymized source code
- Experiment vs. hypothesis tabulation (for consolidated quantitative results)
- Details of the experiments across 42 queries and 7 hypotheses
In addition, for the hypotheses in Table 1 we provide sample queries along with the retrieved and generated results. See pages - of ICLR_Submission_Findings_Sample.pdf