Requirement Document for RAG-based MVP
1. Project Overview
This project aims to develop an AI-assisted application using Retrieval-Augmented Generation (RAG) to enable users to interact with a document base through natural language queries.
What is RAG?
RAG (Retrieval-Augmented Generation) is a technique that combines the power of large language models (LLMs) with a retrieval system. It allows the AI to access and use external knowledge when generating responses, rather than relying solely on its pre-trained knowledge.
How RAG works in this project (a minimal sketch in code follows this list):
- Document Ingestion: The system will process and store information from various document types (PDF, Word, text).
- Indexing: Create searchable embeddings of the document content.
- Query Processing: When a user asks a question, the system finds relevant information from the document base.
- Context Augmentation: The retrieved information is used to augment the AI’s knowledge.
- Response Generation: The AI generates a response based on the query and the retrieved context.
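For orientation, here is a minimal sketch of how these five steps fit together. It is not a prescribed implementation; it simply reuses the module and function names specified later in this document (Sections 5 and 6):

from document_ingestion import process_document, vector_store
from query_processing import embed_query
from context_preparation import prepare_context
from groq_integration import generate_response

def answer_question(query, file_path):
    process_document(file_path)                     # 1-2: ingest and index the document
    query_embedding = embed_query(query)            # 3: embed the user question
    results = vector_store.search(query_embedding)  # 3: retrieve the most similar chunks
    context = prepare_context(results)              # 4: assemble the retrieved context
    return generate_response(query, context)        # 5: let the LLM answer with that context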
Key Components:
- Document Processor: Extracts text and metadata from various file types.
- Embedding Model: Converts text into vector representations.
- Vector Database: Stores and enables efficient searching of text embeddings.
- Retrieval System: Finds relevant information based on the user’s query.
- Language Model (LLM): Generates human-like responses (using Groq API).
- Translation Service: Handles German and English inputs/outputs.
Developer Notes:
- Familiarize yourself with the concept of embeddings and vector similarity search (a short example follows these notes).
- Understand the basics of how large language models work.
- Be prepared to work with APIs for language models and possibly translation services.
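As a quick primer on embeddings and similarity search, here is a small example using the sentence-transformers model recommended in Section 4; the two example sentences are illustrative only:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
# Each text becomes a 384-dimensional vector
vectors = model.encode(["How much is the invoice total?", "The total amount is 42 EUR."])
# A cosine similarity close to 1.0 means the texts are semantically related
print(util.cos_sim(vectors[0], vectors[1]))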
2. Functional Requirements
2.1 User Interface
Implement a chat-like interface using a web framework. For an MVP with a low entry barrier, we recommend using Streamlit.
Implementation Guide:
- Install Streamlit:
pip install streamlit
- Create a main Python file (e.g., app.py) with a basic structure:
import streamlit as st
def main():
st.title("Document Chat MVP")
user_input = st.text_input("Enter your question:")
if st.button("Submit"):
# Process query and generate response
response = process_query(user_input)
st.write(response)
if __name__ == "__main__":
main()
- Run the app with:
streamlit run app.py
2.2 Document Processing
Implement a system to ingest and process various document types.
Implementation Guide:
- Install necessary libraries:
pip install PyPDF2 python-docx nltk
- Create a document_processor.py file:
import PyPDF2
from docx import Document
import nltk
nltk.download('punkt')
def extract_text(file_path):
if file_path.endswith('.pdf'):
return extract_from_pdf(file_path)
elif file_path.endswith('.docx'):
return extract_from_docx(file_path)
elif file_path.endswith('.txt'):
return extract_from_txt(file_path)
else:
raise ValueError("Unsupported file type")
def extract_from_pdf(file_path):
with open(file_path, 'rb') as file:
reader = PyPDF2.PdfReader(file)
return ' '.join([page.extract_text() for page in reader.pages])
def extract_from_docx(file_path):
doc = Document(file_path)
return ' '.join([para.text for para in doc.paragraphs])
def extract_from_txt(file_path):
with open(file_path, 'r', encoding='utf-8') as file:
return file.read()
def chunk_text(text, chunk_size=1000):
    # Placeholder: returns individual sentences; Section 5 (document_ingestion.py)
    # shows how to group sentences into chunks of roughly chunk_size characters.
    return nltk.sent_tokenize(text)
2.3 Language Support
Use a pre-trained language detection model and a translation API for language support.
Implementation Guide:
- Install necessary libraries:
pip install langdetect googletrans==3.1.0a0
- Create a language_utils.py file:
from langdetect import detect
from googletrans import Translator
def detect_language(text):
return detect(text)
def translate_text(text, target_lang):
translator = Translator()
return translator.translate(text, dest=target_lang).text
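A possible usage pattern is normalizing a German query to English before retrieval; the example sentence below is illustrative:

query = "Wie hoch ist der Rechnungsbetrag?"
if detect_language(query) == 'de':
    query_en = translate_text(query, target_lang='en')
else:
    query_en = query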
2.4 API Integration
Integrate with Groq API for response generation.
Implementation Guide:
- Install the Groq Python client:
pip install groq
- Create a groq_integration.py file:
import os
from groq import Groq
client = Groq(api_key=os.environ["GROQ_API_KEY"])
def generate_response(prompt):
chat_completion = client.chat.completions.create(
messages=[
{
"role": "user",
"content": prompt,
}
],
model="mixtral-8x7b-32768",
max_tokens=1024,
)
return chat_completion.choices[0].message.content
3. Non-Functional Requirements
3.1 Performance
To ensure good performance, focus on efficient data processing and caching mechanisms.
Implementation Guide:
- Use asynchronous programming for I/O-bound operations:
- Install aiohttp: pip install aiohttp (optional; the async Groq client below manages its own HTTP connections)
- Modify groq_integration.py to use async calls:
import os
import asyncio
from groq import AsyncGroq

async def generate_response_async(prompt):
    async with AsyncGroq(api_key=os.environ["GROQ_API_KEY"]) as client:
        chat_completion = await client.chat.completions.create(
            messages=[{"role": "user", "content": prompt}],
            model="mixtral-8x7b-32768",
            max_tokens=1024,
        )
        return chat_completion.choices[0].message.content

# Usage in main app
response = asyncio.run(generate_response_async(prompt))
- Implement caching for document embeddings:
- Install cachetools: pip install cachetools
- Add caching to document_processor.py:
from cachetools import TTLCache
# Cache for 1 hour, max 100 items
embedding_cache = TTLCache(maxsize=100, ttl=3600)
def get_embedding(text):
if text in embedding_cache:
return embedding_cache[text]
embedding = compute_embedding(text) # Your embedding function
embedding_cache[text] = embedding
return embedding
3.2 Security
Implement secure API key management and ensure document content is securely stored.
Implementation Guide:
- Use environment variables for API keys:
- Create a .env file in the project root (add it to .gitignore)
- Install python-dotenv: pip install python-dotenv
- Load environment variables in your main app:
from dotenv import load_dotenv
import os
load_dotenv()
GROQ_API_KEY = os.getenv('GROQ_API_KEY')
- Basic authentication for the web app:
import streamlit as st
def check_password():
def password_entered():
if st.session_state["password"] == st.secrets["password"]:
st.session_state["password_correct"] = True
del st.session_state["password"]
else:
st.session_state["password_correct"] = False
if "password_correct" not in st.session_state:
st.text_input(
"Password", type="password", on_change=password_entered, key="password"
)
return False
elif not st.session_state["password_correct"]:
st.text_input(
"Password", type="password", on_change=password_entered, key="password"
)
st.error("😕 Password incorrect")
return False
else:
return True
if check_password():
    main()  # Your main app code here (e.g., the main() from Section 2.1)
3.3 Usability
Ensure the user interface is intuitive and provides clear instructions.
Implementation Guide:
- Add tooltips and help text in Streamlit:
st.text_input("Enter your question:", help="Type your question in German or English")
st.selectbox("Select output language", ["German", "English"], help="Choose the language for the answer")
- Implement a simple onboarding flow:
def show_onboarding():
st.markdown("""
# Welcome to Document Chat MVP
Here's how to use this app:
1. Enter your question in the text box
2. Select your preferred answer language
3. Click 'Submit' to get your answer
The AI will search through the document base and provide the most relevant answer.
""")
if st.button("Got it!"):
st.session_state.onboarding_complete = True
if 'onboarding_complete' not in st.session_state:
show_onboarding()
else:
    main()  # Main app code (e.g., the main() from Section 2.1)
4. Technical Stack
This section provides a detailed guide on setting up and using the recommended technical stack for the MVP.
Backend: Python
Python is ideal for NLP and AI tasks due to its rich ecosystem of libraries.
Setup:
- Install Python 3.8+ from https://www.python.org/downloads/
- Set up a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
Frontend: Streamlit
Streamlit allows for rapid MVP development with a Python-based web interface.
Setup:
- Install Streamlit:
pip install streamlit
- Create a requirements.txt file with all dependencies
- Run your app:
streamlit run app.py
Document Processing
We’ll use PyPDF2 for PDF files and python-docx for Word files.
Setup:
- Install libraries:
pip install PyPDF2 python-docx
- Basic usage in document_processor.py:
import PyPDF2
from docx import Document
def process_pdf(file_path):
with open(file_path, 'rb') as file:
reader = PyPDF2.PdfReader(file)
text = ''
for page in reader.pages:
text += page.extract_text()
return text
def process_docx(file_path):
doc = Document(file_path)
return ' '.join([para.text for para in doc.paragraphs])
Embedding Model: Sentence-BERT
Sentence-BERT provides high-quality text embeddings and supports multiple languages.
Setup:
- Install the library:
pip install sentence-transformers
- Basic usage in embedding.py:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
def get_embedding(text):
return model.encode(text)
Vector Database: FAISS
FAISS is efficient for similarity search and works well with Sentence-BERT embeddings.
Setup:
- Install FAISS:
pip install faiss-cpu
- Basic usage in vector_store.py:
import faiss
import numpy as np
class VectorStore:
def __init__(self, dimension):
self.index = faiss.IndexFlatL2(dimension)
self.texts = []
def add(self, embedding, text):
self.index.add(np.array([embedding]))
self.texts.append(text)
def search(self, query_embedding, k=5):
distances, indices = self.index.search(np.array([query_embedding]), k)
return [self.texts[i] for i in indices[0]]
LLM Integration: Groq API
Groq API will be used for generating responses based on retrieved context.
Setup:
- Install the Groq client:
pip install groq
- Basic usage in groq_client.py:
import os
from groq import Groq
client = Groq(api_key=os.environ["GROQ_API_KEY"])
def generate_response(prompt):
completion = client.chat.completions.create(
model="mixtral-8x7b-32768",
messages=[{"role": "user", "content": prompt}],
max_tokens=1024
)
return completion.choices[0].message.content
Putting it All Together
Create a main.py file that integrates all components:
import os
import tempfile
import streamlit as st
from document_processor import process_pdf, process_docx
from embedding import get_embedding
from vector_store import VectorStore
from groq_client import generate_response
# Initialize components
vector_store = VectorStore(384) # Dimension of Sentence-BERT embeddings
# Streamlit UI
st.title("Document Chat MVP")
# File uploader
uploaded_file = st.file_uploader("Choose a file", type=["pdf", "docx"])
if uploaded_file:
    # Process and index the document. The processors expect a file path, so
    # write the uploaded bytes to a temporary file first.
    suffix = os.path.splitext(uploaded_file.name)[1]
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        tmp.write(uploaded_file.getbuffer())
        tmp_path = tmp.name
    if uploaded_file.type == "application/pdf":
        text = process_pdf(tmp_path)
    else:
        text = process_docx(tmp_path)
    embedding = get_embedding(text)
    vector_store.add(embedding, text)
# Query input
query = st.text_input("Enter your question:")
if query:
query_embedding = get_embedding(query)
relevant_texts = vector_store.search(query_embedding)
context = "\n".join(relevant_texts)
prompt = f"Context: {context}\n\nQuestion: {query}\n\nAnswer:"
response = generate_response(prompt)
st.write(response)
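Note that main.py above does not yet wire in the German/English requirement from Section 2.3. One possible (not prescriptive) way to hook in post_processing.py once a response has been generated:

from post_processing import translate_if_needed

output_lang = st.selectbox("Select output language", ["German", "English"])
lang_code = "de" if output_lang == "German" else "en"

# ...after `response = generate_response(prompt)` above:
st.write(translate_if_needed(response, lang_code))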
5. RAG Implementation Guidelines
This section provides detailed instructions on implementing the Retrieval-Augmented Generation (RAG) system for our MVP.
1. Document Ingestion
Create a document_ingestion.py file to handle the document processing pipeline:
import PyPDF2
from docx import Document
import nltk
from sentence_transformers import SentenceTransformer
from vector_store import VectorStore
nltk.download('punkt')
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
vector_store = VectorStore(384) # Dimension of the chosen model
def extract_text(file_path):
if file_path.endswith('.pdf'):
return extract_from_pdf(file_path)
elif file_path.endswith('.docx'):
return extract_from_docx(file_path)
elif file_path.endswith('.txt'):
with open(file_path, 'r', encoding='utf-8') as file:
return file.read()
else:
raise ValueError("Unsupported file type")
def extract_from_pdf(file_path):
with open(file_path, 'rb') as file:
reader = PyPDF2.PdfReader(file)
return ' '.join([page.extract_text() for page in reader.pages])
def extract_from_docx(file_path):
doc = Document(file_path)
return ' '.join([para.text for para in doc.paragraphs])
def chunk_text(text, chunk_size=1000):
sentences = nltk.sent_tokenize(text)
chunks = []
current_chunk = []
current_size = 0
for sentence in sentences:
if current_size + len(sentence) > chunk_size and current_chunk:
chunks.append(' '.join(current_chunk))
current_chunk = []
current_size = 0
current_chunk.append(sentence)
current_size += len(sentence)
if current_chunk:
chunks.append(' '.join(current_chunk))
return chunks
def process_document(file_path):
text = extract_text(file_path)
chunks = chunk_text(text)
for chunk in chunks:
embedding = model.encode(chunk)
vector_store.add(embedding, chunk)
2. Text Embedding
We’ve already integrated the embedding process in the document ingestion step. For query embedding, create a query_processing.py file:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
def embed_query(query):
return model.encode(query)
3. Retrieval Process
Enhance the vector_store.py file to include a more sophisticated retrieval process:
import faiss
import numpy as np
class VectorStore:
def __init__(self, dimension):
self.index = faiss.IndexFlatL2(dimension)
self.texts = []
def add(self, embedding, text):
self.index.add(np.array([embedding]))
self.texts.append(text)
def search(self, query_embedding, k=5):
distances, indices = self.index.search(np.array([query_embedding]), k)
results = []
for i, idx in enumerate(indices[0]):
results.append({
'text': self.texts[idx],
'score': 1 / (1 + distances[0][i]) # Convert distance to similarity score
})
return sorted(results, key=lambda x: x['score'], reverse=True)
4. Context Preparation
Create a context_preparation.py file to handle the selection and formatting of retrieved context:
def prepare_context(search_results, max_tokens=3000):
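    # Note: "tokens" are approximated below by counting whitespace-separated words.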
context = ""
total_tokens = 0
for result in search_results:
if total_tokens + len(result['text'].split()) > max_tokens:
break
context += result['text'] + "\n\n"
total_tokens += len(result['text'].split())
return context.strip()
5. Response Generation
Enhance the groq_client.py file to include prompt formatting:
import os
from groq import Groq
client = Groq(api_key=os.environ["GROQ_API_KEY"])
def generate_response(query, context):
prompt = f"""Given the following context, please answer the question. If the answer is not contained within the context, say "I don't have enough information to answer that question."
Context:
{context}
Question: {query}
Answer:"""
completion = client.chat.completions.create(
model="mixtral-8x7b-32768",
messages=[{"role": "user", "content": prompt}],
max_tokens=1024
)
return completion.choices[0].message.content
6. Post-processing
Create a post_processing.py file to handle translation and formatting:
from googletrans import Translator
translator = Translator()
def translate_if_needed(text, target_lang):
detected_lang = translator.detect(text).lang
if detected_lang != target_lang:
return translator.translate(text, dest=target_lang).text
return text
def format_response(response):
# Add any additional formatting here
return response
6. Groq API Integration
Let’s explore how to effectively integrate the Groq API into our RAG-based MVP. While the setup process is straightforward, paying attention to the implementation details will ensure robust and secure integration.
1. Obtaining and Setting Up the Groq API Key
- Sign up for a Groq account at https://console.groq.com
- Navigate to the API Keys section in the Groq Cloud console
- Click “Create API Key” and give it a descriptive name (e.g., “RAG-MVP”)
- Copy the generated API key immediately and store it securely
2. Secure API Key Management
Create a .env file in the project root directory:
GROQ_API_KEY=your_groq_api_key_here
Add .env to your .gitignore file to prevent accidentally committing it:
echo ".env" >> .gitignore
3. Installing Required Libraries
Install the Groq Python client and python-dotenv:
pip install groq python-dotenv
4. Groq API Integration
Create a new file named groq_integration.py:
import os
from dotenv import load_dotenv
from groq import Groq
# Load environment variables
load_dotenv()
# Initialize Groq client
client = Groq(api_key=os.getenv("GROQ_API_KEY"))
def generate_response(query, context, max_tokens=1024):
prompt = f"""You are an AI assistant tasked with answering questions based on the given context. Please provide a concise and accurate answer to the question. If the information is not available in the context, state that you don't have enough information to answer the question.
Context:
{context}
Question: {query}
Answer:"""
try:
chat_completion = client.chat.completions.create(
messages=[
{
"role": "user",
"content": prompt,
}
],
model="mixtral-8x7b-32768",
max_tokens=max_tokens,
temperature=0.7,
)
return chat_completion.choices[0].message.content
except Exception as e:
print(f"Error generating response: {e}")
return "I apologize, but I encountered an error while generating the response. Please try again later."
def generate_followup_questions(query, context, answer):
prompt = f"""Based on the original question, the provided context, and the given answer, generate three follow-up questions that the user might ask next. These questions should be relevant and help explore the topic further.
Original Question: {query}
Context:
{context}
Answer: {answer}
Generate three follow-up questions:"""
try:
chat_completion = client.chat.completions.create(
messages=[
{
"role": "user",
"content": prompt,
}
],
model="mixtral-8x7b-32768",
max_tokens=200,
temperature=0.8,
)
return chat_completion.choices[0].message.content.split("\n")
except Exception as e:
print(f"Error generating follow-up questions: {e}")
return []
5. Error Handling and Rate Limiting
To handle potential API errors and implement rate limiting, create a new file named api_utils.py:
import time
from functools import wraps
def rate_limit(max_per_minute):
min_interval = 60.0 / max_per_minute
last_called = [0.0]
def decorator(func):
@wraps(func)
def wrapper(*args, **kwargs):
elapsed = time.time() - last_called[0]
left_to_wait = min_interval - elapsed
if left_to_wait > 0:
time.sleep(left_to_wait)
ret = func(*args, **kwargs)
last_called[0] = time.time()
return ret
return wrapper
return decorator
@rate_limit(max_per_minute=60) # Adjust this value based on your API limits
def api_call(func, *args, **kwargs):
max_retries = 3
for attempt in range(max_retries):
try:
return func(*args, **kwargs)
except Exception as e:
if attempt == max_retries - 1:
raise
print(f"API call failed. Retrying... (Attempt {attempt + 1}/{max_retries})")
time.sleep(2 ** attempt) # Exponential backoff
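The wrapper can then be applied around the Groq calls, for example (with generate_response from groq_integration.py, and query and context prepared as in Section 5):

from groq_integration import generate_response

# Rate-limited, retried call to the LLM
answer = api_call(generate_response, query, context)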
7. Testing and Evaluation
Let’s examine the practical implementation of testing and evaluation strategies that ensure our RAG system delivers reliable, production-ready results.
1. Unit Testing
Create a tests directory in your project root and add the following test files:
test_document_processing.py:
import unittest
from document_ingestion import extract_text, chunk_text
class TestDocumentProcessing(unittest.TestCase):
def test_extract_text_pdf(self):
text = extract_text('tests/sample_files/sample.pdf')
self.assertIsInstance(text, str)
self.assertGreater(len(text), 0)
def test_extract_text_docx(self):
text = extract_text('tests/sample_files/sample.docx')
self.assertIsInstance(text, str)
self.assertGreater(len(text), 0)
    def test_chunk_text(self):
        text = "This is a sample text. It should be chunked properly. Let's see if it works correctly."
        chunks = chunk_text(text, chunk_size=20)
        self.assertIsInstance(chunks, list)
        self.assertGreater(len(chunks), 1)
        # chunk_text never splits a sentence, so a chunk may exceed chunk_size
        # by at most one sentence; check only that every chunk is non-empty
        for chunk in chunks:
            self.assertGreater(len(chunk), 0)
if __name__ == '__main__':
unittest.main()
test_embedding.py:
import unittest
import numpy as np
from query_processing import embed_query
class TestEmbedding(unittest.TestCase):
def test_embed_query(self):
query = "What is the capital of France?"
embedding = embed_query(query)
self.assertIsInstance(embedding, np.ndarray)
self.assertEqual(embedding.shape, (384,)) # Assuming 384-dimensional embeddings
if __name__ == '__main__':
unittest.main()
test_vector_store.py:
import unittest
import numpy as np
from vector_store import VectorStore
class TestVectorStore(unittest.TestCase):
def setUp(self):
self.vector_store = VectorStore(384)
def test_add_and_search(self):
embedding = np.random.rand(384)
text = "Sample text"
self.vector_store.add(embedding, text)
results = self.vector_store.search(embedding, k=1)
self.assertEqual(len(results), 1)
self.assertEqual(results[0]['text'], text)
if __name__ == '__main__':
unittest.main()
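Run the unit tests from the project root, e.g. with python -m unittest discover -s tests.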
2. Integration Testing
Create an integration_tests.py file in the tests directory:
import unittest
from document_ingestion import process_document, vector_store
from query_processing import embed_query
from context_preparation import prepare_context
from groq_integration import generate_response

class TestIntegration(unittest.TestCase):
    def setUp(self):
        # process_document indexes chunks into the vector_store defined in
        # document_ingestion, so the test reuses that store
        process_document('tests/sample_files/sample.pdf')
        self.vector_store = vector_store
def test_end_to_end(self):
query = "What is the main topic of the document?"
query_embedding = embed_query(query)
search_results = self.vector_store.search(query_embedding)
context = prepare_context(search_results)
response = generate_response(query, context)
self.assertIsInstance(response, str)
self.assertGreater(len(response), 0)
if __name__ == '__main__':
unittest.main()
3. Performance Testing
Create a performance_tests.py file:
import time
import statistics
from document_ingestion import process_document, vector_store
from query_processing import embed_query
from context_preparation import prepare_context
from groq_integration import generate_response
def measure_processing_time(func, *args):
start_time = time.time()
result = func(*args)
end_time = time.time()
return end_time - start_time, result
def run_performance_tests(num_iterations=10):
    # process_document indexes into the shared vector_store imported from document_ingestion
    process_document('tests/sample_files/sample.pdf')
query = "What is the main topic of the document?"
embedding_times = []
search_times = []
context_prep_times = []
response_gen_times = []
for _ in range(num_iterations):
embed_time, query_embedding = measure_processing_time(embed_query, query)
embedding_times.append(embed_time)
search_time, search_results = measure_processing_time(vector_store.search, query_embedding)
search_times.append(search_time)
context_time, context = measure_processing_time(prepare_context, search_results)
context_prep_times.append(context_time)
response_time, _ = measure_processing_time(generate_response, query, context)
response_gen_times.append(response_time)
print(f"Embedding Time (avg): {statistics.mean(embedding_times):.4f}s")
print(f"Search Time (avg): {statistics.mean(search_times):.4f}s")
print(f"Context Preparation Time (avg): {statistics.mean(context_prep_times):.4f}s")
print(f"Response Generation Time (avg): {statistics.mean(response_gen_times):.4f}s")
if __name__ == '__main__':
run_performance_tests()
4. User Acceptance Testing (UAT)
Create a uat_guide.md file in the project root:
# User Acceptance Testing Guide
## Test Cases
1. Document Upload
- Upload a PDF file
- Upload a DOCX file
- Upload a TXT file
- Attempt to upload an unsupported file type
2. Query Processing
- Ask a question directly related to the uploaded document
- Ask a question partially related to the uploaded document
- Ask a question unrelated to the uploaded document
3. Language Support
- Enter a query in English and select English as the output language
- Enter a query in German and select German as the output language
- Enter a query in English and select German as the output language
4. Response Quality
- Evaluate the relevance of the generated response
- Check if follow-up questions are contextually appropriate
5. Performance
- Measure response time for different types of queries
- Test the system with a large document (e.g., 100+ pages)
## Feedback Form
Please rate the following aspects on a scale of 1-5 (1 being poor, 5 being excellent):
1. Ease of use: [ ]
2. Response accuracy: [ ]
3. Response relevance: [ ]
4. Response time: [ ]
5. Overall user experience: [ ]
Additional comments:
[ ]
8. Deliverables
1. Functional MVP Application
The core deliverable is the functional MVP application. Ensure all components are integrated and working as expected:
- Document ingestion and processing
- Embedding and vector storage
- Query processing
- Context retrieval and preparation
- Response generation using Groq API
- Language support (German and English)
- User interface (Streamlit-based)
2. Source Code with Documentation
Organize the source code in a clear directory structure:
rag-mvp/
├── app.py
├── document_ingestion.py
├── query_processing.py
├── vector_store.py
├── context_preparation.py
├── groq_integration.py
├── post_processing.py
├── api_utils.py
├── evaluation.py
├── requirements.txt
├── .env.example
├── README.md
└── tests/
├── test_document_processing.py
├── test_embedding.py
├── test_vector_store.py
└── integration_tests.py
3. User Guide
Create a USER_GUIDE.md file:
# RAG-based MVP User Guide
## Getting Started
1. Launch the application by running `streamlit run app.py`
2. Open the provided URL in your web browser
## Using the Application
### Uploading Documents
1. Click on the "Choose a file" button
2. Select a PDF, DOCX, or TXT file from your computer
3. Wait for the "Document processed and indexed successfully!" message
### Asking Questions
1. Type your question in the "Enter your question:" text box
2. Select your preferred answer language (German or English)
3. Press Enter or click outside the text box
### Interpreting Results
- The main answer to your question will appear under "Answer:"
- Three follow-up questions will be suggested below the main answer
- You can click on any follow-up question to ask it directly
### Tips for Best Results
- Be specific in your questions
- If you don't get a satisfactory answer, try rephrasing your question
- Upload multiple documents to expand the knowledge base
## Troubleshooting
- If the application is unresponsive, refresh the page and try again
- Ensure your internet connection is stable for API calls to work
- For technical issues, please refer to the README.md file in the project repository
4. Deployment Instructions
Create a DEPLOYMENT.md file:
# Deployment Instructions
## Prerequisites
- Python 3.8+
- pip
- virtualenv (optional but recommended)
## Steps
1. Clone the repository:
   git clone https://github.com/your-repo/rag-mvp.git
   cd rag-mvp
2. Create and activate a virtual environment (optional):
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
3. Install dependencies:
   pip install -r requirements.txt
4. Set up environment variables:
   - Copy `.env.example` to `.env`
   - Add your Groq API key to the `.env` file
5. Run the application:
   streamlit run app.py
5. Sample Document Set
- Create a sample_docs folder in the project root
- Include various document types (PDF, DOCX, TXT)
- Ensure documents are free from copyright restrictions
- Create documentation about included samples
6. Performance and Evaluation Report
Generate a comprehensive report based on the testing results and include metrics for the following (a possible starting point for evaluation.py is sketched after this list):
- Response times
- Accuracy measurements
- User satisfaction scores
- System scalability assessments
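The deliverables tree above lists an evaluation.py module that is not specified elsewhere in this document. As one assumption-laden starting point, it could aggregate the timing measurements from performance_tests.py and summarize the manual UAT ratings for the report:

import statistics
from performance_tests import run_performance_tests

def summarize_ratings(ratings):
    # ratings: list of 1-5 scores from the UAT feedback form
    return {"mean": statistics.mean(ratings), "count": len(ratings)}

if __name__ == "__main__":
    # Response-time metrics come from the performance tests defined in Section 7
    run_performance_tests()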
These deliverables provide the foundation for a production-ready RAG system while maintaining flexibility for future enhancements and customizations.
Technology Stack Analysis: Making Sense of Our Tools
Core Document Processing Libraries
PyPDF2
What it does: Handles PDF file processing and text extraction
Why we chose it: While several PDF processing libraries exist, PyPDF2 offers the sweet spot between functionality and simplicity. It’s a pure Python library, which means:
- No complex dependencies to manage
- Straightforward installation across platforms
- Native text extraction capabilities
# Example of PyPDF2's straightforward implementation
from PyPDF2 import PdfReader
def extract_from_pdf(file_path):
with open(file_path, 'rb') as file:
reader = PdfReader(file)
return ' '.join([page.extract_text() for page in reader.pages])
python-docx
What it does: Processes Microsoft Word documents (.docx files)
Why we chose it: Working with Word documents requires reliable parsing while preserving document structure. Python-docx excels here because it:
- Maintains document hierarchy (paragraphs, sections)
- Handles formatted text effectively
- Provides intuitive access to document elements
# Clean, intuitive API for document processing
from docx import Document
def process_docx(file_path):
doc = Document(file_path)
return ' '.join([para.text for para in doc.paragraphs])
Natural Language Processing Tools
NLTK (Natural Language Toolkit)
What it does: Provides essential text processing capabilities
Why we chose it: For our RAG system’s document chunking needs, NLTK offers battle-tested sentence tokenization. Its advantages include:
- Robust sentence boundary detection
- Multi-language support
- Extensive documentation and community support
import nltk
nltk.download('punkt') # One-time download of tokenization models
def chunk_text(text, chunk_size=1000):
    # Smart sentence boundary detection; Section 5 groups sentences up to chunk_size characters
    return nltk.sent_tokenize(text)
sentence-transformers
What it does: Generates text embeddings for semantic search
Why we chose it: This library makes working with state-of-the-art embedding models accessible. Key benefits:
- Pre-trained multilingual models
- Optimized for semantic similarity tasks
- Seamless integration with popular models
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
embeddings = model.encode(text) # Clean, one-line embedding generation
Vector Storage and Search
FAISS (Facebook AI Similarity Search)
What it does: Enables efficient similarity search for embeddings
Why we chose it: When dealing with document retrieval, performance matters. FAISS provides:
- Blazing-fast similarity search
- Memory-efficient index structures
- Scalability for large document collections
import faiss
import numpy as np
class VectorStore:
def __init__(self, dimension):
self.index = faiss.IndexFlatL2(dimension) # Simple but effective indexing
API Integration and Security
python-dotenv
What it does: Manages environment variables and configuration
Why we chose it: Secure API key management is crucial. Python-dotenv offers:
- Simple configuration management
- Secure credential handling
- Development/production environment separation
from dotenv import load_dotenv
import os
load_dotenv() # Automatically loads environment variables
api_key = os.getenv('GROQ_API_KEY')
Groq Client
What it does: Interfaces with Groq’s LLM API
Why we chose it: For reliable LLM integration, the official client provides:
- Robust error handling
- Rate limiting support
- Streamlined API interactions
from groq import Groq
client = Groq(api_key=os.environ["GROQ_API_KEY"])
# Clean, consistent API for chat completions
Web Interface Development
Streamlit
What it does: Creates web-based user interfaces
Why we chose it: For rapid MVP development, Streamlit is unmatched in:
- Minimal boilerplate code
- Real-time updates
- Built-in widgets and components
- Python-native development
import streamlit as st
def create_interface():
st.title("Document Chat MVP")
query = st.text_input("Your question:") # Interactive elements in one line
Performance Optimization
cachetools
What it does: Implements caching mechanisms
Why we chose it: Efficient caching improves response times through:
- Memory-efficient cache implementations
- Flexible cache policies
- Thread-safe operations
from cachetools import TTLCache
# Time-based caching for expensive operations
embedding_cache = TTLCache(maxsize=100, ttl=3600)
Translation Support
googletrans
What it does: Provides translation capabilities
Why we chose it: For multilingual support, googletrans offers:
- Language detection
- Translation between multiple languages
- No API key requirements for basic usage
from googletrans import Translator
translator = Translator()
translated = translator.translate(text, dest='de').text  # Simple translation API
Testing Framework
pytest
What it does: Enables comprehensive testing
Why we chose it: For maintaining code quality, pytest provides:
- Intuitive test writing
- Powerful fixture system
- Extensive plugin ecosystem
import pytest
from document_ingestion import extract_text

def test_document_processing():
    assert extract_text("tests/sample_files/sample.pdf")  # Clear, expressive tests
These tools work together to create a robust RAG system where:
- Document processing is reliable and efficient
- Semantic search is fast and accurate
- API interactions are secure and manageable
- User interface is responsive and intuitive
- System performance is optimized and monitored
Each component was selected based on real-world implementation needs, balancing functionality with maintainability. This stack provides a solid foundation for both MVP development and future scaling.