This project is a Retrieval-Augmented Generation (RAG) Tool designed to help users better understand scientific studies and avoid misinformation propagated by clickbait or hyperbolic posts. The tool uses free and open-source technologies to provide insights, detect misinformation, and promote critical thinking when analyzing scientific claims.
Key features include:
- NLP with Hugging Face SciBERT to understand scientific text and communicate effectively with users from diverse socioeconomic backgrounds.
- FAISS for fast vector similarity search, ensuring quick retrieval of relevant study chunks.
- MongoDB for persistent storage of metadata, enabling advanced indexing by topic, discipline, and other dimensions.
- Python FastAPI to power backend services with minimal overhead and fast execution.
- Next.js for the frontend, enabling an interactive, SEO-friendly interface that supports RAG-based chat, multi-user support, and more.
The tool includes features like chunking, misinformation detection, bias awareness, and user feedback to iteratively improve accuracy and relevance. It supports multiple users, user invitations, and contribution forms for submitting scientific studies and articles for analysis.
The project emphasizes accessibility, transparency, and critical thinking by encouraging users to cross-reference multiple studies and learn from diverse perspectives.
# Retrieval-Augmented Generation (RAG) Tool for Scientific Literacy
## Introduction
This RAG tool is designed to combat misinformation by providing a clear and unbiased understanding of scientific studies. It leverages state-of-the-art Natural Language Processing (NLP) and information retrieval techniques to break down complex research papers and articles into accessible insights. The tool fosters critical thinking and transparency in science communication, making it valuable for individuals, researchers, and organizations aiming to understand the facts behind the headlines.
## Features
### 1. **Core Functionalities**
- **RAG-Based Chat**:
  - Engages users in meaningful conversations about scientific studies.
  - Allows users to **persist conversations** and invite others to participate.
  - Multi-user support with admin privileges to manage access.
- **Misinformation Detection**:
  - Identifies and flags misleading claims in scientific articles.
  - Provides context-aware summaries to highlight biases or gaps.
- **Bias Awareness**:
  - Informs users about potential biases in studies or claims.
- **User Feedback Loop**:
  - Users can rate responses for relevance, accuracy, and helpfulness to iteratively improve the system.
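The feedback loop above could be sketched as follows. This is a minimal illustration, not the project's actual schema: the `Feedback` class, the 1-5 rating scale, and `summarize_feedback` are all assumed names, chosen only to show how ratings on the three dimensions might be averaged per response.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Feedback:
    """One user's rating of a single response (hypothetical 1-5 scale)."""
    response_id: str
    relevance: int
    accuracy: int
    helpfulness: int


def summarize_feedback(items: list[Feedback]) -> dict[str, dict[str, float]]:
    """Average each rating dimension per response and count the ratings."""
    grouped: dict[str, list[Feedback]] = {}
    for item in items:
        grouped.setdefault(item.response_id, []).append(item)
    return {
        response_id: {
            "relevance": mean(f.relevance for f in group),
            "accuracy": mean(f.accuracy for f in group),
            "helpfulness": mean(f.helpfulness for f in group),
            "count": len(group),
        }
        for response_id, group in grouped.items()
    }
```

Responses with a low average on any dimension would be natural candidates for review, though how the scores feed back into retrieval is left open here.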
### 2. **Data Ingestion**
- **Article Submission**:
  - Submit article/blog links for review.
  - Cross-check citations to ensure alignment with claims.
- **Scientific Study Submission**:
  - Upload PDFs or provide links to studies.
  - Metadata stored for retrieval (e.g., topic, discipline).
- **Web Scraping**:
  - Uses BeautifulSoup to scrape and ingest content from websites.
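To illustrate the ingestion step, here is a rough sketch of pulling paragraph text and link targets (candidate citations) out of a submitted page. The project itself uses BeautifulSoup; this sketch deliberately swaps in Python's standard-library `html.parser` so it runs without third-party packages. `ArticleExtractor` and `extract_article` are hypothetical names, not part of the codebase.

```python
from html.parser import HTMLParser


class ArticleExtractor(HTMLParser):
    """Collect paragraph text and link targets from an HTML document."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.paragraphs: list[str] = []
        self.links: list[str] = []
        self._in_paragraph = False
        self._skip_depth = 0
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "p":
            self._in_paragraph = True
            self._buffer = []
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "p" and self._in_paragraph:
            # Collapse runs of whitespace and drop empty paragraphs.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph and self._skip_depth == 0:
            self._buffer.append(data)


def extract_article(html: str) -> tuple[list[str], list[str]]:
    """Return (paragraphs, links) found in the given HTML string."""
    parser = ArticleExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs, parser.links
```

The collected links could then be compared against the studies an article claims to cite. Fetching the page and handling malformed markup are out of scope for this sketch.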
### 3. **Text Processing**
- **Chunking**:
  - Breaks studies into smaller chunks for embedding.
  - Uses overlapping windows to maintain context during retrieval.
- **Indexing**:
  - FAISS ensures fast vector search for chunked data.
  - MongoDB stores metadata with indexing by topic, discipline, and other dimensions.
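The overlapping-window chunking described above might look roughly like this. It is a word-based sketch under stated assumptions: the function name `chunk_words` and the default `chunk_size` and `overlap` values are illustrative, and the project may well chunk by tokens or sentences instead.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into word-based chunks; consecutive chunks share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

Because each window starts `chunk_size - overlap` words after the previous one, a sentence that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.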
### 4. **Feedback and Continuous Improvement**
- Captures common user queries to improve FAQ-like responses.
- Encourages users to cross-reference studies for a balanced understanding.
---
## Technical Stack
### **Frontend**
- **Next.js**: Provides a modern, SEO-optimized, and interactive interface for:
  - Chat-based interactions.
  - User forms for invitations and submissions.
- Deployment on **Vercel (free tier)** with a custom domain.
### **Backend**
- **Python FastAPI**: Powers the backend with:
  - API routes for interacting with FAISS and MongoDB.
  - Minimal overhead, fast execution, and modern API design.
- **BeautifulSoup**: Scrapes content from submitted links.
### **Data Storage and Retrieval**
- **FAISS**: Ensures high-speed vector similarity search for study embeddings.
- **MongoDB**: Stores metadata for persistent indexing and search.
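To make the division of labor concrete, the sketch below shows how a vector lookup and a metadata lookup can be joined on a shared chunk ID. It is a dependency-free stand-in, not the project's code: the brute-force cosine search plays the role FAISS fills (FAISS does this far more efficiently), and plain dictionaries stand in for MongoDB documents. The names `cosine`, `search`, and the metadata fields are assumptions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query, vectors, metadata, k=2):
    """Return the k most similar chunks, each merged with its stored metadata."""
    scored = sorted(
        ((cosine(query, vec), chunk_id) for chunk_id, vec in vectors.items()),
        reverse=True,
    )
    return [
        {"chunk_id": chunk_id, "score": score, **metadata.get(chunk_id, {})}
        for score, chunk_id in scored[:k]
    ]


# Toy data: chunk ID -> embedding, and chunk ID -> metadata document.
vectors = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.7, 0.7]}
metadata = {
    0: {"topic": "nutrition", "discipline": "medicine"},
    1: {"topic": "sleep", "discipline": "psychology"},
    2: {"topic": "exercise", "discipline": "medicine"},
}
results = search([1.0, 0.1], vectors, metadata, k=2)
```

The key design point is that both stores must agree on the chunk ID: an entry added to or deleted from one and not the other leaves results with missing or stale metadata.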
### **NLP**
- **Hugging Face SciBERT**:
  - Pretrained on biomedical and scientific text.
  - Helps present study content in terms accessible to users with diverse backgrounds and educational levels.
---
## Deployment
### **Frontend**
- Hosted on **Vercel** (free tier) with a custom domain.
- Provides global content delivery, automatic SSL, and seamless integration with Next.js.
### **Backend**
- Options for hosting:
  1. **Render (free tier)**:
     - Ideal for Python-based FastAPI services.
  2. **Fly.io**:
     - Supports free-tier deployments with scalability options.
  3. **Deta**:
     - Free for lightweight FastAPI apps.
---
## Strengths of the Chosen Approach
### **1. Scalability and Speed**
- **FAISS**: Handles large datasets quickly, ensuring a smooth user experience.
- **MongoDB**: Persistent metadata allows efficient indexing and retrieval.
### **2. Accessibility**
- **Hugging Face SciBERT**: Tailored for scientific language, breaking down barriers for users from diverse backgrounds.
### **3. Developer Productivity**
- **Next.js**: Simplifies frontend development with built-in SSR and API routes.
- **FastAPI**: Reduces backend complexity with its modern design and fast execution.
### **4. User Engagement**
- Multi-user chat capabilities and forms for contributions foster collaboration and community building.
---
## Weaknesses of the Approach
### **1. Resource Limitations**
- **FAISS and MongoDB**: Combining these adds complexity, especially in synchronization and scalability.
- **Free Hosting**: Limited resources on free-tier platforms may impact performance during high usage.
### **2. Learning Curve**
- Next.js and FastAPI have slight learning curves, particularly for developers unfamiliar with SSR/modern APIs.
### **3. NLP Model Limitations**
- Hugging Face models like SciBERT might require fine-tuning for optimal performance with specific data.
---
## Getting Started
### **1. Clone the Repository**
```bash
git clone https://github.com/your-username/scientific-rag-tool.git
cd scientific-rag-tool
```
We provide setup scripts for both Unix-based systems and Windows.
For Windows:
```powershell
# Open PowerShell as Administrator and first enable script execution (if not already enabled)
Set-ExecutionPolicy RemoteSigned -Scope CurrentUser
# Run the setup script
.\setup.ps1
```
For Unix-based systems (Linux/macOS):
```bash
# Make the script executable
chmod +x setup.sh
# Run the setup script
./setup.sh
```
The setup scripts will:
- Create and activate a Python virtual environment
- Check for GPU availability
- Install appropriate dependencies based on your choice:
  - Production: Minimal dependencies for running the application
  - Development: Includes testing tools, documentation generators, and development utilities
- Create a `.env` file from the template
For production:
```bash
# Create virtual environment
python -m venv venv
# Activate virtual environment
# On Windows (PowerShell):
.\venv\Scripts\Activate.ps1
# On Windows (Command Prompt):
venv\Scripts\activate.bat
# On Unix-based systems:
source venv/bin/activate
# Install production dependencies
pip install -r requirements.txt
```
For development:
```bash
# Create virtual environment
python -m venv venv
# Activate virtual environment
# On Windows (PowerShell):
.\venv\Scripts\Activate.ps1
# On Windows (Command Prompt):
venv\Scripts\activate.bat
# On Unix-based systems:
source venv/bin/activate
# Install development dependencies
pip install -r requirements-dev.txt
```
Copy the example environment file:
```bash
cp .env.example .env
```
Edit `.env` with your configuration values.
Start the FastAPI backend:
```bash
uvicorn app.main:app --reload
```
Start the Next.js frontend:
```bash
npm run dev
```
When installed with development dependencies (`requirements-dev.txt`), you have access to:
- Testing: `pytest` for running tests
- Code Formatting: `black` and `isort` for consistent code style
- Type Checking: `mypy` for static type analysis
- Documentation: `mkdocs` for generating documentation
- Debugging: Support through `debugpy`
- Notebooks: Jupyter notebooks for experimentation
Example development commands:
```bash
# Run tests
pytest
# Format code
black .
isort .
# Type checking
mypy .
# Generate documentation
mkdocs serve
```
Manage the cache with `scripts/manage_cache.py`:
```bash
# Clean up the cache with a maximum age of 7 days, 24 hours, or 30 minutes
python scripts/manage_cache.py --action cleanup --max-age 7d
python scripts/manage_cache.py --action cleanup --max-age 24h
python scripts/manage_cache.py --action cleanup --max-age 30m
# Same cleanup, with JSON output
python scripts/manage_cache.py --action cleanup --max-age 7d --format json
# Show cache statistics
python scripts/manage_cache.py --action stats
# Clear only the model cache, or all caches
python scripts/manage_cache.py --action clear --cache-type model
python scripts/manage_cache.py --action clear --cache-type all
```
Get JSON output for automation:
```bash
python scripts/manage_cache.py --action stats --format json
```
Try clearing the cache and running the migration again:
```bash
python scripts/manage_cache.py --action clear
python -m app.migrations.run_migrations
```
Follow deployment instructions for Vercel (frontend) and Render/Fly.io (backend).
Contributions are welcome! Please open an issue or submit a pull request for improvements.
For questions or inquiries, reach out at [email protected]. Chat with Brand Anthony McDonald in real-time by visiting https://i.brandanthonymcdonald.com/portfolio
Made with ❤️ by [BAM](https://i.brandanthonymcdonald.com/portfolio)