Comprehensive spam detection system implementing both traditional machine learning and modern vector database approaches
🚀 Quick Start • 📖 Documentation • 🛠️ Technologies • 📊 Performance
- Naive Bayes Implementation – Traditional probabilistic classification with 86.74% accuracy
- Vector Database System – Modern FAISS-based similarity search with 98.57% accuracy
- Comparative Analysis – Side-by-side performance evaluation of both approaches
- Real-time Processing – Interactive classification with instant results
- NLTK Integration – Comprehensive natural language processing pipeline
- Multilingual Support – E5-base embeddings for cross-language compatibility
- Smart Preprocessing – Tokenization, stemming, stopword removal, and normalization
- Feature Engineering – Bag-of-words and dense vector representations
- GPU Optimization – CUDA-optimized for high-performance computing
- Memory Management – Efficient batch processing and memory optimization
- Error Handling – Robust fallback mechanisms and graceful degradation
- Modular Design – Clean separation of concerns and reusable components
- Python 3.8+ – Core programming language
- PyTorch – Deep learning framework for embeddings
- Scikit-learn – Machine learning algorithms and utilities
- Pandas – Data manipulation and analysis
- NLTK – Natural language processing toolkit
- HuggingFace Transformers – Pre-trained language models
- Multilingual E5-base – Cross-lingual sentence embeddings
- FAISS – Facebook AI Similarity Search for vector operations
- Naive Bayes – Probabilistic classification algorithm
- NLTK – Natural language processing and text preprocessing
Develop a comprehensive spam classification system that demonstrates both traditional machine learning approaches (Naive Bayes) and modern vector database techniques (FAISS + embeddings). The system should achieve high accuracy while providing insights into the comparative performance of different classification methodologies.
**Data Processing Pipeline** (split sketch below)
- CSV data loading and preprocessing
- Text normalization and cleaning
- Feature extraction and vectorization
- Train/validation/test split with stratification
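A minimal sketch of the split step, assuming the CSV from the project tree and hypothetical `Message`/`Category` column names; the `test_size=0.1` and `random_state=42` values mirror `NB_CONFIG` below:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset; "Message"/"Category" column names are assumptions
df = pd.read_csv("data/2cls_spam_text_cls.csv")
texts, labels = df["Message"], df["Category"]

# Hold out a stratified test set, then carve a validation set out of
# the remainder so both splits keep the original ham/spam ratio
X_temp, X_test, y_temp, y_test = train_test_split(
    texts, labels, test_size=0.1, random_state=42, stratify=labels
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.1, random_state=42, stratify=y_temp
)
```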
**Naive Bayes Implementation** (preprocessing sketch below)
- Bag-of-words feature extraction
- Gaussian Naive Bayes classifier
- NLTK-based text preprocessing
- Traditional ML evaluation metrics
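A sketch of the preprocessing and bag-of-words steps, assuming NLTK's punkt tokenizer, English stopword list, and Porter stemmer; the exact choices in `app.py` may differ:

```python
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("stopwords")

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    # Lowercase, strip punctuation, tokenize, drop stopwords, stem
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [stemmer.stem(t) for t in word_tokenize(text) if t not in stop_words]

def to_bow(tokens, vocab):
    # Dense count vector over a fixed vocabulary; GaussianNB requires
    # dense (not sparse) features
    return [tokens.count(word) for word in vocab]
```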
**Vector Database System** (embedding sketch below)
- Multilingual E5-base embeddings
- FAISS index construction and management
- K-nearest neighbors classification
- Similarity-based spam detection
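A sketch of the embedding step. Prefixing inputs with "query: " and mean-pooling the last hidden state is the standard convention for the E5 model family, but the exact pooling used in `app.py` is an assumption:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-base")
model = AutoModel.from_pretrained("intfloat/multilingual-e5-base")

def embed(texts):
    # E5 models expect a "query: " / "passage: " prefix on each input
    batch = tokenizer(
        ["query: " + t for t in texts],
        padding=True, truncation=True, max_length=512, return_tensors="pt",
    )
    with torch.no_grad():
        out = model(**batch)
    # Mean-pool token embeddings, ignoring padded positions
    mask = batch["attention_mask"].unsqueeze(-1).float()
    pooled = (out.last_hidden_state * mask).sum(1) / mask.sum(1)
    # L2-normalize so FAISS inner product equals cosine similarity
    return F.normalize(pooled, p=2, dim=1)
```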
**Comparative Analysis** (evaluation sketch below)
- Performance metrics comparison
- Processing time analysis
- Accuracy and precision evaluation
- Real-time classification testing
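A sketch of the evaluation harness; `nb_predict` and `vdb_predict` are hypothetical wrappers around the two classifiers:

```python
import time
from sklearn.metrics import accuracy_score, precision_score

def evaluate(name, predict_fn, X_test, y_test):
    # Time one full prediction pass and report accuracy/precision
    start = time.perf_counter()
    y_pred = predict_fn(X_test)
    elapsed = time.perf_counter() - start
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred, pos_label="spam")
    print(f"{name}: accuracy={acc:.4f} precision={prec:.4f} time={elapsed:.2f}s")

# evaluate("Naive Bayes", nb_predict, X_test, y_test)
# evaluate("Vector DB (k=1)", vdb_predict, X_test, y_test)
```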
| Method | Validation Accuracy | Test Accuracy | Processing Time |
|---|---|---|---|
| Naive Bayes | 87.17% | 86.74% | ~2s |
| Vector DB (k=1) | – | 98.57% | ~5s |
| Vector DB (k=3) | – | 98.57% | ~7s |
- Vector Database Approach achieves 11.83 percentage points higher test accuracy than Naive Bayes (98.57% vs. 86.74%)
- FAISS Implementation provides superior performance for similarity-based classification
- Multilingual Embeddings enable cross-lingual spam detection capabilities
- Scalable Architecture supports real-time classification with minimal latency
**Objective:** This analysis demonstrates the effectiveness of modern vector database approaches compared to traditional machine learning methods for spam classification tasks.

**Experimental Setup:** Both systems were evaluated on the same dataset with identical train/test splits, ensuring a fair comparison of methodologies.

**Key Findings:**
- Vector Database approach significantly outperforms Naive Bayes in accuracy
- Embedding-based classification provides better semantic understanding
- FAISS implementation offers scalable similarity search capabilities
**Performance Optimization** (batched-embedding sketch below)
- Batch processing for embedding generation
- GPU acceleration for transformer models
- Efficient FAISS indexing for fast retrieval
- Memory optimization for large-scale processing
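A sketch of batched, GPU-aware embedding generation; `model` and `tokenizer` are the E5 objects from the earlier sketch, and the batch size of 32 mirrors `VECTOR_CONFIG`:

```python
import numpy as np
import torch
from tqdm import tqdm

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

def embed_all(texts, batch_size=32):
    chunks = []
    for i in tqdm(range(0, len(texts), batch_size)):
        batch = tokenizer(
            ["query: " + t for t in texts[i : i + batch_size]],
            padding=True, truncation=True, max_length=512, return_tensors="pt",
        ).to(device)
        with torch.no_grad():  # inference only, no gradient buffers
            out = model(**batch)
        mask = batch["attention_mask"].unsqueeze(-1).float()
        pooled = (out.last_hidden_state * mask).sum(1) / mask.sum(1)
        chunks.append(pooled.cpu().numpy())
    if device.type == "cuda":
        torch.cuda.empty_cache()  # release cached GPU memory between runs
    return np.concatenate(chunks)
```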
- Python 3.8+ with pip
- CUDA-compatible GPU (optional, for faster processing)
- 4GB+ RAM (8GB+ recommended)
- 2GB+ Storage (for models and data)
```bash
# 1. Clone repository
git clone <repository-url>
cd spam-classification-system

# 2. Create a virtual environment
python -m venv venv
# Windows
venv\Scripts\activate
# macOS/Linux
source venv/bin/activate

# 3. Install dependencies
pip install -r requirements.txt
# or install them directly:
pip install pandas scikit-learn nltk gdown transformers torch faiss-cpu tqdm
```
| Use Case | Recommended Method | Rationale |
|---|---|---|
| High Accuracy Required | Vector Database (k=1) | 98.57% accuracy, semantic understanding |
| Fast Processing | Naive Bayes | 86.74% accuracy, ~2s processing time |
| Balanced Performance | Vector Database (k=3) | 98.57% accuracy, robust classification |
| Resource Constrained | Naive Bayes | Lower memory usage, CPU-only processing |
```python
# Vector Database Configuration
VECTOR_CONFIG = {
    "model_name": "intfloat/multilingual-e5-base",
    "embedding_dim": 768,
    "batch_size": 32,
    "k_neighbors": 1,
    "similarity_threshold": 0.7,
}

# Naive Bayes Configuration
NB_CONFIG = {
    "test_size": 0.1,
    "random_state": 42,
    "stratify": True,
    "preprocessing": "nltk",
}
```
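A sketch of how `NB_CONFIG` might drive the split (hypothetical wiring; `features` and `labels` stand in for the bag-of-words matrix and label array):

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    features, labels,
    test_size=NB_CONFIG["test_size"],
    random_state=NB_CONFIG["random_state"],
    stratify=labels if NB_CONFIG["stratify"] else None,
)
```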
```bash
# Run the complete spam classification system
python app.py
```
What it does:
- 📥 Downloads and processes spam dataset
- 🔄 Implements both Naive Bayes and Vector DB approaches
- 🧮 Generates embeddings and builds FAISS index
- 🤖 Performs comparative classification analysis
- 📊 Provides detailed performance metrics
```bash
# Test individual components
python -c "import pandas as pd; print('Data processing OK')"
python -c "import torch; print('PyTorch OK')"
python -c "import faiss; print('FAISS OK')"

# Test model loading
python -c "from transformers import AutoModel; print('Transformers OK')"

# Test classification pipeline
python -c "from sklearn.naive_bayes import GaussianNB; print('Scikit-learn OK')"
```
```
spam-classification-system/
├── 📄 app.py                     # Main application script
├── 📊 aio_project_2_2.ipynb      # Jupyter notebook with detailed analysis
├── 📋 requirements.txt           # Python dependencies
├── 📁 data/                      # Dataset directory
│   └── 2cls_spam_text_cls.csv    # Spam classification dataset
├── 📁 models/                    # Pre-trained models cache
└── 📖 README.md                  # Project documentation
```
- Data Loading: ~1s for 5,572 messages
- Naive Bayes Training: ~2s for complete pipeline
- Embedding Generation: ~15s for full dataset
- FAISS Index Building: ~1s for 5,014 vectors
- Real-time Classification: ~0.1s per message
- Base System: ~500MB RAM
- Model Loading: ~1GB RAM (E5-base)
- FAISS Index: ~50MB for 5K vectors
- Peak RAM: ~2GB during embedding generation
- Local Processing: All data processed locally, no external API calls
- Temporary Storage: Models cached locally for faster subsequent runs
- No Data Persistence: No permanent storage of user messages
- Open Source: Full transparency in implementation
- Use virtual environments for isolation
- Regular dependency updates for security
- Monitor resource usage during processing
- Implement proper error handling
**GPU out-of-memory errors** – Solutions (sketch below):
- Use CPU-only processing: `device = torch.device("cpu")`
- Reduce the batch size for embedding generation
- Clear the GPU cache: `torch.cuda.empty_cache()`
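A minimal sketch of the CPU fallback and cache clearing; both calls are standard PyTorch APIs:

```python
import torch

# Fall back to CPU when no CUDA device is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Free cached GPU memory after large batches
if device.type == "cuda":
    torch.cuda.empty_cache()
```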
**NLTK download errors** – Solutions:
- Manual download: `python -c "import nltk; nltk.download('all')"`
- Check your internet connection
- Use offline NLTK data (see the sketch below)
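A lighter-weight alternative to `nltk.download('all')`; the exact resources the pipeline needs (punkt, stopwords) are assumptions based on the preprocessing steps listed above:

```python
import nltk

# Download only the resources the pipeline likely needs
nltk.download("punkt")
nltk.download("stopwords")

# Offline alternative: point NLTK at a pre-populated local directory
nltk.data.path.append("/path/to/nltk_data")
```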
**FAISS installation issues** – Solutions:
- Install the CPU version: `pip install faiss-cpu`
- For GPU support: `pip install faiss-gpu`
- Check system compatibility
**Memory Optimization**
- Use smaller batch sizes for embedding generation
- Implement gradient checkpointing
- Clear unused variables and tensors
**Speed Optimization**
- Enable GPU acceleration when available
- Use FAISS GPU indices for large datasets
- Implement caching for repeated operations
**Accuracy Optimization** (k-sweep sketch below)
- Fine-tune embedding models on domain data
- Experiment with different k values for KNN
- Implement ensemble methods
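A sketch of a k sweep for the FAISS kNN classifier, as suggested in the list above; `index`, `val_embeddings`, `train_labels`, and `val_labels` are assumed to come from the earlier steps:

```python
from collections import Counter

for k in (1, 3, 5, 7):
    # Retrieve the k nearest training vectors for each validation message
    _, neighbor_ids = index.search(val_embeddings, k)
    # Majority vote over the neighbors' labels
    preds = [
        Counter(train_labels[i] for i in row).most_common(1)[0][0]
        for row in neighbor_ids
    ]
    acc = sum(p == y for p, y in zip(preds, val_labels)) / len(val_labels)
    print(f"k={k}: validation accuracy {acc:.4f}")
```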
We welcome contributions!
- Fork the repository
- Create a feature branch: `git checkout -b feature/amazing-feature`
- Commit changes: `git commit -m 'Add amazing feature'`
- Push the branch: `git push origin feature/amazing-feature`
- Open a Pull Request
- Python: PEP 8 + Black formatter
- Commits: Conventional Commits
- Documentation: Docstrings for all functions
- Testing: Unit tests for critical functions
```python
# Naive Bayes Classification
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder

# Encode the string labels ("ham"/"spam") as integers
label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(y_train_raw)  # y_*_raw: raw label arrays
y_test = label_encoder.transform(y_test_raw)

# Train on bag-of-words features and predict
model = GaussianNB()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```
```python
# Vector Database Classification
import faiss
from transformers import AutoModel, AutoTokenizer

# Initialize the embedding model (multilingual E5-base, 768-dim)
model = AutoModel.from_pretrained("intfloat/multilingual-e5-base")
tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-base")

# Build a FAISS inner-product index; with L2-normalized embeddings,
# inner product is equivalent to cosine similarity
embedding_dim = 768
index = faiss.IndexFlatIP(embedding_dim)
index.add(embeddings.astype("float32"))  # embeddings: (n, 768) NumPy array
```
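A usage sketch for classifying a new message with the index above; `embed` is the assumed embedding helper from the earlier sketch and `train_labels` holds the labels of the indexed vectors:

```python
from collections import Counter

def classify(message, k=3):
    # Embed the query, find its k nearest training vectors,
    # and take a majority vote over their labels
    query = embed([message]).numpy().astype("float32")
    _, neighbor_ids = index.search(query, k)
    votes = Counter(train_labels[i] for i in neighbor_ids[0])
    return votes.most_common(1)[0][0]

print(classify("Congratulations! You won a free prize, click here"))
```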
```python
# Custom embedding strategy
def custom_embedding_pipeline(texts):
    # Implement custom embedding logic
    pass

# Integration with other frameworks
def integrate_with_streamlit():
    # Web interface integration (see the Streamlit sketch below)
    pass
```
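One possible direction for `integrate_with_streamlit`: a minimal, hypothetical Streamlit wrapper around the `classify` helper sketched earlier:

```python
import streamlit as st

st.title("Spam Classification System")
message = st.text_area("Enter a message to classify")

if st.button("Classify") and message:
    label = classify(message)  # classify: kNN helper from the usage sketch
    st.success(f"Prediction: {label}")
```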
MIT License – see LICENSE for details.
- Hugging Face – Transformers and pre-trained models
- Facebook AI Research – FAISS similarity search
- Scikit-learn – Machine learning algorithms
- NLTK – Natural language processing toolkit
- PyTorch – Deep learning framework
GitHub: https://github.com/Enigmask22/Advanced-Spam-Classification-System-2.2