PufferPhish
A Chrome extension that detects phishing in real time using a hybrid approach: a CodeBERT classifier for URL-level inference and a rule-based engine that catches structural patterns like homoglyphs, brand spoofing, and suspicious subdomains.
Senior capstone project · University of Michigan · 2025
Warning Interstitial
When a URL scores above the danger threshold, the extension redirects the browser to a full-page warning before the page loads. The interstitial explains exactly why the site was flagged: number substitutions imitating known brands, suspicious TLDs, lack of association with legitimate services, and high-risk URL patterns. Users can return to safety or acknowledge the risk and proceed.

Dashboard
The web dashboard aggregates scan activity across all browsing sessions. The overview tracks threats blocked, safe sites visited, warnings issued, and uptime alongside a weekly scan volume chart. The history tab provides a searchable, filterable log of every blocked attempt with threat types, risk scores, and the option to view full reports or whitelist false positives.

How It Works
Every URL you visit passes through two detection layers that run in parallel. The final risk score is the higher of the two layers' scores, and rule-based flags are always appended for explainability, so users see exactly why a URL was flagged.
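The combination step described above can be sketched in a few lines of Python. This is an illustrative sketch, not the project's code: the `LayerResult` shape, the `combine` function, and the 0 to 1 score scale are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LayerResult:
    """Output of one detection layer (hypothetical shape)."""
    score: float                              # risk, assumed to be in [0, 1]
    flags: List[str] = field(default_factory=list)


def combine(ml: LayerResult, rules: LayerResult) -> LayerResult:
    # The final risk is the higher of the two layer scores.
    final_score = max(ml.score, rules.score)
    # Rule-based flags are always carried over, even when the ML score
    # is the one that wins, so the UI can explain the verdict.
    return LayerResult(score=final_score, flags=[*ml.flags, *rules.flags])


result = combine(
    LayerResult(score=0.91),
    LayerResult(score=0.55, flags=["digit substitution", "suspicious TLD"]),
)
print(result.score, result.flags)
```

Taking the maximum means either layer can raise the alarm on its own, while the flags keep the result explainable even when the score comes from the model.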
ML Classification
A CodeBERT model (via ProtectAI's ONNX-optimized checkpoint) classifies URLs into four categories: benign, phishing, malware, and defacement. The model runs as a separate Python service and returns confidence scores for each label, with sub-second inference on warm requests.
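As a rough illustration of how four-class output can become per-label confidence scores and a single risk value, here is a minimal Python sketch. It does not load the real model. The label order, the use of softmax over raw logits, and the definition of risk as one minus the benign confidence are all assumptions for the example.

```python
import math

# Assumed label order; the real checkpoint may order its labels differently.
LABELS = ["benign", "phishing", "malware", "defacement"]


def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_confidences(logits):
    """Map raw model outputs to a {label: confidence} dict."""
    return dict(zip(LABELS, softmax(logits)))


def ml_risk(confidences):
    # Assumption: risk is the probability mass on all non-benign labels.
    return 1.0 - confidences["benign"]


confidences = label_confidences([0.2, 3.1, 0.4, -1.0])  # made-up logits
print(max(confidences, key=confidences.get), round(ml_risk(confidences), 3))
```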
Rule-Based Analysis
The rule engine scores URLs across three weighted dimensions: URL structure patterns like IP-address URLs, excessive subdomains, and URL encoding tricks (40%); domain reputation signals like homoglyph substitutions, phishing keywords, and digit injection (40%); and content indicators like suspicious path patterns, brand spoofing, and non-standard ports (20%).
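The 40/40/20 weighting amounts to a weighted sum of per-dimension scores. The Python sketch below shows the arithmetic only. The dimension names, the `rule_score` function, and the assumption that each dimension is scored from 0 to 1 are illustrative and not taken from the project.

```python
# Weights from the description above; the dictionary keys are hypothetical names.
WEIGHTS = {
    "structure": 0.40,   # IP-address URLs, excessive subdomains, encoding tricks
    "reputation": 0.40,  # homoglyphs, phishing keywords, digit injection
    "content": 0.20,     # suspicious paths, brand spoofing, non-standard ports
}


def rule_score(dimension_scores):
    """Weighted sum of per-dimension scores, each assumed to be in [0, 1]."""
    return sum(
        weight * dimension_scores.get(name, 0.0)
        for name, weight in WEIGHTS.items()
    )


# Made-up example: moderate structural risk, strong reputation signal, clean content.
example = {"structure": 0.5, "reputation": 1.0, "content": 0.0}
print(round(rule_score(example), 2))  # 0.4*0.5 + 0.4*1.0 + 0.2*0.0 = 0.6
```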
Risk Thresholds
| Risk | Action |
|---|---|
| Below 30% | Safe. Green shield icon with scan stats in popup. |
| 30% to 70% | Warning. Amber alert with threat details in popup. |
| Above 70% | Danger. Auto-redirect to full-page warning interstitial. |
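The table maps directly onto a small lookup function. In this Python sketch the function name and return values are hypothetical, and the boundary handling (exactly 30% and exactly 70% both count as a warning) is one reading of the table rather than confirmed behavior.

```python
def action_for(score_percent: float) -> str:
    """Map a risk score in percent to the action shown in the table."""
    if score_percent < 30:
        return "safe"      # green shield icon, scan stats in popup
    if score_percent <= 70:
        return "warning"   # amber alert, threat details in popup
    return "danger"        # auto-redirect to the warning interstitial


for score in (12, 30, 70, 85):
    print(score, action_for(score))
```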
Architecture
The system is a Turbo monorepo with five packages. The Chrome extension's service worker intercepts every tab navigation and sends the URL to a Node.js API running on AWS Lambda. The API checks a 24-hour PostgreSQL cache, then fans out to the ML engine and rule-based analyzer in parallel. Infrastructure is defined as code with AWS CDK: Lambda functions, API Gateway, RDS, S3 for model storage, Cognito for auth, and CloudFront for the dashboard.
Chrome Extension (Manifest v3)
├── Service Worker    analyzes URLs on tab navigation, caches results
├── Popup UI          risk score, threat flags, scan statistics
├── Warning Page      full-page interstitial for blocked sites
└── Content Script    page-level interaction
        │
        ▼
API (Node.js, AWS Lambda)
├── /analyze          validates URL, checks cache, invokes ML + rules
├── /stats            scan counts, threats blocked, recent analyses
├── /settings         auto-block, notifications, domain whitelist
└── /feedback         user corrections for model improvement
        │
        ▼
ML Engine (Python, AWS Lambda)
└── CodeBERT ONNX     4-class URL classification
        │
        ▼
PostgreSQL (AWS RDS)
└── Analyses, URL cache (24h TTL), settings, threat intel
Context
Senior capstone project at the University of Michigan, 2025. I led the team and implemented the full codebase. The detection system combines a pre-trained CodeBERT model from Hugging Face with a hand-tuned rule-based engine, deployed on AWS with CDK-managed infrastructure.