Our Methodology
How we source, filter, translate, and maintain the database.
Last updated: May 2, 2026
Step 1: Sourcing
We run scrapers against 7 platforms. Each scraper is incremental: it uses a high-water mark (last scraped ID or timestamp) and fetches only new content on each run. No full backfills on daily cycles.
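The high-water-mark approach can be sketched as follows. This is a minimal illustration, not our actual scraper code; the state file name, the `fetch_since` callback, and the `id` field are all hypothetical stand-ins for the platform-specific details.

```python
import json
from pathlib import Path

STATE_FILE = Path("scrape_state.json")  # hypothetical state file, one watermark per source

def load_watermark(source: str) -> int:
    """Return the last-seen post ID for a source (0 on first run)."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text()).get(source, 0)
    return 0

def save_watermark(source: str, post_id: int) -> None:
    """Persist the new high-water mark after a successful run."""
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    state[source] = post_id
    STATE_FILE.write_text(json.dumps(state))

def incremental_scrape(source: str, fetch_since):
    """Fetch only posts newer than the stored high-water mark."""
    watermark = load_watermark(source)
    new_posts = fetch_since(watermark)  # platform-specific fetcher (assumption)
    if new_posts:
        save_watermark(source, max(p["id"] for p in new_posts))
    return new_posts
```

Because the watermark advances after every run, a second run against the same content fetches nothing, which is what keeps daily cycles cheap.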
| Source | Questions | Type |
|---|---|---|
| LeetCode Discuss | ~22,000 | Interview experiences, questions |
| Reddit (r/cscareerquestions, r/leetcode) | ~9,000 | Interview experiences, discussions |
| 1Point3Acres (1p3a) | ~5,700 | Chinese-language interview experiences |
| LeetCode Company Tags | ~4,000 | Company-frequency tagged problems |
| InterviewDB | ~1,300 | Structured question signals |
| GeeksforGeeks (GFG) | ~1,000 | Interview questions by company |
| Blind | ~62 | High-signal paywalled discussions |
Step 2: Filtering
Not everything scraped makes it into the database. We apply a multi-layer filter pipeline:
1. Junk detection. Posts with no usable interview content (rants, off-topic career advice, pure memes) are flagged `is_junk=1` and excluded from all queries.
2. Deduplication. Near-duplicate questions (same title + company + year) are merged. The highest-quality copy is kept; duplicates are flagged `is_duplicate=1`.
3. Audit classification. A rules-based classifier tags each question with `audit_content_type`, `audit_role`, `audit_topic`, and `audit_quality`. Low-quality entries (broken translations, noise, sub-threshold content) are flagged `dropped=1`.
4. Company slug gating. We maintain a blocklist of ~574 English words that leaked in as "company names" from Reddit post-title parsing. A suspicious-slug heuristic further gates low-thread startups from appearing in public listings.
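Steps 2 and 4 above can be sketched together. This is an illustrative simplification: the tiny `STOPWORD_SLUGS` set stands in for the real ~574-word blocklist, and the normalisation rule is a guess at the general shape of the dedup key, not the production logic.

```python
import re

STOPWORD_SLUGS = {"today", "anyone", "help"}  # tiny stand-in for the ~574-word blocklist

def dedup_key(title: str, company: str, year: int) -> tuple:
    """Normalised (title, company, year) key used to detect near-duplicates."""
    norm = re.sub(r"[^a-z0-9 ]", "", title.lower()).strip()
    return (norm, company.lower(), year)

def is_suspicious_slug(slug: str) -> bool:
    """Gate common English words that leaked in as 'company names'."""
    return slug.lower() in STOPWORD_SLUGS

rows = [
    {"title": "Two Sum!", "company": "Acme", "year": 2026},
    {"title": "two sum", "company": "acme", "year": 2026},  # near-duplicate
]
seen, kept = set(), []
for row in rows:
    key = dedup_key(row["title"], row["company"], row["year"])
    if key in seen:
        row["is_duplicate"] = 1  # flagged, not deleted
    else:
        seen.add(key)
        kept.append(row)
```

Note that duplicates are flagged rather than deleted, so the raw scrape history stays auditable.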
As of May 2026: 53,765 usable rows out of 59,000+ scraped (5,338 dropped, ~1,500 junk, ~600 duplicate).
Step 3: Translation
1p3a and Nowcoder content is written in Mandarin, often with CS-specific slang and abbreviations that generic translation APIs mangle. Our translation pipeline:
- Runs each Chinese post through a translation model with prompt context about tech interviews
- Normalises abbreviations: 算法题 (algorithm question), 系统设计 (system design), OC (offer consideration)
- Flags retranslated content (
retranslated=1) for QA review
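The abbreviation-normalisation step might look like the sketch below. The three-entry glossary is just the examples from the text; the real mapping is assumed to be much larger, and matching longest keys first avoids partial replacements.

```python
# Stand-in glossary built from the examples above; the real mapping is larger (assumption)
GLOSSARY = {
    "算法题": "algorithm question",
    "系统设计": "system design",
    "OC": "offer consideration",
}

def normalise_abbreviations(text: str) -> str:
    """Replace CS-interview slang with English expansions.
    Longest keys are matched first so multi-character terms
    like 系统设计 are never partially replaced."""
    for zh, en in sorted(GLOSSARY.items(), key=lambda kv: -len(kv[0])):
        text = text.replace(zh, en)
    return text

post = "一面是算法题, 二面是系统设计, 最后拿到 OC"
translated = normalise_abbreviations(post)
```

In the real pipeline this runs alongside the translation model, which also receives prompt context about tech interviews.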
Step 4: Enrichment
After filtering, each question goes through LLM-assisted enrichment (Gemini) to extract or infer additional structured metadata.
Enrichment runs nightly and hits a 10K requests/day Gemini quota cap, so the remaining ~48K rows are processed over multiple days. The database is live; enriched rows appear as they are processed.
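The quota arithmetic works out like this. A minimal sketch, assuming the backlog is a flat list of row IDs and each row costs one Gemini request:

```python
DAILY_QUOTA = 10_000  # Gemini requests/day cap

def plan_batches(pending_row_ids: list, quota: int = DAILY_QUOTA) -> list:
    """Split the enrichment backlog into day-sized batches."""
    return [pending_row_ids[i:i + quota] for i in range(0, len(pending_row_ids), quota)]

backlog = list(range(48_000))  # ~48K rows still awaiting enrichment
batches = plan_batches(backlog)
# 48,000 rows at 10,000 requests/day -> 5 nightly runs
```

So at the current quota the remaining backlog clears in about five nightly runs, with enriched rows appearing in the live database after each one.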
Step 5: Freshness
Daily scrape jobs run via a Windows Task Scheduler task and a Fly.io cron-compatible script. Each scraper uses an incremental watermark so only new content is fetched. The database is updated continuously; questions are sorted by recency by default, so the most recently reported content surfaces first.
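The default recency-sorted listing reduces to a simple query. This sketch uses an in-memory SQLite table; the `reported_at` column name is an assumption, while the `dropped` flag comes from the filtering step above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE questions (title TEXT, reported_at TEXT, dropped INT DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO questions (title, reported_at) VALUES (?, ?)",
    [("older question", "2026-03-01"), ("newest question", "2026-05-01")],
)
# Default listing: usable rows only, most recently reported first
rows = conn.execute(
    "SELECT title FROM questions WHERE dropped = 0 ORDER BY reported_at DESC"
).fetchall()
```

ISO-8601 date strings sort correctly as plain text, so no date parsing is needed for the ordering.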
What we don't do
- ✕ Generate synthetic interview questions with AI
- ✕ Fabricate candidate experiences or pass off invented content as "reported by candidates"
- ✕ Publish content candidates asked to keep private
- ✕ Show fake view counts or fake "X people viewing now" signals
- ✕ Pad company counts or question counts beyond real data