Our Methodology
How we source, filter, translate, and maintain the database.
Last updated: May 2, 2026
Step 1: Sourcing
We run scrapers against 7 platforms. Each scraper is incremental: it uses a high-water mark (last scraped ID or timestamp) and fetches only new content on each run. No full backfills on daily cycles.
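The high-water-mark approach can be sketched as follows. This is a minimal illustration, not our actual scraper code; the state file name, the `fetch_since` callback, and the `id` field are all hypothetical stand-ins for the platform-specific details.

```python
import json
from pathlib import Path

STATE_FILE = Path("scrape_state.json")  # hypothetical state file, one watermark per source

def load_watermark(source: str) -> int:
    """Return the last-seen post ID for a source (0 on first run)."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text()).get(source, 0)
    return 0

def save_watermark(source: str, post_id: int) -> None:
    """Persist the new high-water mark after a successful run."""
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    state[source] = post_id
    STATE_FILE.write_text(json.dumps(state))

def incremental_scrape(source: str, fetch_since):
    """Fetch only posts newer than the stored high-water mark."""
    watermark = load_watermark(source)
    new_posts = fetch_since(watermark)  # platform-specific fetcher (assumption)
    if new_posts:
        save_watermark(source, max(p["id"] for p in new_posts))
    return new_posts
```

Because the watermark advances after every run, a second run against the same content fetches nothing, which is what keeps daily cycles cheap.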
| Source | Questions | Type |
|---|---|---|
| LeetCode Discuss | ~22,000 | Interview experiences, questions |
| Reddit (r/cscareerquestions, r/leetcode) | ~9,000 | Interview experiences, discussions |
| 1Point3Acres (1p3a) | ~5,700 | Chinese-language interview experiences |
| LeetCode Company Tags | ~4,000 | Company-frequency tagged problems |
| InterviewDB | ~1,300 | Structured question signals |
| GeeksforGeeks (GFG) | ~1,000 | Interview questions by company |
| Blind | ~62 | High-signal paywalled discussions |
Step 2: Filtering
Not everything scraped makes it into the database. We apply a multi-layer filter pipeline:
1. Junk detection. Posts with no usable interview content (rants, off-topic career advice, pure memes) are flagged `is_junk=1` and excluded from all queries.
2. Deduplication. Near-duplicate questions (same title + company + year) are merged. The highest-quality copy is kept; duplicates are flagged `is_duplicate=1`.
3. Audit classification. A rules-based classifier tags each question with `audit_content_type`, `audit_role`, `audit_topic`, and `audit_quality`. Low-quality entries (broken translations, noise, sub-threshold content) are flagged `dropped=1`.
4. Company slug gating. We maintain a blocklist of ~574 English words that leaked in as "company names" from Reddit post-title parsing. A suspicious-slug heuristic further gates low-thread startups from appearing in public listings.
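Steps 2 and 4 above can be sketched together. This is an illustrative simplification: the tiny `STOPWORD_SLUGS` set stands in for the real ~574-word blocklist, and the normalisation rule is a guess at the general shape of the dedup key, not the production logic.

```python
import re

STOPWORD_SLUGS = {"today", "anyone", "help"}  # tiny stand-in for the ~574-word blocklist

def dedup_key(title: str, company: str, year: int) -> tuple:
    """Normalised (title, company, year) key used to detect near-duplicates."""
    norm = re.sub(r"[^a-z0-9 ]", "", title.lower()).strip()
    return (norm, company.lower(), year)

def is_suspicious_slug(slug: str) -> bool:
    """Gate common English words that leaked in as 'company names'."""
    return slug.lower() in STOPWORD_SLUGS

rows = [
    {"title": "Two Sum!", "company": "Acme", "year": 2026},
    {"title": "two sum", "company": "acme", "year": 2026},  # near-duplicate
]
seen, kept = set(), []
for row in rows:
    key = dedup_key(row["title"], row["company"], row["year"])
    if key in seen:
        row["is_duplicate"] = 1  # flagged, not deleted
    else:
        seen.add(key)
        kept.append(row)
```

Note that duplicates are flagged rather than deleted, so the raw scrape history stays auditable.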
As of May 2026: 53,765 usable rows out of 59,000+ scraped (5,338 dropped, ~1,500 junk, ~600 duplicate).
Step 3: Translation
1p3a and Nowcoder content is written in Mandarin, often with CS-specific slang and abbreviations that generic translation APIs mangle. Our translation pipeline:
- Runs each Chinese post through a translation model with prompt context about tech interviews
- Normalises abbreviations: 算法题 (algorithm question), 系统设计 (system design), OC (offer consideration)
- Flags retranslated content (
retranslated=1) for QA review
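The abbreviation-normalisation step might look like the sketch below. The three-entry glossary is just the examples from the text; the real mapping is assumed to be much larger, and matching longest keys first avoids partial replacements.

```python
# Stand-in glossary built from the examples above; the real mapping is larger (assumption)
GLOSSARY = {
    "算法题": "algorithm question",
    "系统设计": "system design",
    "OC": "offer consideration",
}

def normalise_abbreviations(text: str) -> str:
    """Replace CS-interview slang with English expansions.
    Longest keys are matched first so multi-character terms
    like 系统设计 are never partially replaced."""
    for zh, en in sorted(GLOSSARY.items(), key=lambda kv: -len(kv[0])):
        text = text.replace(zh, en)
    return text

post = "一面是算法题, 二面是系统设计, 最后拿到 OC"
translated = normalise_abbreviations(post)
```

In the real pipeline this runs alongside the translation model, which also receives prompt context about tech interviews.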
Step 4: Enrichment
After filtering, each question goes through LLM-assisted enrichment (Gemini) to extract or infer additional structured metadata.
Enrichment runs nightly and hits a 10K requests/day Gemini quota cap, so the remaining ~48K rows are processed over multiple days. The database is live; enriched rows appear as they are processed.
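The quota arithmetic works out like this. A minimal sketch, assuming the backlog is a flat list of row IDs and each row costs one Gemini request:

```python
DAILY_QUOTA = 10_000  # Gemini requests/day cap

def plan_batches(pending_row_ids: list, quota: int = DAILY_QUOTA) -> list:
    """Split the enrichment backlog into day-sized batches."""
    return [pending_row_ids[i:i + quota] for i in range(0, len(pending_row_ids), quota)]

backlog = list(range(48_000))  # ~48K rows still awaiting enrichment
batches = plan_batches(backlog)
# 48,000 rows at 10,000 requests/day -> 5 nightly runs
```

So at the current quota the remaining backlog clears in about five nightly runs, with enriched rows appearing in the live database after each one.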
Step 5: Freshness
Daily scrape jobs run via a Windows Task Scheduler task and a Fly.io cron-compatible script. Each scraper uses an incremental watermark so only new content is fetched. The database is updated continuously; questions are sorted by recency by default, so the most recently reported content surfaces first.
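The default recency-sorted listing reduces to a simple query. This sketch uses an in-memory SQLite table; the `reported_at` column name is an assumption, while the `dropped` flag comes from the filtering step above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE questions (title TEXT, reported_at TEXT, dropped INT DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO questions (title, reported_at) VALUES (?, ?)",
    [("older question", "2026-03-01"), ("newest question", "2026-05-01")],
)
# Default listing: usable rows only, most recently reported first
rows = conn.execute(
    "SELECT title FROM questions WHERE dropped = 0 ORDER BY reported_at DESC"
).fetchall()
```

ISO-8601 date strings sort correctly as plain text, so no date parsing is needed for the ordering.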
What we don't do
- ✕ Generate synthetic interview questions with AI
- ✕ Fabricate candidate experiences or pass off invented content as "reported by candidates"
- ✕ Publish content candidates asked to keep private
- ✕ Show fake view counts or fake "X people viewing now" signals
- ✕ Pad company counts or question counts beyond real data