Our Methodology

How we source, filter, translate, and maintain the database.

Last updated: May 2, 2026

Step 1: Sourcing

We run scrapers against 7 platforms. Each scraper is incremental: it keeps a high-water mark (the last scraped ID or timestamp) and fetches only new content on each run. No full backfills on daily cycles.

Source                                   | Questions | Type
LeetCode Discuss                         | ~22,000   | Interview experiences, questions
Reddit (r/cscareerquestions, r/leetcode) | ~9,000    | Interview experiences, discussions
1Point3Acres (1p3a)                      | ~5,700    | Chinese-language interview experiences
LeetCode Company Tags                    | ~4,000    | Company-frequency tagged problems
InterviewDB                              | ~1,300    | Structured question signals
GeeksforGeeks (GFG)                      | ~1,000    | Interview questions by company
Blind                                    | ~62       | High-signal paywalled discussions
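The high-water-mark logic above can be sketched as follows. This is a minimal illustration, not our production scraper: the state file path, the `fetch_since` callable, and the post schema (`{"id": ...}`) are all hypothetical stand-ins.

```python
import json
from pathlib import Path

STATE_FILE = Path("scraper_state.json")  # hypothetical location for per-source watermarks

def load_watermark(source: str) -> int:
    """Return the last-seen post ID for a source, or 0 on first run."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text()).get(source, 0)
    return 0

def save_watermark(source: str, post_id: int) -> None:
    """Persist the newest post ID so the next run skips everything older."""
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    state[source] = post_id
    STATE_FILE.write_text(json.dumps(state))

def incremental_scrape(source: str, fetch_since):
    """Fetch only posts newer than the stored high-water mark."""
    mark = load_watermark(source)
    new_posts = fetch_since(mark)  # fetcher returns posts with id > mark
    if new_posts:
        save_watermark(source, max(p["id"] for p in new_posts))
    return new_posts
```

Because the watermark advances after each run, a second run against an unchanged source returns nothing, which is what keeps daily cycles cheap.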

Step 2: Filtering

Not everything scraped makes it into the database. We apply a multi-layer filter pipeline:

  • 1. Junk detection. Posts with no usable interview content (rants, off-topic career advice, pure memes) are flagged is_junk=1 and excluded from all queries.
  • 2. Deduplication. Near-duplicate questions (same title + company + year) are merged. The highest-quality copy is kept; duplicates are flagged is_duplicate=1.
  • 3. Audit classification. A rules-based classifier tags each question with audit_content_type, audit_role, audit_topic, and audit_quality. Low-quality entries (broken translations, noise, sub-threshold content) are flagged dropped=1.
  • 4. Company slug gating. We maintain a blocklist of ~574 English words that leaked in as "company names" from Reddit post-title parsing. A suspicious-slug heuristic additionally keeps companies with very few threads out of public listings.
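The deduplication step (layer 2) can be sketched as a key-based merge. The row fields (`title`, `company`, `year`, `quality`) are hypothetical names for illustration; the real pipeline's quality scoring is more involved.

```python
def dedupe(rows: list[dict]) -> list[dict]:
    """Group rows by (normalised title, company, year); keep the
    highest-quality copy and flag the rest is_duplicate=1."""
    best: dict[tuple, dict] = {}
    for row in rows:
        key = (row["title"].strip().lower(), row["company"], row["year"])
        if key not in best or row["quality"] > best[key]["quality"]:
            best[key] = row
    keep = {id(r) for r in best.values()}
    for row in rows:
        row["is_duplicate"] = 0 if id(row) in keep else 1
    return rows
```

Flagging rather than deleting means duplicates stay queryable for audits while disappearing from public listings.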

As of May 2026: 53,765 usable rows out of 59,000+ scraped (5,338 dropped, ~1,500 junk, ~600 duplicate).

Step 3: Translation

1p3a and Nowcoder content is written in Mandarin, often with CS-specific slang and abbreviations that generic translation APIs mangle. Our translation pipeline:

  • Runs each Chinese post through a translation model with prompt context about tech interviews
  • Normalises abbreviations: 算法题 (algorithm question), 系统设计 (system design), OC (offer consideration)
  • Flags retranslated content (retranslated=1) for QA review
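The abbreviation normalisation step is essentially a glossary substitution applied around the translation model. A minimal sketch, using only the three mappings listed above (the real glossary is larger):

```python
# Hypothetical glossary fragment; the production mapping covers far more slang.
GLOSSARY = {
    "算法题": "algorithm question",
    "系统设计": "system design",
    "OC": "offer consideration",
}

def normalise(text: str) -> str:
    """Replace known interview slang with canonical English phrases."""
    for src, dst in GLOSSARY.items():
        text = text.replace(src, dst)
    return text
```

Applying the glossary deterministically keeps these terms out of the translation model's hands, so they can't be mangled into generic phrasing.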

Step 4: Enrichment

After filtering, each question goes through LLM-assisted enrichment (Gemini) to extract or infer:

  • Clean title
  • Role type (SWE/DS/ML/PM)
  • Interview round
  • Seniority level
  • Topic tags
  • Difficulty
  • LeetCode reference
  • Interview year

Enrichment runs nightly and hits a 10K requests/day Gemini quota cap, so the remaining ~48K rows are processed over multiple days. The database is live; enriched rows appear as they are processed.
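The quota-capped batching described above amounts to slicing the pending queue to the remaining daily budget. A sketch under the numbers stated in this section (10K requests/day, ~48K pending); the function names are illustrative:

```python
import math

DAILY_QUOTA = 10_000  # Gemini requests/day cap

def next_batch(pending_ids: list[int], used_today: int) -> list[int]:
    """Return the IDs to enrich tonight without exceeding the daily quota."""
    budget = max(0, DAILY_QUOTA - used_today)
    return pending_ids[:budget]

def nights_remaining(pending: int) -> int:
    """Estimate how many nightly runs a backlog needs."""
    return math.ceil(pending / DAILY_QUOTA)
```

At ~48K pending rows this works out to five nightly runs, which matches the multi-day rollout described above.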

Step 5: Freshness

Daily scrape jobs run via a Windows Task Scheduler task and a Fly.io cron-compatible script. Each scraper uses an incremental watermark so only new content is fetched. The database is updated continuously; questions are sorted by recency by default, so the most recently reported content surfaces first.
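Putting the filter flags and recency sort together, a public listing query looks roughly like the following. The schema here is a hypothetical SQLite sketch; only the flag column names (is_junk, is_duplicate, dropped) come from the filtering step above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE questions (
    title        TEXT,
    reported_at  TEXT,   -- ISO date, so lexicographic sort == chronological
    is_junk      INTEGER DEFAULT 0,
    is_duplicate INTEGER DEFAULT 0,
    dropped      INTEGER DEFAULT 0)""")
conn.executemany(
    "INSERT INTO questions VALUES (?, ?, ?, ?, ?)",
    [("older question", "2025-11-01", 0, 0, 0),
     ("junk post",      "2026-04-30", 1, 0, 0),
     ("newest question","2026-05-01", 0, 0, 0)])

# Public listing: exclude every flagged row, newest first.
rows = conn.execute("""
    SELECT title FROM questions
    WHERE is_junk = 0 AND is_duplicate = 0 AND dropped = 0
    ORDER BY reported_at DESC""").fetchall()
```

Keeping the flags as columns (rather than deleting rows) lets the same table back both the public listing and internal audit queries.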

What we don't do

  • Generate synthetic interview questions with AI
  • Fabricate candidate experiences or pass off invented content as "experiences reported by candidates"
  • Publish content candidates asked to keep private
  • Show fake view counts or fake "X people viewing now" signals
  • Pad company counts or question counts beyond real data