How LeakCode Works

LeakCode is not a static dataset. It is a live pipeline that runs every day, collecting, translating, deduplicating, and classifying interview questions from 7 primary sources. This page explains exactly how that works, end to end.

Collection: Scraping Seven Sources

Every day, a set of scrapers runs against LeakCode's 7 primary data sources. Each scraper is purpose-built for its target. They use a watermark-based incremental approach: each run records how far it got, and the next run fetches only content added since then rather than re-crawling the full backlog. This keeps daily scrape volume manageable and reduces the risk of rate-limit bans.
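The watermark idea can be sketched in a few lines. This is a minimal illustration, not LeakCode's scraper code: the Post type, the fetch_incremental function, and the sample data are all hypothetical, and a real scraper would persist the watermark between runs and pull posts over the network.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Post:
    post_id: str
    created_at: datetime
    body: str


def fetch_incremental(posts, watermark):
    """Return posts newer than the watermark, plus the advanced watermark.

    `posts` stands in for whatever a real scraper would pull from its source.
    """
    fresh = sorted(
        (p for p in posts if p.created_at > watermark),
        key=lambda p: p.created_at,
    )
    # Advance the watermark only as far as the newest post actually seen,
    # so a run that finds nothing leaves it unchanged.
    new_watermark = fresh[-1].created_at if fresh else watermark
    return fresh, new_watermark


def utc(year, month, day):
    return datetime(year, month, day, tzinfo=timezone.utc)


source_posts = [
    Post("a1", utc(2026, 5, 1), "older post"),
    Post("a2", utc(2026, 5, 3), "newer post"),
    Post("a3", utc(2026, 5, 4), "newest post"),
]

fresh, watermark = fetch_incremental(source_posts, utc(2026, 5, 2))
print([p.post_id for p in fresh], watermark.date())
```

Running the same call again with the returned watermark yields nothing new, which is what keeps each daily run small.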

The 7 sources currently in production:

1Point3Acres (1p3a): Authenticated scraper for premium forum posts. Source key: 1p3a.
1p3a OJ Catalog: The 1Point3Acres company-tagged coding problem catalog. 3,553 structured problems with sample I/O. Source key: 1p3a_oj.
LeetCode Premium: Company-tagged problem lists. Source keys: leetcode and lc_company.
Blind: Anonymous interview experiences from tech professionals. Source key: blind.
Glassdoor: Interview reviews across thousands of companies. Source key: glassdoor.
GeeksforGeeks: Community-submitted experiences, strong for mid-market and Indian tech. Source key: gfg.
Reddit: Filtered posts from r/cscareerquestions, r/ExperiencedDevs, and r/leetcode. Source key: reddit.

Raw output from each scraper is written to a source-specific staging table before entering the transform layer. Nothing hits the unified database until it has been processed.

Learn more about each source on the LeakCode Sources page.

Transform: Normalization and Deduplication

Raw data from 7 different sources is messy. Company names are inconsistent. The same question appears on multiple platforms with slightly different wording. Metadata fields are missing or formatted differently across sources.

The transform layer handles three normalization jobs:

Company name canonicalization. LeakCode maintains a canonical company name registry with 2,000+ official names and 2,300+ alias mappings. "Goog", "Google LLC", "Google (Mountain View)", and "GOOGL" all resolve to google. The registry is hand-curated and maintained in code rather than stored as a database lookup table.
Cross-source deduplication. The same question or experience often appears on multiple platforms. A fuzzy matching pass identifies near-duplicate entries across sources and marks them as duplicates. Only the highest-quality version is surfaced in search results.
Type classification. Every entry is classified as one of: coding question, behavioral question, system design question, take-home assignment, or interview experience narrative. This drives the filtering UI on Browse Companies.
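The first two jobs can be illustrated with the Python standard library. This is a sketch under assumptions: the ALIASES dictionary is a tiny hypothetical slice of a registry, difflib's SequenceMatcher stands in for whatever fuzzy matcher the pipeline actually uses, and the 0.85 threshold is an illustrative value.

```python
import re
from difflib import SequenceMatcher

# Hypothetical slice of an alias registry; the real one is far larger.
ALIASES = {
    "goog": "google",
    "google llc": "google",
    "google (mountain view)": "google",
    "googl": "google",
}
CANONICAL = set(ALIASES.values())


def canonical_company(raw):
    """Map a raw company string to its canonical key, or None if unknown."""
    key = re.sub(r"\s+", " ", raw.strip().lower())
    if key in ALIASES:
        return ALIASES[key]
    return key if key in CANONICAL else None


def normalize_text(text):
    """Lowercase and strip punctuation so trivial differences do not matter."""
    return re.sub(r"[^a-z0-9 ]+", "", text.lower()).strip()


def is_near_duplicate(a, b, threshold=0.85):
    """Treat two question texts as duplicates above a similarity threshold."""
    ratio = SequenceMatcher(None, normalize_text(a), normalize_text(b)).ratio()
    return ratio >= threshold


print(canonical_company("Google (Mountain View)"))
print(is_near_duplicate(
    "Implement an LRU cache with O(1) get and put.",
    "Implement LRU cache with O(1) get and put",
))
```

Pairwise comparison like this scales quadratically, so a pipeline at this volume would likely bucket candidates first (for example by canonical company) before comparing texts.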

Enrichment: Translation and Metadata Tagging

Most 1Point3Acres premium content is written in Chinese. This is the core access barrier LeakCode removes. The enrichment layer runs Chinese-language content through an LLM translation pipeline, then applies post-processing heuristics to decode interview-specific slang, expand abbreviations (OOD, SD, BQ, LC), and add missing context.
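As an illustration of the abbreviation step only, the snippet below appends an expansion after each known abbreviation. The expansions in ABBREVIATIONS are assumed readings of the four abbreviations named above, and expand_abbreviations is a hypothetical helper; the production glossary and the slang-decoding heuristics are not shown here.

```python
import re

# Assumed expansions; the production glossary may differ or be much larger.
ABBREVIATIONS = {
    "OOD": "object-oriented design",
    "SD": "system design",
    "BQ": "behavioral question",
    "LC": "LeetCode",
}

# Word boundaries keep "SD" from matching inside longer tokens such as "SDE".
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b")


def expand_abbreviations(text):
    """Append the expansion after each known abbreviation, keeping the original."""
    return _PATTERN.sub(
        lambda m: f"{m.group(1)} ({ABBREVIATIONS[m.group(1)]})", text
    )


print(expand_abbreviations("Round 2 was SD, round 3 was BQ plus one LC medium."))
```

Keeping the original abbreviation next to its expansion means readers who already know the slang can still search for it.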

Every translated entry is checked by a quality heuristic that looks at translation confidence, character ratio, and structural coherence. Entries that fail the heuristic are flagged for human review rather than silently published.
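One plausible reading of the three signals named above is sketched below. Everything here is an assumption made for illustration: "character ratio" is interpreted as the share of untranslated Chinese characters left in the output, "structural coherence" is reduced to a crude sentence-ending check, the confidence score is taken as an input, and both thresholds are invented.

```python
def cjk_ratio(text):
    """Fraction of non-space characters in the CJK Unified Ideographs block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    cjk = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return cjk / len(chars)


def needs_human_review(translated, confidence,
                       min_confidence=0.7, max_cjk_ratio=0.1):
    """Flag a translation when any of three illustrative checks fails."""
    low_confidence = confidence < min_confidence
    untranslated = cjk_ratio(translated) > max_cjk_ratio
    # Crude structural check: non-empty and ends like a finished sentence.
    stripped = translated.strip()
    incoherent = not stripped or stripped[-1] not in ".!?)\"'"
    return low_confidence or untranslated or incoherent


print(needs_human_review("The interviewer asked me to design a rate limiter.", 0.92))
print(needs_human_review("The interviewer asked 面试官让我设计限流器", 0.92))
```

The function returns a flag rather than dropping the entry, matching the stated behavior of routing failures to human review.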

Enrichment also assigns metadata tags to every entry:

Role. Software Engineer, Product Manager, Data Scientist, Machine Learning Engineer, Engineering Manager, Quantitative Finance, and more.
Round. Phone Screen, Technical Screen, Onsite, System Design, Behavioral, Take-Home, Bar Raiser.
Seniority. New Grad, Junior, Mid, Senior, Staff, Principal.
Topic. Algorithms, System Design, Dynamic Programming, Graphs, Trees, Arrays, Behavioral, and 15+ more.

These tags are what power the LeakCode search filters. They let you find exactly what Google asks senior SWEs in system design rounds, rather than browsing through everything.

Audit: Quality Filtering

Not everything collected is worth showing. The audit layer applies a multi-pass filter to catch junk before it reaches the unified database.

The audit pipeline checks for:

Broken translations. Garbled output, missing sentence endings, character encoding errors.
Spam and off-topic posts. Automated posts, vendor promotions, unrelated content scraped due to keyword matching.
Thin content. Entries shorter than a minimum length threshold that provide no useful signal.
Blocklisted patterns. A regularly updated blocklist of phrases and patterns that indicate junk at scale.
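A multi-pass filter of this kind might look like the sketch below. The length threshold, the blocklist patterns, and the reason labels are hypothetical, and only three of the four checks are modeled: spam detection beyond simple pattern matching would need a classifier that is out of scope here.

```python
import re

MIN_LENGTH = 40  # assumed threshold, in characters

# Hypothetical blocklist entries, compiled once and matched case-insensitively.
BLOCKLIST = [re.compile(p, re.IGNORECASE) for p in (
    r"use my referral code",
    r"dm me for (the )?answers",
)]


def audit(entry_text):
    """Return the reason an entry is dropped, or None if it passes."""
    text = entry_text.strip()
    if "\ufffd" in text:
        return "broken_translation"  # U+FFFD marks a character decoding failure
    if len(text) < MIN_LENGTH:
        return "thin_content"
    if any(p.search(text) for p in BLOCKLIST):
        return "blocklisted"
    return None


entries = [
    "Asked to merge k sorted lists, then a follow-up on memory limits.",
    "LC medium.",
    "Great prep course! Use my referral code for 20% off the full bundle.",
    "Design a rate limiter \ufffd\ufffd for an API gateway with burst support.",
]
reasons = [audit(e) for e in entries]
drop_rate = sum(r is not None for r in reasons) / len(entries)
print(reasons, drop_rate)
```

Returning a reason instead of a bare boolean makes it straightforward to report drop rates per check and per source.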

The current overall drop rate is approximately 10 percent. Content from 1p3a OJ is pre-structured and has a near-zero drop rate. Reddit content has the highest drop rate due to looser posting norms on public forums.

Entries that pass audit are written to unified.db, the single SQLite database that powers the LeakCode frontend. As of May 2026, unified.db contains 59,970 entries.

Serving: Search, Filter, and Frequency Ranking

The unified database is served by a FastAPI backend with full-text search and faceted filtering. Every query supports filtering by company, role, round, seniority, topic, source, and time window.
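Faceted filtering over a SQLite table can be sketched with the standard sqlite3 module. The schema, the build_query helper, and the sample rows are hypothetical, the FastAPI layer is left out, and a plain LIKE match stands in for real full-text search. The time-window facet is also omitted for brevity.

```python
import sqlite3

# Columns allowed as facets; anything else is rejected so that column
# names never come from untrusted input.
FACETS = ("company", "role", "round", "seniority", "topic", "source")


def build_query(filters, text=None):
    """Build a parameterized SELECT from facet filters and an optional text term."""
    clauses, params = [], []
    for column, value in filters.items():
        if column not in FACETS:
            raise ValueError(f"unknown facet: {column}")
        clauses.append(f'"{column}" = ?')
        params.append(value)
    if text:
        # Substring match only; a full-text index would replace this.
        clauses.append("body LIKE ?")
        params.append(f"%{text}%")
    where = " AND ".join(clauses) if clauses else "1"
    return f"SELECT id, body FROM entries WHERE {where} ORDER BY id", params


conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE entries (id INTEGER PRIMARY KEY, company TEXT, role TEXT, '
    '"round" TEXT, seniority TEXT, topic TEXT, source TEXT, body TEXT)'
)
conn.executemany(
    'INSERT INTO entries (company, role, "round", seniority, topic, source, body) '
    "VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("google", "Software Engineer", "System Design", "Senior",
         "System Design", "1p3a", "Design a rate limiter for a public API."),
        ("google", "Software Engineer", "Phone Screen", "New Grad",
         "Graphs", "blind", "Count islands in a grid."),
        ("meta", "Software Engineer", "System Design", "Senior",
         "System Design", "reddit", "Design a news feed."),
    ],
)

sql, params = build_query(
    {"company": "google", "round": "System Design", "seniority": "Senior"},
    text="rate limiter",
)
rows = conn.execute(sql, params).fetchall()
print(rows)
```

Values always travel as bound parameters and column names come from a fixed allowlist, which keeps user-supplied filters out of the SQL string itself.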

Questions are ranked by a frequency signal: how often a given question appears across multiple candidate reports, weighted by recency. A question that 15 candidates reported seeing in the last 90 days ranks higher than one that appeared once in 2022.
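One common way to combine frequency with recency is to give each report a weight that decays exponentially with age and then sum the weights. The sketch below assumes that scheme and a 90-day half-life; the actual formula and constants used by LeakCode are not stated, so frequency_score and HALF_LIFE_DAYS are illustrative only.

```python
from datetime import date

HALF_LIFE_DAYS = 90  # assumed: a report loses half its weight every 90 days


def frequency_score(report_dates, today):
    """Sum one exponentially decayed weight per candidate report."""
    score = 0.0
    for reported in report_dates:
        age_days = max((today - reported).days, 0)
        score += 0.5 ** (age_days / HALF_LIFE_DAYS)
    return score


today = date(2026, 5, 1)
recent = [date(2026, 4, 1)] * 15  # 15 reports from one month ago
stale = [date(2022, 6, 1)]        # a single report from 2022

print(frequency_score(recent, today) > frequency_score(stale, today))
```

Under this scheme a question never drops to exactly zero, but a single old report contributes a negligible amount next to a cluster of recent ones.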

Access to the full database, including translated 1p3a premium content, requires a LeakCode subscription. Free users can browse company pages and see a limited preview. Paid users get full access to all 59,970 entries with no daily question limit.

Data Freshness

The pipeline runs on a daily schedule. For sources scraped daily, new entries typically appear in the database within about 24 hours of being posted to the source platform, subject to translation and audit time. Sources refreshed weekly can lag by up to 7 days.

Source	Update Frequency	Typical Lag
1p3a forum posts	Daily	24h (translation adds ~6h)
1p3a OJ catalog	Weekly	Under 7 days
LeetCode company tags	Weekly	Under 7 days
Blind	Daily	24h
Glassdoor	Daily	24h
GeeksforGeeks	Daily	24h
Reddit	Daily	24h