
Stripe Machine Learning Engineer Onsite Coding Questions

3+ questions from real Stripe Machine Learning Engineer Onsite Coding rounds, reported by candidates who interviewed there.

3 questions · 2 topic areas · 10+ sources

What does the Stripe Onsite Coding round test?

The Stripe onsite coding round is the core technical evaluation. Machine Learning Engineer candidates typically see 2-3 algorithm and data structure problems. Problems range from medium to hard difficulty, and interviewers evaluate both correctness and code quality.


Stripe Machine Learning Engineer Onsite Coding Questions

## Problem

You are given a machine learning training script with several embedded bugs. Your task is to identify and fix them.

**Bug hunt — find at least 4 issues in this pseudocode:**

```python
def train(model, X_train, y_train, X_test, y_test):
    # Bug 1: Normalization uses test stats
    mean = X_test.mean(axis=0)
    std = X_test.std(axis=0)
    X_train = (X_train - mean) / std
    X_test = (X_test - mean) / std

    # Bug 2: Shuffle happens after train/test split (data already split above)
    shuffle_in_place(X_train, y_train)  # should be before split

    for epoch in range(100):
        loss = model.forward(X_train, y_train)
        # Bug 3: Gradient not zeroed before backward
        model.backward(loss)
        model.step(lr=0.01)

        # Bug 4: Evaluating on training data, not test data
        acc = model.evaluate(X_train, y_train)
        print(f"Epoch {epoch}: acc={acc}")

    return model
```

## Follow-ups

1. Why does normalizing with test statistics cause data leakage?
2. What is the effect of not zeroing gradients, and in which framework (PyTorch vs. TensorFlow) does this matter most?
3. How would you structure a training loop to prevent these classes of bugs systematically? (A corrected sketch follows below.)
4. What automated checks (e.g., assertions, dataset auditing) would you add before training starts?
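For follow-up 3, here is a minimal corrected sketch, assuming a PyTorch classification model (an `nn.Module`), float feature tensors, and integer label tensors. The pseudocode's `model.forward/backward/step` API is replaced with standard PyTorch loss and optimizer calls; the full-batch loop and hyperparameters are illustrative, not the interviewer's reference solution.

```python
import torch
from torch import nn

def train(model, X_train, y_train, X_test, y_test, epochs=100, lr=0.01):
    # Fix 1: compute normalization statistics from the training split only,
    # then apply the same transform to the test split (no leakage).
    mean = X_train.mean(dim=0)
    std = X_train.std(dim=0) + 1e-8   # avoid division by zero
    X_train = (X_train - mean) / std
    X_test = (X_test - mean) / std

    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)

    for epoch in range(epochs):
        # Fix 2: reshuffle training examples each epoch; the split itself
        # should already have been made from shuffled data upstream.
        perm = torch.randperm(X_train.shape[0])
        X_shuf, y_shuf = X_train[perm], y_train[perm]

        model.train()
        optimizer.zero_grad()           # Fix 3: clear accumulated gradients
        loss = loss_fn(model(X_shuf), y_shuf)
        loss.backward()
        optimizer.step()

        # Fix 4: report accuracy on held-out data, not the training split.
        model.eval()
        with torch.no_grad():
            preds = model(X_test).argmax(dim=1)
            acc = (preds == y_test).float().mean().item()
        print(f"Epoch {epoch}: test_acc={acc:.3f}")

    return model
```

Keeping normalization, shuffling, and evaluation in clearly separated steps (and wrapping evaluation in `eval()` / `no_grad()`) is what makes each class of bug hard to reintroduce silently.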

## Round 1 - System Design

## Problem

You have trained a recommendation model (collaborative filtering, ~500ms inference time). Design the integration layer that serves this model as part of a production API handling 10,000 requests per second.

**Constraints:**

- p99 latency target: 200ms end-to-end.
- Model is updated daily with a full retrain.
- Fallback required if the model is unavailable.

## Key Design Decisions

**Serving Infrastructure**

- Model server options: TorchServe, Triton, custom FastAPI. Trade-offs?
- How do you handle the 500ms inference time given a 200ms latency budget?

**Caching**

- Pre-compute recommendations for the top 10% most active users.
- Cache invalidation on model update.
- Cache key design: user_id + context hash (device, time-of-day bucket).

**Fallback Strategy**

- Popularity-based fallback when the model is unreachable.
- Circuit breaker pattern to avoid cascading failures.

**Model Rollout**

- Shadow mode: the new model runs alongside the old; compare outputs before full cutover.
- Canary: route 5% of traffic to the new model and monitor click-through rate before promoting.

A sketch of the cache-plus-fallback read path appears after the follow-ups.

## Follow-ups

1. How do you detect model degradation in production without real-time labeled ground truth?
2. What monitoring signals alert you to a regression caused by a model update?
3. How would you A/B test two models while controlling for novelty effects?
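A minimal sketch of the read path (cache lookup, model call behind a circuit breaker, popularity fallback). The `cache`, `model_client`, and `popularity_client` objects are hypothetical interfaces, and the timeout, threshold, and TTL values are illustrative assumptions rather than a prescribed design.

```python
import hashlib
import time

class RecommendationService:
    """Read path: cache -> model server (circuit-breaker protected) -> popularity fallback."""

    def __init__(self, cache, model_client, popularity_client,
                 failure_threshold=5, cooldown_s=30):
        self.cache = cache                       # hypothetical key-value client
        self.model_client = model_client         # wraps the model server call
        self.popularity_client = popularity_client
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._failures = 0
        self._opened_at = 0.0

    def _cache_key(self, user_id, device, hour_bucket):
        # Key = user + coarse context, so precomputed entries stay reusable.
        ctx = hashlib.sha1(f"{device}:{hour_bucket}".encode()).hexdigest()[:8]
        return f"rec:{user_id}:{ctx}"

    def _circuit_open(self):
        return (self._failures >= self.failure_threshold
                and time.time() - self._opened_at < self.cooldown_s)

    def recommend(self, user_id, device, hour_bucket):
        key = self._cache_key(user_id, device, hour_bucket)
        cached = self.cache.get(key)
        if cached is not None:
            return cached                        # precomputed or recently served

        if self._circuit_open():
            return self.popularity_client.top_items()   # degrade gracefully

        try:
            recs = self.model_client.predict(user_id, timeout_ms=150)
            self._failures = 0
        except Exception:
            self._failures += 1
            self._opened_at = time.time()
            return self.popularity_client.top_items()

        self.cache.set(key, recs, ttl_s=3600)    # entries invalidated on model update
        return recs
```

The 150ms model timeout leaves headroom inside the 200ms p99 budget; requests that miss both the cache and the model degrade to popularity results instead of erroring, which is the fallback behavior the constraints require.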

## Round 1 - System Design

## Problem

Design a machine learning platform for a streaming service that trains, evaluates, deploys, and monitors recommendation models at scale. The platform must support multiple teams running concurrent experiments.

**Scope to address:**

- Feature store design and online/offline serving.
- Training pipeline orchestration.
- Model registry and versioning.
- Online serving with SLA guarantees.
- Experiment tracking and A/B testing framework.
- Monitoring for data drift, model drift, and pipeline failures.

## Key Components

**Feature Store**

- Offline: Hive/Spark for batch feature computation.
- Online: Redis for low-latency feature retrieval at inference time.
- Point-in-time correct joins to prevent future leakage in training.

**Training Pipeline**

- Orchestrated via Airflow or Kubeflow Pipelines.
- Triggered on new data arrival or on a schedule; artifact versioning via MLflow.

**Serving Layer**

- Model registry with staging / production / shadow slots.
- Canary deployments with automatic rollback on metric degradation.

**Monitoring**

- Feature distribution shift (KL divergence alerts); a minimal drift check appears after the follow-ups.
- Prediction distribution shift.
- Business metric tracking (CTR, watch time) correlated with model versions.

## Follow-ups

1. How do you minimize training-serving skew in the feature pipeline?
2. What is your strategy for handling cold-start users in the recommendation model?
3. How do you enforce data governance (PII scrubbing) before features reach the training pipeline?
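For the Monitoring component, here is a minimal sketch of a per-feature drift check using KL divergence over shared histogram bins. The feature, the sample data, and the 0.1 alert threshold are illustrative assumptions; in practice the threshold would be tuned per feature.

```python
import numpy as np

def feature_drift_score(train_values, live_values, bins=20, eps=1e-9):
    """KL(live || train) over a shared histogram; larger means more drift."""
    lo = min(train_values.min(), live_values.min())
    hi = max(train_values.max(), live_values.max())
    edges = np.linspace(lo, hi, bins + 1)

    p, _ = np.histogram(live_values, bins=edges)
    q, _ = np.histogram(train_values, bins=edges)
    p = (p + eps) / (p.sum() + eps * bins)   # smooth to avoid log(0)
    q = (q + eps) / (q.sum() + eps * bins)
    return float(np.sum(p * np.log(p / q)))

# Example: alert when a serving-time feature drifts away from its training distribution.
rng = np.random.default_rng(0)
train_age = rng.normal(35, 10, 100_000)   # distribution seen at training time
live_age = rng.normal(42, 12, 10_000)     # shifted distribution observed in serving
score = feature_drift_score(train_age, live_age)
if score > 0.1:                            # threshold is illustrative only
    print(f"drift alert: KL={score:.3f}")
```

The same comparison can be run per feature on a schedule, with the reference histogram snapshotted at training time so that serving-side drift is measured against exactly what the model saw.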
