Stripe Machine Learning Engineer Interview Questions
8+ questions from real Stripe Machine Learning Engineer interviews, reported by candidates.
Questions
Account Takeover Prediction System

## The Task

You need to design a machine learning system to predict the risk of "Account Takeover" (ATO) for a payments platform. ATO happens when hackers steal login credentials…
Fraud Detection System

## Task Overview

Design a machine learning system that finds fraudulent transactions on a payment platform.

### Main Focus Areas

This problem mixes two types of design: **ML System Design**…
Building a Neural Network for Tabular Data

## What to Expect

In this interview, you need to build a machine learning model. You will create a neural network that uses a table of data (rows and columns)…
Stripe phone screen experience for ML/fraud detection role
**Part 1 — Verify Transaction Data Integrity**

The objective is to establish foundational data integrity for fraud detection. The solution involves reading six distinct fields from a CSV file and verifying…
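The preview cuts off before naming the fields, but the described step is easy to sketch. Below is a minimal, hypothetical version: the six field names (`transaction_id`, `user_id`, `amount`, `currency`, `timestamp`, `merchant_id`) are stand-ins I chose for illustration, not the fields from the actual question.

```python
import csv

# Hypothetical field names; the preview does not say which six fields are used.
EXPECTED_FIELDS = ["transaction_id", "user_id", "amount",
                   "currency", "timestamp", "merchant_id"]

def load_and_verify(path):
    """Read the CSV and return (valid_rows, errors) after basic integrity checks."""
    rows, errors = [], []
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        missing = set(EXPECTED_FIELDS) - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"CSV is missing required columns: {missing}")
        for i, row in enumerate(reader):
            # Reject rows with empty fields or a non-numeric amount
            if any(not row[field].strip() for field in EXPECTED_FIELDS):
                errors.append((i, "empty field"))
                continue
            try:
                float(row["amount"])
            except ValueError:
                errors.append((i, "non-numeric amount"))
                continue
            rows.append(row)
    return rows, errors
```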
I almost gave up
I just got an offer for a PhD ML internship at Stripe this summer, and I wanted to give back to the community since Reddit helped me a lot throughout my journey. For context, last year I shared this post…
## Problem

You are given a machine learning training script with several embedded bugs. Your task is to identify and fix them.

**Bug hunt — find at least 4 issues in this pseudocode:**

```python
def train(model, X_train, y_train, X_test, y_test):
    # Bug 1: Normalization uses test stats
    mean = X_test.mean(axis=0)
    std = X_test.std(axis=0)
    X_train = (X_train - mean) / std
    X_test = (X_test - mean) / std

    # Bug 2: Shuffle happens after train/test split (data already split above)
    shuffle_in_place(X_train, y_train)  # should be before split

    for epoch in range(100):
        loss = model.forward(X_train, y_train)
        # Bug 3: Gradient not zeroed before backward
        model.backward(loss)
        model.step(lr=0.01)

        # Bug 4: Evaluating on training data, not test data
        acc = model.evaluate(X_train, y_train)
        print(f"Epoch {epoch}: acc={acc}")

    return model
```

## Follow-ups

1. Why does normalizing using test statistics cause data leakage?
2. What is the effect of not zeroing gradients — in which framework (PyTorch/TF) does this matter most?
3. How would you structure a training loop to prevent these classes of bugs systematically?
4. What automated checks (e.g., assertions, dataset auditing) would you add before training starts?
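For reference, one way the corrected loop might look. This is a sketch, not a canonical answer: it assumes NumPy arrays and the same pseudocode interface (`model.forward`, `model.backward`, `model.step`, `model.evaluate`), plus a hypothetical `model.zero_grad()` method standing in for whatever gradient-reset the framework provides.

```python
import numpy as np

def train_fixed(model, X, y, test_frac=0.2, epochs=100, lr=0.01):
    # Automated sanity checks before training starts (follow-up 4)
    assert len(X) == len(y), "feature/label count mismatch"
    assert not np.isnan(X).any(), "NaNs in features"

    # Fix for Bug 2: shuffle BEFORE splitting so the split is random
    idx = np.random.permutation(len(X))
    X, y = X[idx], y[idx]
    split = int(len(X) * (1 - test_frac))
    X_train, X_test = X[:split], X[split:]
    y_train, y_test = y[:split], y[split:]

    # Fix for Bug 1: compute normalization stats from the TRAINING set only
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0) + 1e-8  # avoid division by zero
    X_train = (X_train - mean) / std
    X_test = (X_test - mean) / std    # apply the same transform to test data

    for epoch in range(epochs):
        loss = model.forward(X_train, y_train)
        model.zero_grad()  # Fix for Bug 3: clear accumulated gradients first
        model.backward(loss)
        model.step(lr=lr)

        # Fix for Bug 4: report held-out accuracy, not training accuracy
        acc = model.evaluate(X_test, y_test)
        print(f"Epoch {epoch}: test acc={acc:.4f}")

    return model
```

Gradient zeroing matters most in PyTorch, where `.backward()` accumulates gradients by default; skipping the reset makes each update use the sum of all past gradients.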
## Round 1 - System Design

## Problem

You have trained a recommendation model (collaborative filtering, ~500ms inference time). Design the integration layer that serves this model as part of a production API handling 10,000 requests per second.

**Constraints:**

- p99 latency target: 200ms end-to-end.
- Model is updated daily with a full retrain.
- Fallback required if the model is unavailable.

## Key Design Decisions

**Serving Infrastructure**

- Model server options: TorchServe, Triton, custom FastAPI. Trade-offs?
- How do you handle the 500ms inference time given a 200ms latency budget?

**Caching**

- Pre-compute recommendations for top 10% most active users.
- Cache invalidation on model update.
- Cache key design: user_id + context hash (device, time-of-day bucket).

**Fallback Strategy**

- Popularity-based fallback when model is unreachable (see the sketch after the follow-ups).
- Circuit breaker pattern to avoid cascading failures.

**Model Rollout**

- Shadow mode: new model runs alongside old, compare outputs before full cutover.
- Canary: route 5% of traffic to new model, monitor click-through rate before promoting.

## Follow-ups

1. How do you detect model degradation in production without labeled ground truth in real time?
2. What monitoring signals alert you to a model update causing a regression?
3. How would you A/B test two models while controlling for novelty effects?
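As a rough illustration of the caching, fallback, and circuit-breaker ideas above, here is a minimal sketch of the request path. All names are hypothetical stand-ins (`cache`, `model_client`, `top_popular`), not a real serving API: the point is the ordering of cache, guarded model call, and fallback.

```python
import time

class CircuitBreaker:
    """Opens after `threshold` consecutive failures; retries after `cooldown` seconds."""
    def __init__(self, threshold=5, cooldown=30.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures, self.opened_at = 0, None

    def allow(self):
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at > self.cooldown:
            self.opened_at, self.failures = None, 0  # half-open: try again
            return True
        return False

    def record(self, ok):
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()

breaker = CircuitBreaker()

def recommend(user_id, context, cache, model_client, top_popular):
    # 1. Cache hit path: pre-computed recs keyed by user + context bucket
    key = f"{user_id}:{hash(frozenset(context.items()))}"
    cached = cache.get(key)
    if cached is not None:
        return cached

    # 2. Model path, guarded by the circuit breaker and a hard timeout
    if breaker.allow():
        try:
            recs = model_client.predict(user_id, context, timeout_ms=150)
            breaker.record(ok=True)
            cache.set(key, recs)
            return recs
        except Exception:
            breaker.record(ok=False)

    # 3. Fallback: popularity-based list, always available
    return top_popular
```

One design note: with a ~500ms model and a 200ms p99 budget, the cache is not an optimization but the primary serving path; the synchronous model call only makes sense for cache misses if it is made asynchronous (pre-compute and backfill) or the model is distilled to something faster.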
## Round 1 - System Design

## Problem

Design a machine learning platform for a streaming service that trains, evaluates, deploys, and monitors recommendation models at scale. The platform must support multiple teams running concurrent experiments.

**Scope to address:**

- Feature store design and online/offline serving.
- Training pipeline orchestration.
- Model registry and versioning.
- Online serving with SLA guarantees.
- Experiment tracking and A/B testing framework.
- Monitoring for data drift, model drift, and pipeline failures.

## Key Components

**Feature Store**

- Offline: Hive/Spark for batch feature computation.
- Online: Redis for low-latency feature retrieval at inference time.
- Point-in-time correct joins to prevent future leakage in training (see the sketch after the follow-ups).

**Training Pipeline**

- Orchestrated via Airflow or Kubeflow Pipelines.
- Triggered on new data arrival or schedule; artifact versioning via MLflow.

**Serving Layer**

- Model registry with staging / production / shadow slots.
- Canary deployments with automatic rollback on metric degradation.

**Monitoring**

- Feature distribution shift (KL divergence alerts).
- Prediction distribution shift.
- Business metric tracking (CTR, watch time) correlated to model versions.

## Follow-ups

1. How do you ensure training-serving skew is minimized in the feature pipeline?
2. What is your strategy for handling cold-start users in the recommendation model?
3. How do you enforce data governance (PII scrubbing) before features reach the training pipeline?
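To make the point-in-time join concrete, here is a small illustration (a sketch with made-up data, not the platform's actual implementation) using `pandas.merge_asof`: each training label at time `t` is joined only to the most recent feature snapshot computed at or before `t`, so feature values computed after the labeled event can never leak into training.

```python
import pandas as pd

# Labels: one row per (user, event_time) we want to train on
labels = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_time": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-01-10"]),
    "label": [0, 1, 0],
}).sort_values("event_time")

# Features: snapshots of a batch-computed feature, keyed by computation time
features = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "feature_time": pd.to_datetime(["2024-01-01", "2024-01-15",
                                    "2024-01-01", "2024-01-12"]),
    "watch_hours_30d": [3.0, 7.5, 1.0, 2.5],
}).sort_values("feature_time")

# Point-in-time join: for each label row, take the latest feature snapshot
# with feature_time <= event_time. A plain join on user_id alone could pull
# in snapshots computed AFTER the labeled event, i.e. future leakage.
train = pd.merge_asof(
    labels,
    features,
    left_on="event_time",
    right_on="feature_time",
    by="user_id",
    direction="backward",
)
print(train[["user_id", "event_time", "watch_hours_30d", "label"]])
```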