DoorDash Data Scientist Interview Questions
6+ questions from real DoorDash Data Scientist interviews, reported by candidates.
Completed last week: I got 5 SQL and 2 Python questions and successfully answered all of them, but I still didn’t move forward to the next round. No idea why—maybe the interviewer just didn’t like me?
## Round 1 - Data Science Case Study

## Problem

DoorDash is piloting a bike courier program ("Biker Dashers") in a dense urban market. You have one week of delivery data. Evaluate whether the program is performing well and recommend whether to expand, hold, or cut the program.

**Data Available:**

- `deliveries(delivery_id, dasher_id, dasher_type, pickup_time, delivery_time, distance_km, order_value, tip)`
- `dashers(dasher_id, dasher_type, city, start_date)`

**Your analysis should cover:**

1. **Efficiency metrics:** Average delivery time by dasher type (bike vs. car). Is bike faster for short distances?
2. **Coverage gap:** What fraction of deliveries are beyond typical bike range (e.g. > 3 km)?
3. **Economics:** Compare earnings-per-hour for bike vs. car dashers, including tips. Are bikers fairly compensated?
4. **Customer impact:** Is there a significant difference in delivery time variance? Do customers rate bike-delivered orders differently?

```
Sample insight:
Bike Dashers: avg 18 min, car Dashers: avg 24 min for distances < 2 km
Bike coverage: 68% of all deliveries in pilot zone are < 2 km
-> Recommendation: expand in high-density zones
```

## Follow-ups

1. What confounding variables might make bike dashers look artificially faster?
2. How would you run an A/B test to cleanly isolate the impact of bike vs. car assignment?
3. How would weather data change your analysis?
4. What guardrail metrics would you monitor to catch problems early in an expansion?
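The item-1 efficiency comparison can be sketched in plain Python. The rows below and the 2 km short-distance cutoff are hypothetical, chosen to mirror the sample insight:

```python
from datetime import datetime

# Hypothetical rows following the deliveries schema above:
# (dasher_type, pickup_time, delivery_time, distance_km)
deliveries = [
    ("bike", datetime(2024, 6, 1, 12, 0), datetime(2024, 6, 1, 12, 17), 1.4),
    ("bike", datetime(2024, 6, 1, 13, 0), datetime(2024, 6, 1, 13, 19), 1.8),
    ("car",  datetime(2024, 6, 1, 12, 5), datetime(2024, 6, 1, 12, 29), 1.2),
    ("car",  datetime(2024, 6, 1, 14, 0), datetime(2024, 6, 1, 14, 24), 1.9),
]

def avg_delivery_minutes(rows, dasher_type, max_km=2.0):
    """Average pickup-to-delivery time (minutes) for one dasher type
    on short-distance orders (distance below max_km)."""
    durations = [
        (delivered - picked_up).total_seconds() / 60
        for dtype, picked_up, delivered, km in rows
        if dtype == dasher_type and km < max_km
    ]
    return sum(durations) / len(durations)

print(avg_delivery_minutes(deliveries, "bike"))  # 18.0
print(avg_delivery_minutes(deliveries, "car"))   # 24.0
```

In an interview, the same cut would be expressed as a `GROUP BY dasher_type` with a `distance_km < 2` filter; the point is to compare like-for-like distances before concluding bikes are faster.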
## Round 1 - Data Science Case Study

## Problem

DoorDash receives a spike in customer complaints about food arriving cold. You are asked to diagnose the root cause and propose data-driven solutions.

**Data Available:**

- `deliveries(delivery_id, dasher_id, restaurant_id, pickup_time, delivery_time, food_prep_ready_at, distance_km, weather)`
- `ratings(delivery_id, overall_score, food_temp_score, comments)`
- `restaurants(restaurant_id, category, avg_prep_time_min)`

**Your analysis should cover:**

1. **Wait time at restaurant:** How long does the dasher wait after the food is ready? Longer waits = colder food at pickup.
2. **Transit time:** Is delivery distance or traffic the dominant factor?
3. **Segmentation:** Are certain restaurant categories (e.g. sushi, pizza) more affected?
4. **Correlation:** Does `food_temp_score` correlate with `(delivery_time - food_prep_ready_at)`?

```
Key metric: "food exposure time" = delivery_time - food_prep_ready_at
Hypothesis: food_temp_score drops ~0.3 points per 5 extra minutes of exposure
```

## Follow-ups

1. How would you distinguish "cold at pickup" vs. "cooled during transit"?
2. What operational levers does DoorDash have to reduce cold food? Which would you prioritize?
3. How would you design an experiment to test whether insulated bags reduce cold complaints?
4. What metric would you track as a leading indicator before customers complain?
## Round 1 - SQL

## Problem

You have the following DoorDash schema:

```sql
CREATE TABLE customers (customer_id INT, name VARCHAR, city VARCHAR, joined_date DATE);
CREATE TABLE orders (order_id INT, customer_id INT, restaurant_id INT, order_total DECIMAL, placed_at TIMESTAMP);
```

**Task 1:** Find all customers who placed orders on at least 5 distinct days in the past 30 days.

**Task 2:** For each city, find the customer with the highest number of orders this month. Break ties by customer_id ascending.

**Task 3:** Find customers who ordered from the same restaurant at least 3 times. Return customer_id, restaurant_id, and order_count.

```sql
-- Task 1 skeleton:
SELECT customer_id
FROM orders
WHERE placed_at >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY customer_id
HAVING COUNT(DISTINCT DATE(placed_at)) >= 5;
```

```sql
-- Task 2 skeleton:
WITH ranked AS (
  SELECT c.city,
         o.customer_id,
         COUNT(*) AS order_count,
         RANK() OVER (PARTITION BY c.city
                      ORDER BY COUNT(*) DESC, o.customer_id ASC) AS rnk
  FROM orders o
  JOIN customers c USING (customer_id)
  WHERE DATE_TRUNC('month', o.placed_at) = DATE_TRUNC('month', CURRENT_DATE)
  GROUP BY c.city, o.customer_id
)
SELECT city, customer_id, order_count
FROM ranked
WHERE rnk = 1;
```

## Follow-ups

1. In Task 1, why use `COUNT(DISTINCT DATE(placed_at))` instead of `COUNT(*)`?
2. In Task 2, what is the difference between `RANK()` and `ROW_NUMBER()` for tie-breaking?
3. For Task 3, how would you find customers who ordered from the same restaurant on consecutive days?
4. How would you optimize these queries on a 500M-row orders table?
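Task 3 has no skeleton above. One possible solution, verified here against an in-memory SQLite copy of the `orders` table with hypothetical rows (the grouping logic carries over to Postgres unchanged):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INT, customer_id INT, restaurant_id INT,
                     order_total DECIMAL, placed_at TIMESTAMP);
INSERT INTO orders VALUES
  (1, 10, 100, 25.0, '2024-06-01 12:00:00'),
  (2, 10, 100, 18.0, '2024-06-02 12:00:00'),
  (3, 10, 100, 22.0, '2024-06-03 12:00:00'),
  (4, 11, 100, 30.0, '2024-06-01 19:00:00');
""")

# Task 3: customers who ordered from the same restaurant at least 3 times.
rows = conn.execute("""
    SELECT customer_id, restaurant_id, COUNT(*) AS order_count
    FROM orders
    GROUP BY customer_id, restaurant_id
    HAVING COUNT(*) >= 3
""").fetchall()

print(rows)  # [(10, 100, 3)]
```

Grouping on the `(customer_id, restaurant_id)` pair is the whole trick; customer 11's single order falls out at the `HAVING` filter.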
## Round 1 - Data Science Case Study

## Problem

DoorDash's order volume dropped 15% overnight in one metro area. You are the analyst on call. Walk through your full investigation.

**Data Available:**

- `orders(order_id, customer_id, restaurant_id, dasher_id, placed_at, status, order_total)`
- `restaurants(restaurant_id, name, category, is_active, city)`
- `app_events(event_id, user_id, event_type, platform, occurred_at)`

**Step 1 - Scope the drop:**

- Is the drop on all platforms or just iOS/Android/web?
- Is it all restaurant categories or specific ones?
- Is it all customer cohorts (new, returning) or one?

**Step 2 - Check data integrity:**

- Is the drop real or a logging failure? Check whether `app_events` also dropped.

**Step 3 - Hypothesize causes:**

- Competitor promotion launched
- App deployment introduced a checkout bug
- Major restaurant chain went offline
- Extreme weather event

**Step 4 - Validate and quantify:**

```sql
SELECT restaurant_id, COUNT(*) AS orders
FROM orders
WHERE placed_at BETWEEN '2024-06-01' AND '2024-06-02'
GROUP BY restaurant_id
ORDER BY orders;
```

## Follow-ups

1. How would you determine if the drop is seasonal vs. an anomaly?
2. What would you look at if order volume is fine but revenue dropped?
3. How do you communicate your findings to stakeholders before you have a root cause?
4. After identifying the cause, what process do you put in place to catch this faster next time?
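The platform cut in Step 1 can be sketched as a simple day-over-day comparison. The counts and the 10% alert threshold below are hypothetical:

```python
# Hypothetical order counts by platform: baseline day vs. the drop day.
baseline = {"ios": 10000, "android": 9000, "web": 3000}
drop_day = {"ios": 9900, "android": 5400, "web": 2950}

# Flag any platform whose volume fell more than 10% day-over-day.
for platform in baseline:
    change = (drop_day[platform] - baseline[platform]) / baseline[platform]
    flag = "  <-- investigate" if change < -0.10 else ""
    print(f"{platform:8s} {change:+.1%}{flag}")
```

If the overall 15% drop resolves to one platform (here, Android at -40%), the investigation narrows immediately to that platform's latest app release or its event logging, per Steps 2 and 3.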
## Round 1 - Data Science Case Study

## Problem

Some DoorDash restaurants are accepting more orders than they can fulfill on time, leading to late deliveries and cancellations. Design a system to detect overwhelmed merchants and intervene.

**Data Available:**

- `orders(order_id, restaurant_id, placed_at, accepted_at, ready_at, delivered_at, status)`
- `restaurants(restaurant_id, avg_prep_time_min, max_concurrent_orders)`

**Part A - Detection Metric:** Define a real-time "overwhelm score" per restaurant. Consider:

- Current active order count vs. `max_concurrent_orders`
- Running average actual prep time vs. `avg_prep_time_min` over the last 30 minutes
- Cancellation rate in the last hour

**Part B - Intervention Decision:** At what threshold do you temporarily pause new orders to a restaurant? How do you avoid false positives that hurt restaurant revenue?

**Part C - SQL:** Find restaurants where actual prep time exceeded estimated prep time by more than 50% in the last 2 hours, with at least 5 orders.

```sql
SELECT o.restaurant_id,
       AVG(EXTRACT(EPOCH FROM (o.ready_at - o.accepted_at)) / 60) AS avg_actual_prep,
       COUNT(*) AS order_count
FROM orders o
JOIN restaurants r ON r.restaurant_id = o.restaurant_id
-- Prep time is measured from acceptance to ready; placed_at would also count acceptance lag.
WHERE o.placed_at >= NOW() - INTERVAL '2 hours'
  AND o.accepted_at IS NOT NULL
  AND o.ready_at IS NOT NULL
GROUP BY o.restaurant_id, r.avg_prep_time_min
HAVING COUNT(*) >= 5
   AND AVG(EXTRACT(EPOCH FROM (o.ready_at - o.accepted_at)) / 60) > 1.5 * r.avg_prep_time_min;
```

## Follow-ups

1. How do you distinguish a permanently slow restaurant from one having a one-off bad hour?
2. What notification do you send to the restaurant, and what action can they take?
3. How would you A/B test whether pausing orders actually improves customer satisfaction?
4. What customer-facing message do you show when a restaurant is temporarily paused?
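One way to fold the three Part A signals into a single score, as a minimal sketch. The weights, the capping at 2x, and the example numbers are illustrative assumptions, not an actual DoorDash formula:

```python
def overwhelm_score(active_orders, max_concurrent,
                    actual_prep_min, estimated_prep_min,
                    cancel_rate_last_hour,
                    weights=(0.4, 0.4, 0.2)):
    """Weighted combination of the three Part A signals.
    Each ratio is capped at 2.0 so one runaway signal cannot
    dominate the score; weights are illustrative."""
    load = min(active_orders / max_concurrent, 2.0)
    prep = min(actual_prep_min / estimated_prep_min, 2.0)
    w_load, w_prep, w_cancel = weights
    return w_load * load + w_prep * prep + w_cancel * cancel_rate_last_hour

# A healthy restaurant vs. one running hot on every signal.
print(round(overwhelm_score(4, 10, 12, 12, 0.02), 2))   # 0.56
print(round(overwhelm_score(14, 10, 25, 12, 0.30), 2))  # 1.42
```

A pause threshold (Part B) would then be a cutoff on this score, tuned so that the false-positive rate on historically healthy restaurants stays acceptably low, e.g. requiring the score to stay elevated for several consecutive windows before pausing.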