Square/Block

Square/Block Data Scientist Interview Questions

7+ questions from real Square/Block Data Scientist interviews, reported by candidates.

- 7 Questions
- 1 Round Type
- 4 Topic Areas
- Year Range: 2022

Round Types

- Onsite (1)

Questions

Q1. How do you control leaf size and pruning in a decision tree?

Ans: To control leaf size, we can set the following parameters:

1. Maximum depth: maximum tree depth is a limit to...
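These controls map naturally onto scikit-learn's `DecisionTreeClassifier`; the answer above names no specific library, so the sketch below is one illustrative way to apply them:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# max_depth limits how deep the tree can grow,
# min_samples_leaf sets a floor on leaf size, and
# ccp_alpha enables cost-complexity (post-)pruning.
clf = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, ccp_alpha=0.01)
clf.fit(X, y)
# clf.get_depth() is now bounded above by max_depth=3
```

Tuning `ccp_alpha` via `clf.cost_complexity_pruning_path(X, y)` plus cross-validation is the usual way to pick a pruning strength rather than guessing.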

## Problem

A bookstore sells a series of 8 books. Each book costs $8. If you buy `k` distinct books in one purchase, you get a discount:

- 1 book: 0%
- 2 books: 5%
- 3 books: 10%
- 4 books: 20%
- 5 books: 25%

Given a list of books (by index 1-8) a customer wants to buy (with repetition), compute the minimum total cost.

```python
def min_cost(books: list[int]) -> float:
    # books: list of book indices, may repeat
    pass
```

**Example:**

```
books = [1, 1, 2, 2, 3, 3, 4, 5]
output -> 51.20  # Two groups of 4 is cheaper than one group of 5 + one of 3
```

## Approach

Greedy with frequency counts tends to over-optimize; dynamic programming on the count vector is more reliable. Count the frequency of each title, then use DP over possible group sizes.

## Follow-ups

1. Why does a pure greedy (always take the largest possible discount group) fail on some inputs?
2. How does the complexity scale as the number of distinct titles grows to 20?
3. Modify the discount table to be non-monotonic -- does that break your approach?
4. A customer has a budget cap B. Find the maximum number of books they can buy.
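The DP on the count vector can be sketched as follows. The discount table above stops at 5, so this sketch assumes groups are capped at 5 distinct titles; when forming a group of size `k`, drawing from the `k` most frequent remaining titles is sufficient to reach the optimum.

```python
from collections import Counter
from functools import lru_cache

# Discount per group size (sizes above 5 are not listed in the problem,
# so groups are assumed capped at 5 distinct titles).
DISCOUNT = {1: 0.0, 2: 0.05, 3: 0.10, 4: 0.20, 5: 0.25}
PRICE = 8.0

def min_cost(books: list[int]) -> float:
    counts = tuple(sorted(Counter(books).values(), reverse=True))

    @lru_cache(maxsize=None)
    def solve(state: tuple[int, ...]) -> float:
        # Normalize: drop exhausted titles, sort descending so that
        # equivalent multisets of counts share one cache entry.
        state = tuple(sorted((c for c in state if c > 0), reverse=True))
        if not state:
            return 0.0
        best = float("inf")
        # Try every group size k, taking one copy from each of the
        # k most frequent remaining titles.
        for k in range(1, min(len(state), 5) + 1):
            rest = tuple(c - 1 if i < k else c for i, c in enumerate(state))
            cost = k * PRICE * (1 - DISCOUNT[k]) + solve(rest)
            best = min(best, cost)
        return best

    return round(solve(counts), 2)
```

On the example, `min_cost([1, 1, 2, 2, 3, 3, 4, 5])` finds the two-groups-of-4 split (2 × $25.60 = $51.20), beating the 5+3 split ($30.00 + $21.60 = $51.60).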

## Problem

You have a shelter database with the following tables:

```sql
-- cats(id, name, breed, intake_date, adopted_date)
--   adopted_date NULL if not yet adopted
-- breeds(id, name, origin_country, hypoallergenic BOOLEAN)
```

**Query 1:** List each breed and the number of cats adopted, ordered by adoptions descending.

```sql
SELECT b.name AS breed,
       COUNT(c.id) AS adopted_count
FROM breeds b
LEFT JOIN cats c
       ON c.breed = b.name
      AND c.adopted_date IS NOT NULL
GROUP BY b.name
ORDER BY adopted_count DESC;
```

**Query 2:** Find the cat that stayed in the shelter the longest before being adopted.

```sql
SELECT name,
       DATEDIFF(adopted_date, intake_date) AS days_stayed
FROM cats
WHERE adopted_date IS NOT NULL
ORDER BY days_stayed DESC
LIMIT 1;
```

## Follow-ups

1. Rewrite Query 2 to also return cats that are still in the shelter, computing days from `intake_date` to today.
2. Find breeds where more than 50% of intakes have been adopted.
3. Write a query to detect duplicate cat names within the same breed.
4. How would you index this schema to make both queries fast at 10 million rows?
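For comparison, Query 1 can be expressed in pandas. The sketch below uses small hypothetical frames standing in for the tables (the schema, but not this data, comes from the problem); the left merge plus `count()` reproduces the LEFT JOIN semantics, keeping breeds with zero adoptions.

```python
import pandas as pd

# Hypothetical stand-ins for the breeds and cats tables.
breeds = pd.DataFrame({"name": ["Siamese", "Persian", "Sphynx"]})
cats = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["Mia", "Tom", "Zoe", "Rex"],
    "breed": ["Siamese", "Siamese", "Persian", "Sphynx"],
    "adopted_date": ["2022-01-05", None, "2022-03-01", None],
})

# Pre-filter to adopted cats, mirroring the extra join condition.
adopted = cats[cats["adopted_date"].notna()]
counts = (
    breeds.rename(columns={"name": "breed"})
    .merge(adopted, on="breed", how="left")
    .groupby("breed")["id"]
    .count()  # count() skips the NaN ids from unmatched breeds -> 0
    .rename("adopted_count")
    .sort_values(ascending=False)
)
```

Filtering before the merge matters: filtering after a plain join would silently drop breeds with no adoptions, the same bug as moving `adopted_date IS NOT NULL` from the `ON` clause to `WHERE` in the SQL.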

## Problem

You have two tables:

```sql
-- employees(id, name, department_id, salary, hire_date, manager_id)
-- departments(id, name, location)
```

**Query 1:** For each department, return the department name, headcount, average salary (rounded to 2 decimals), and the highest-paid employee's name.

```sql
SELECT d.name AS department,
       COUNT(e.id) AS headcount,
       ROUND(AVG(e.salary), 2) AS avg_salary,
       MAX(e.name) KEEP (DENSE_RANK FIRST ORDER BY e.salary DESC) AS top_earner
FROM departments d
LEFT JOIN employees e ON e.department_id = d.id
GROUP BY d.name;
```

**Query 2:** Find employees who earn more than their direct manager.

```sql
SELECT e.name AS employee,
       m.name AS manager
FROM employees e
JOIN employees m ON e.manager_id = m.id
WHERE e.salary > m.salary;
```

## Follow-ups

1. Rewrite Query 1 in standard SQL (no vendor-specific `KEEP ... DENSE_RANK`).
2. Departments with no employees should still appear -- is your join correct?
3. Add a column showing each employee's salary percentile within their department.
4. How would you index these tables to make both queries efficient at 5 million rows?
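The self-join in Query 2 translates directly to a pandas self-merge. The frame below is hypothetical sample data, not from the problem:

```python
import pandas as pd

# Hypothetical employees table: Ann manages Bob, Bob manages Cal.
employees = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Ann", "Bob", "Cal"],
    "salary": [120, 90, 100],
    "manager_id": [None, 1, 2],
})

# Inner-merge each employee row onto its manager's row, mirroring
# JOIN employees m ON e.manager_id = m.id.
with_mgr = employees.dropna(subset=["manager_id"])
merged = with_mgr.merge(
    employees, left_on="manager_id", right_on="id", suffixes=("", "_mgr")
)
result = merged.loc[
    merged["salary"] > merged["salary_mgr"], ["name", "name_mgr"]
]
```

Here only Cal (100) out-earns his manager Bob (90), so `result` holds the single pair Cal/Bob. Dropping the `NaN` `manager_id` rows first mirrors the inner join excluding the top-level manager.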

## Problem

You start with `n` coins all showing tails (0). You perform `k` operations. Operation `i` flips every coin at a position that is a multiple of `i` (1-indexed). After all operations, how many coins show heads?

```python
def count_heads(n: int, k: int) -> int:
    pass
```

**Example:**

```
n=6, k=3
# Op 1: flip multiples of 1 -> [H,H,H,H,H,H]
# Op 2: flip multiples of 2 -> [H,T,H,T,H,T]
# Op 3: flip multiples of 3 -> [H,T,T,T,H,H]
output -> 3
```

## Follow-ups

1. For `k >= n`, a coin at position `p` is flipped by every divisor of `p` up to `k`. Can you derive a closed-form count without simulating?
2. A coin ends heads if it is flipped an odd number of times. Which numbers have an odd number of divisors? (Perfect squares -- use this insight.)
3. Extend: each operation has a probability `p_i` of actually executing. Compute the expected number of heads.
4. What if flipping toggles between 3 states (tails, heads, edge) instead of 2?
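A direct simulation settles the general case, and follow-ups 1-2 give the closed form for `k >= n`: a coin ends heads iff its position has an odd number of divisors, which is true exactly for perfect squares, so the count is `floor(sqrt(n))`.

```python
import math

def count_heads(n: int, k: int) -> int:
    """Simulate the k flip operations over n coins."""
    coins = [0] * (n + 1)  # 1-indexed; 0 = tails, 1 = heads
    for i in range(1, k + 1):
        for p in range(i, n + 1, i):
            coins[p] ^= 1  # flip every multiple of i
    return sum(coins)

def count_heads_closed(n: int) -> int:
    """Closed form for k >= n: only perfect squares have an odd
    number of divisors, so exactly floor(sqrt(n)) coins end heads."""
    return math.isqrt(n)
```

For the example, `count_heads(6, 3)` returns 3, and for `n = k = 10` both functions agree on 3 (positions 1, 4, 9).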

## Problem

You have these tables:

```sql
-- invoices(id, client_id, amount, issued_date, due_date, paid_date)
--   paid_date NULL if unpaid
-- clients(id, name, country)
```

**Query 1:** List all overdue invoices (due date passed, not paid) with client name, amount, and days overdue.

```sql
SELECT c.name AS client,
       i.amount,
       DATEDIFF(CURRENT_DATE, i.due_date) AS days_overdue
FROM invoices i
JOIN clients c ON i.client_id = c.id
WHERE i.paid_date IS NULL
  AND i.due_date < CURRENT_DATE
ORDER BY days_overdue DESC;
```

**Query 2:** For each country, return total outstanding balance and number of overdue invoices.

```sql
SELECT c.country,
       COUNT(*) AS overdue_count,
       SUM(i.amount) AS outstanding_balance
FROM invoices i
JOIN clients c ON i.client_id = c.id
WHERE i.paid_date IS NULL
  AND i.due_date < CURRENT_DATE
GROUP BY c.country
ORDER BY outstanding_balance DESC;
```

## Follow-ups

1. A client partially pays an invoice (payments table). Rewrite Query 1 to show remaining balance.
2. Add a 30/60/90-day aging bucket column to Query 1.
3. How would you set up a scheduled job to email overdue invoice reports?
4. Write the equivalent of Query 1 in Python pandas without SQL.
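For follow-up 4, one way to express Query 1 in pandas, with small hypothetical frames standing in for the tables and a pinned "today" so the example is deterministic:

```python
import pandas as pd

today = pd.Timestamp("2022-06-01")  # stands in for CURRENT_DATE

# Hypothetical stand-ins for the invoices and clients tables.
invoices = pd.DataFrame({
    "id": [1, 2, 3],
    "client_id": [10, 10, 20],
    "amount": [500.0, 250.0, 75.0],
    "due_date": pd.to_datetime(["2022-05-01", "2022-07-01", "2022-05-20"]),
    "paid_date": pd.to_datetime(["2022-05-02", None, None]),
})
clients = pd.DataFrame({"id": [10, 20], "name": ["Acme", "Globex"]})

# WHERE paid_date IS NULL AND due_date < CURRENT_DATE
overdue = invoices[invoices["paid_date"].isna() & (invoices["due_date"] < today)]
report = (
    overdue.merge(clients, left_on="client_id", right_on="id")
    .assign(days_overdue=lambda df: (today - df["due_date"]).dt.days)
    .loc[:, ["name", "amount", "days_overdue"]]
    .sort_values("days_overdue", ascending=False)
)
```

With this data, invoice 1 is paid and invoice 2 is not yet due, so the report contains only Globex's $75 invoice at 12 days overdue.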

## Problem

You have a payments table:

```sql
-- payments(id, user_id, amount, currency, status, created_at)
-- status: 'success' | 'failed' | 'pending'
```

**Query 1:** Daily transaction volume and failure rate for the last 30 days.

```sql
SELECT DATE(created_at) AS day,
       COUNT(*) AS total,
       SUM(CASE WHEN status = 'failed' THEN 1 ELSE 0 END) AS failures,
       ROUND(100.0 * SUM(CASE WHEN status = 'failed' THEN 1 ELSE 0 END)
             / COUNT(*), 2) AS failure_pct
FROM payments
WHERE created_at >= CURRENT_DATE - INTERVAL 30 DAY
GROUP BY DATE(created_at)
ORDER BY day;
```

**Query 2:** 7-day rolling average of daily successful payment amounts.

```sql
SELECT day,
       daily_total,
       AVG(daily_total) OVER (
           ORDER BY day
           ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
       ) AS rolling_7d_avg
FROM (
    SELECT DATE(created_at) AS day,
           SUM(amount) AS daily_total
    FROM payments
    WHERE status = 'success'
    GROUP BY DATE(created_at)
) t;
```

## Follow-ups

1. Currency conversion needed -- add an exchange-rate lookup table and normalize to USD.
2. A spike in failure rate may indicate fraud. How would you write an alert query?
3. Rewrite the rolling average in Python pandas.
4. How do you handle timezone differences when grouping by day?
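For follow-up 3, the rolling average maps onto `Series.rolling`. The frame below is hypothetical sample data; `min_periods=1` mirrors the SQL window, which averages over however many rows are available at the start of the series.

```python
import pandas as pd

# Hypothetical payments data matching the schema above.
payments = pd.DataFrame({
    "amount": [100.0, 50.0, 200.0, 80.0],
    "status": ["success", "failed", "success", "success"],
    "created_at": pd.to_datetime([
        "2022-01-01 09:00", "2022-01-01 10:00",
        "2022-01-02 12:00", "2022-01-03 08:00",
    ]),
})

# Inner query: daily totals of successful payments.
succ = payments[payments["status"] == "success"]
daily = (
    succ.groupby(succ["created_at"].dt.date)["amount"]
    .sum()
    .rename("daily_total")
    .to_frame()
)
# Outer query: 7-row rolling mean (same caveat as the SQL ROWS window:
# days with no payments are simply absent, not counted as zero).
daily["rolling_7d_avg"] = daily["daily_total"].rolling(7, min_periods=1).mean()
```

Note that both this and the SQL `ROWS BETWEEN 6 PRECEDING` version roll over *rows*, not calendar days; if gap days should count, reindex onto a complete `pd.date_range` first (or use a calendar join in SQL).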
