InterviewDB Question

Data Splitting: Implement Stratified Train/Validation/Test Split for Imbalanced Datasets

Question Details

Problem

You are building a training pipeline for an ML model. Given a dataset of labeled samples, implement a stratified split that preserves class proportions in each resulting partition.

python
def stratified_split(
    data: list[dict],
    label_key: str,
    train_ratio: float = 0.7,
    val_ratio: float = 0.15,
    test_ratio: float = 0.15,
    seed: int = 42
) -> tuple[list[dict], list[dict], list[dict]]:
    """Returns (train, val, test)."""
    pass

Example:

data = [{"x": 1, "y": 0}, {"x": 2, "y": 1}, {"x": 3, "y": 0},
        {"x": 4, "y": 1}, {"x": 5, "y": 0}, {"x": 6, "y": 1}]
# 3 class-0, 3 class-1
# train=4 (2 per class), val=1, test=1; the one-sample splits cannot be 50/50
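
One possible approach to the signature above, offered as a minimal sketch rather than a reference solution: group samples by label, shuffle each group with a seeded RNG, give every split the floor of its per-class quota, then place the few leftover samples so that overall split sizes also match the ratios. The helper `_largest_remainder`, the constant `EPS`, and the tie-breaking rule are assumptions introduced for illustration; they are not part of the original question.

```python
import math
import random
from collections import defaultdict

EPS = 1e-9  # assumption: guards floor() against float error, e.g. 0.15 * 20 -> 2.999...


def _largest_remainder(n: int, ratios: list[float]) -> list[int]:
    """Hypothetical helper: integer counts proportional to ratios, summing to n."""
    quotas = [n * r for r in ratios]
    counts = [math.floor(q + EPS) for q in quotas]
    # Hand the leftover units to the largest fractional parts (stable on ties).
    order = sorted(range(len(ratios)), key=lambda i: quotas[i] - counts[i], reverse=True)
    for i in order[: n - sum(counts)]:
        counts[i] += 1
    return counts


def stratified_split(
    data: list[dict],
    label_key: str,
    train_ratio: float = 0.7,
    val_ratio: float = 0.15,
    test_ratio: float = 0.15,
    seed: int = 42,
) -> tuple[list[dict], list[dict], list[dict]]:
    """Returns (train, val, test)."""
    ratios = [train_ratio, val_ratio, test_ratio]
    if min(ratios) < 0 or abs(sum(ratios) - 1.0) > 1e-6:
        raise ValueError("ratios must be non-negative and sum to 1")

    rng = random.Random(seed)  # local RNG: reproducible, global state untouched
    by_class = defaultdict(list)
    for sample in data:
        by_class[sample[label_key]].append(sample)

    capacity = _largest_remainder(len(data), ratios)  # open slots per split
    splits = ([], [], [])
    pending = []  # per class: (fractional remainders, samples not yet placed)

    # Pass 1: each class sends floor(n_c * ratio) shuffled samples to each split.
    for group in by_class.values():
        rng.shuffle(group)
        quotas = [len(group) * r for r in ratios]
        counts = [math.floor(q + EPS) for q in quotas]
        start = 0
        for i, c in enumerate(counts):
            splits[i].extend(group[start:start + c])
            capacity[i] -= c
            start += c
        pending.append(([q - c for q, c in zip(quotas, counts)], group[start:]))

    # Pass 2: place leftovers in a split that still has open slots, preferring
    # the split where this class is furthest below its quota.
    for fracs, rest in pending:
        for sample in rest:
            candidates = [i for i in range(3) if capacity[i] > 0] or [0, 1, 2]
            i = max(candidates, key=lambda j: fracs[j])
            splits[i].append(sample)
            capacity[i] -= 1
            fracs[i] -= 1.0  # avoid sending two leftovers of one class to the same split

    for part in splits:
        rng.shuffle(part)  # avoid class-ordered output
    return splits
```

On the six-sample example this yields 4 training samples (2 per class) and one sample each in val and test. Under this sketch, a class with fewer samples than there are splits is simply absent from some splits; whether to raise an error, warn, or keep such classes in train only is a design choice the question leaves open.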

Follow-ups

  1. Why is stratified splitting especially important for datasets with class imbalance (e.g., 95% negative, 5% positive)?
  2. What is the difference between stratified splitting and k-fold stratified cross-validation?
  3. If a class has only 2 samples total and you need a 3-way split, how do you handle it gracefully?
  4. How do you verify that the split is actually stratified, and what metric do you check per split?
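
For the verification follow-up, one reasonable check (a sketch, not the only valid metric) is to compare each split's class proportions against the full dataset and report the largest absolute gap. The function names `class_proportions` and `max_proportion_gap` are hypothetical and chosen here for illustration.

```python
from collections import Counter


def class_proportions(samples: list[dict], label_key: str) -> dict:
    """Share of each label among the samples (empty dict for no samples)."""
    counts = Counter(s[label_key] for s in samples)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}


def max_proportion_gap(data: list[dict], split: list[dict], label_key: str) -> float:
    """Largest absolute difference between a split's class share and the overall share."""
    overall = class_proportions(data, label_key)
    observed = class_proportions(split, label_key)
    # A class missing from the split counts as a share of 0.0.
    return max(
        (abs(observed.get(label, 0.0) - share) for label, share in overall.items()),
        default=0.0,
    )
```

A gap near zero for every split suggests the stratification held. On small or heavily imbalanced data some gap is unavoidable because counts are integers, so per-class counts are also worth checking, in particular that each split holds at least one sample of every class.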

Topics

MLE Phone