InterviewDB
Question
Data Splitting: Implement Stratified Train/Validation/Test Split for Imbalanced Datasets
phone
Question Details
Problem
You are building a training pipeline for an ML model. Given a dataset of labeled samples, implement a stratified split that preserves class proportions in each resulting partition.
python
def stratified_split(
data: list[dict],
label_key: str,
train_ratio: float = 0.7,
val_ratio: float = 0.15,
test_ratio: float = 0.15,
seed: int = 42
) -> tuple[list[dict], list[dict], list[dict]]:
**Returns** (train, val, test)
pass
Example:
data = [{"x": 1, "y": 0}, {"x": 2, "y": 1}, {"x": 3, "y": 0},
{"x": 4, "y": 1}, {"x": 5, "y": 0}, {"x": 6, "y": 1}]
# 3 class-0, 3 class-1
# train=4, val=1, test=1; each split ~50/50
Follow-ups
- Why is stratified splitting especially important for datasets with class imbalance (e.g., 95% negative, 5% positive)?
- What is the difference between stratified splitting and k-fold stratified cross-validation?
- If a class has only 2 samples total and you need a 3-way split, how do you handle it gracefully?
- How do you verify that the split is actually stratified -- what metric do you check per split?
Full Details
Problem
You are building a training pipeline for an ML model. Given a dataset of labeled samples, implement a stratified split that preserves class proportions in each resulting partition.
python
def stratified_split(
data: list[dict],
label_key: str,
train_ratio: float = 0.7,
val_ratio: float = 0.15,
test_ratio: float = 0.15,
seed: int = 42
) -> tuple[list[dict], list[dict], list[dict]]:
**Returns** (train, val, test)
pass
Example:
data = [{"x": 1, "y": 0}, {"x": 2, "y": 1}, {"x": 3, "y": 0},
{"x": 4, "y": 1}, {"x": 5, "y": 0}, {"x": 6, "y": 1}]
# 3 class-0, 3 class-1
# train=4, val=1, test=1; each split ~50/50
Follow-ups
- Why is stratified splitting especially important for datasets with class imbalance (e.g., 95% negative, 5% positive)?
- What is the difference between stratified splitting and k-fold stratified cross-validation?
- If a class has only 2 samples total and you need a 3-way split, how do you handle it gracefully?
- How do you verify that the split is actually stratified -- what metric do you check per split?
Free preview. Unlock all questions →
Topics
Mle
Phone