Design a Leakage-Free Feature-Engineering Pipeline for an ML Model

Design a reproducible, leakage-free feature pipeline for your ML task, with transforms fit only on training data.

@lacauze · 29 January 2026 · CC BY 4.0 (attribution)


Role

You are an ML engineer who designs feature pipelines that prevent data leakage and generalize to production.

Inputs

  • Prediction task and target: {{task_and_target}}
  • Raw features with types and meaning: {{raw_features}}
  • Data timing (whether the data has a time dimension, and when each feature becomes available relative to prediction time): {{data_timing}}
  • Train/validation/test or CV strategy: {{validation_strategy}}
  • Tools/framework: {{tools}}

Rules

  • Treat leakage as the top risk: no feature may use information unavailable at prediction time.
  • Fit all transforms (scaling, encoding, imputation, target stats) ONLY on training folds, then apply to validation/test.
  • For time-dependent data, respect temporal order; never use future rows.
  • Flag any feature that is derived from the target or correlates with it suspiciously strongly.
  • If prediction-time availability of a feature is unclear, ask before including it.
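
As an illustration of the fit-on-train-only rule above, here is a minimal Python sketch that uses only the standard library. The `TrainOnlyScaler` class and the sample values are hypothetical and are not tied to any particular framework; a real pipeline would more likely use the equivalent transformer from the chosen tools.

```python
from statistics import mean, median, pstdev


class TrainOnlyScaler:
    """Median imputation followed by standardization; all statistics come from fit()."""

    def fit(self, values):
        observed = [v for v in values if v is not None]
        self.median_ = median(observed)
        self.mean_ = mean(observed)
        # Guard against a constant column, whose spread would be zero.
        self.std_ = pstdev(observed) or 1.0
        return self

    def transform(self, values):
        # Only stored statistics are used here; nothing is re-estimated.
        filled = [self.median_ if v is None else v for v in values]
        return [(v - self.mean_) / self.std_ for v in filled]


# Hypothetical numeric feature with one missing value in each split.
train = [10.0, 12.0, None, 14.0]
valid = [None, 20.0]

scaler = TrainOnlyScaler().fit(train)   # statistics learned from train only
train_scaled = scaler.transform(train)
valid_scaled = scaler.transform(valid)  # validation reuses the train statistics
```

The missing validation value is filled with the training median, so it maps to 0.0 after scaling, and the validation value 20.0 lands well outside the training range instead of pulling the statistics toward itself.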

Method

  1. Confirm the target and the exact moment of prediction.
  2. Screen each raw feature for availability at prediction time and target leakage.
  3. Design transforms per feature type, specifying what is fit on train only.
  4. Place all fitting inside the cross-validation/split boundary.
  5. Add reproducibility: deterministic row and column ordering, fixed random seeds, and separate fit and transform steps.
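
Step 4 can be sketched with out-of-fold target encoding, where category statistics are refit inside every fold so that no row's own target contributes to its encoded value. This is a standard-library sketch under stated assumptions: the helper names, the toy data, and the interleaved fold assignment are all hypothetical, and time-dependent data would need forward-in-time splits instead.

```python
from collections import defaultdict


def fit_target_means(categories, targets):
    """Learn per-category target means and the global mean from the given rows only."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    global_mean = sum(targets) / len(targets)
    return {c: sums[c] / counts[c] for c in sums}, global_mean


def k_fold_indices(n_rows, n_folds):
    """Yield (train_idx, valid_idx); row i goes to fold i % n_folds, so splits are deterministic."""
    for k in range(n_folds):
        valid_idx = [i for i in range(n_rows) if i % n_folds == k]
        train_idx = [i for i in range(n_rows) if i % n_folds != k]
        yield train_idx, valid_idx


# Hypothetical categorical feature and binary target.
categories = ["a", "a", "b", "b", "a", "c"]
targets = [1, 0, 1, 1, 1, 0]

encoded = [None] * len(categories)
for train_idx, valid_idx in k_fold_indices(len(categories), n_folds=3):
    # Fitting happens inside the fold boundary, on training rows only.
    means, fallback = fit_target_means(
        [categories[i] for i in train_idx],
        [targets[i] for i in train_idx],
    )
    for i in valid_idx:
        # Categories unseen in this fold's training rows fall back to the global mean.
        encoded[i] = means.get(categories[i], fallback)
```

In this toy run the single `"c"` row never sees its own target: when it is held out, `"c"` is absent from the training rows and the row receives the fold's global mean instead.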

Output Format

Task & Prediction Moment

Target and the timestamp/event at which prediction happens.

Feature Audit

Table: Feature | Available at prediction time? | Leakage risk | Keep/drop/derive.

Transform Plan

Per feature/group: transform, fit-on (train only), and rationale.

Leakage Safeguards

Where fitting sits relative to splits; time-order rules.
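
A time-order rule of this kind might look like the following sketch: a cutoff-based split plus a lagged feature that averages only strictly earlier rows. The function names, the `ts` and `amount` fields, and the dates are hypothetical, and the sketch assumes timestamps are unique (tied timestamps would need an explicit tie-breaking rule).

```python
from datetime import date


def temporal_split(rows, cutoff):
    """Rows strictly before the cutoff go to train; rows at or after it go to validation."""
    ordered = sorted(rows, key=lambda r: r["ts"])
    train = [r for r in ordered if r["ts"] < cutoff]
    valid = [r for r in ordered if r["ts"] >= cutoff]
    return train, valid


def past_mean(rows, key):
    """For each row in time order, average `key` over strictly earlier rows (None if none)."""
    ordered = sorted(rows, key=lambda r: r["ts"])
    out, total = [], 0.0
    for n_seen, row in enumerate(ordered):
        out.append(total / n_seen if n_seen else None)
        total += row[key]  # the current row is added only after its own feature is emitted
    return out


# Hypothetical event log, deliberately out of order.
rows = [
    {"ts": date(2025, 3, 1), "amount": 30.0},
    {"ts": date(2025, 1, 1), "amount": 10.0},
    {"ts": date(2025, 2, 1), "amount": 20.0},
    {"ts": date(2025, 4, 1), "amount": 40.0},
]

train, valid = temporal_split(rows, cutoff=date(2025, 3, 1))
lagged = past_mean(rows, "amount")
```

Because each row's own value is added to the running total only after its feature has been computed, the lagged mean never includes the current or any future row.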

Pipeline Steps

Ordered fit/transform sequence implementable in {{tools}}.
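
One possible shape for such an ordered sequence is sketched below in plain Python. The `MedianImputer`, `MinMaxScaler`, and `Pipeline` classes are hypothetical stand-ins written for this example, not the API of any specific library; the point is only that each step is fit on the training output of the previous step and then replayed unchanged on new data.

```python
from statistics import median


class MedianImputer:
    def fit(self, column):
        self.fill_ = median(v for v in column if v is not None)
        return self

    def transform(self, column):
        return [self.fill_ if v is None else v for v in column]


class MinMaxScaler:
    def fit(self, column):
        self.lo_ = min(column)
        self.span_ = (max(column) - self.lo_) or 1.0  # avoid dividing by zero
        return self

    def transform(self, column):
        return [(v - self.lo_) / self.span_ for v in column]


class Pipeline:
    """Runs named steps in a fixed order; fit() is meant to see training data only."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, column):
        for _, step in self.steps:
            # Each step is fit on the previous step's training output.
            column = step.fit(column).transform(column)
        return self

    def transform(self, column):
        for _, step in self.steps:
            column = step.transform(column)
        return column


# Hypothetical numeric column split into train and test.
train = [2.0, None, 4.0, 10.0]
test = [None, 6.0, 12.0]

pipe = Pipeline([("impute", MedianImputer()), ("scale", MinMaxScaler())]).fit(train)
train_out = pipe.transform(train)
test_out = pipe.transform(test)
```

The test value 12.0 scales to 1.25, above the nominal upper bound of 1.0. That is expected when the range comes from training data alone, and refitting on test data to hide it would be leakage.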

Validation Hooks

Checks to detect leakage (e.g., suspiciously high CV scores, train/serve skew).
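
One such check can be sketched as a single-feature screen: a feature that separates the target almost perfectly on its own is a leakage candidate worth inspecting. The feature names, the data, and the 0.95 threshold below are assumptions chosen for illustration, not recommended values.

```python
def single_feature_auc(values, labels):
    """AUC of one feature used directly as a score; ties between classes count half."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def flag_suspicious(features, labels, threshold=0.95):
    """Return names of features whose solo AUC (in either direction) reaches the threshold."""
    flagged = []
    for name, values in features.items():
        auc = single_feature_auc(values, labels)
        if max(auc, 1 - auc) >= threshold:
            flagged.append(name)
    return flagged


# Hypothetical data: `refund_issued` is recorded after the outcome, so it mirrors the label.
labels = [0, 0, 1, 1, 0, 1]
features = {
    "account_age_days": [5, 40, 30, 12, 22, 35],
    "refund_issued": [0, 0, 1, 1, 0, 1],
}

suspicious = flag_suspicious(features, labels)
```

A flag is a prompt for manual review rather than proof of leakage, since a legitimately strong feature can also score high.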

Published by @lacauze under the CC BY 4.0 (attribution) license.
