Overview
Encoding categorical variables is one of the most consequential preprocessing decisions in machine learning, and one of the most poorly reasoned. One-hot encoding a high-cardinality variable creates hundreds of sparse columns. Label encoding an unordered categorical implies an ordinal relationship that doesn't exist. Target encoding without cross-validation leaks the target into features. Each wrong choice silently degrades model performance.
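To make the leakage point concrete, here is a minimal sketch of out-of-fold target encoding in plain Python. It is an illustration, not the method the prompt itself prescribes: the function name `out_of_fold_target_encode`, the `n_folds` and `smoothing` parameters, and the interleaved fold assignment (row index modulo `n_folds`, which assumes the rows are already shuffled) are all assumptions made for this example.

```python
from collections import defaultdict


def out_of_fold_target_encode(categories, targets, n_folds=5, smoothing=10.0):
    """Encode each row with target statistics computed from the other folds only.

    Hypothetical helper for illustration. Assumes smoothing > 0 and that rows
    are already shuffled, since folds are assigned by index modulo n_folds.
    """
    n = len(categories)
    encoded = [0.0] * n
    for fold in range(n_folds):
        sums = defaultdict(float)   # per-category target sum on the fitting rows
        counts = defaultdict(int)   # per-category row count on the fitting rows
        total, n_fit = 0.0, 0
        for i in range(n):
            if i % n_folds != fold:  # fitting rows: everything outside this fold
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
                total += targets[i]
                n_fit += 1
        prior = total / n_fit if n_fit else 0.0
        for i in range(n):
            if i % n_folds == fold:  # held-out rows: encoded without their own target
                c = categories[i]
                # Smoothed mean: rare or unseen categories shrink toward the prior.
                encoded[i] = (sums[c] + smoothing * prior) / (counts[c] + smoothing)
    return encoded


# The single "rare" row has target 1. A naive per-category mean would encode it
# as exactly its own target; the out-of-fold version never sees that target.
example = out_of_fold_target_encode(
    ["x", "x", "rare", "x"], [0, 0, 1, 0], n_folds=2
)
```

In this toy input the `rare` row is encoded as 0.0, the prior of the other fold, rather than as its own label. In practice a library implementation with proper shuffled or stratified folds would usually be preferable to a hand-rolled loop.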
Numeric normalization has the same problem: standardization works best on roughly symmetric, approximately normal distributions; min-max scaling is sensitive to outliers, because a single extreme value sets the range for every other value; log transformation requires strictly positive values. Applying the wrong normalization to the wrong variable produces features that look clean but carry distorted information.
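The sensitivities described above can be sketched with a few lines of standard-library Python. The helper names (`min_max_scale`, `standardize`, `log_transform`) and the sample values are hypothetical, chosen only to show how one outlier affects each method.

```python
import math
import statistics


def min_max_scale(values):
    """Rescale to [0, 1]; the min and max alone determine the range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def log_transform(values):
    """Natural log; rejects zero and negative inputs instead of returning garbage."""
    if any(v <= 0 for v in values):
        raise ValueError("log transform requires strictly positive values")
    return [math.log(v) for v in values]


# Four ordinary values and one outlier (illustrative data, not from any dataset).
values = [1.0, 2.0, 3.0, 4.0, 1000.0]

squashed = min_max_scale(values)                  # first four land below 0.01
z_scores = standardize(values)                    # outlier inflates mean and std
spread = min_max_scale(log_transform(values))     # log first, then rescale
```

With min-max scaling alone, the four ordinary values are compressed into less than one percent of the output range; taking the log first spreads them out again. Whether a log transform is appropriate still depends on the variable and the downstream model.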
The Categorical Encoding & Normalization Strategy Prompt generates a complete encoding and normalization specification: method selection logic by variable type and downstream model, implementation details, and a validation framework that detects encoding-induced artifacts.
What you get:
- Encoding method selection matrix by cardinality, ordinality, and model type
- Normalization method selection by distribution and model sensitivity
- High-cardinality variable handling strategies
- Target encoding with leakage prevention
- Encoding validation and artifact detection
Built for: ML engineers and data scientists who need defensible, model-appropriate feature preprocessing decisions.