Classification/preprocessing/ishwari
This MR adds all preprocessing steps for the census income dataset:
- Cleaned missing-value markers (?, NaN, Null, Na, etc.) and standardized text.
- Dropped irrelevant or skewed features (capital-gain, capital-loss, education-num).
- Performed EDA: distributions of numerical and categorical features, a heatmap, a correlation matrix, and boxplots for outliers.
- Encoded categorical features using appropriate strategies:
  - Binary encoding for sex
  - One-hot encoding for race, relationship, marital-status, workclass, and grouped native-country
  - Grouped rare/unknown categories to reduce sparsity
- Cleaned numerical features by clipping extreme outliers in hours-per-week.
- Finalized the processed dataset and notebook:
  - Dataset saved under: data/processed/
  - Notebook: notebooks/classification/preprocessing_ishwari.ipynb
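
For reviewers, here is a minimal sketch of how the cleaning, encoding, and clipping steps listed above could look in pandas. It is not the notebook code. The function name `preprocess`, the missing-token list, the `unknown` and `other` labels, the rare-category threshold, and the clipping bounds are all illustrative assumptions, and the sample rows are made up for the demo.

```python
import pandas as pd

# Assumed set of tokens treated as missing (compared after lowercasing).
MISSING_TOKENS = ["?", "nan", "null", "na", ""]
CATEGORICAL = ["workclass", "marital-status", "relationship",
               "race", "sex", "native-country"]
ONE_HOT = ["race", "relationship", "marital-status",
           "workclass", "native-country"]
DROPPED = ["capital-gain", "capital-loss", "education-num"]


def preprocess(df, rare_threshold=2, hours_bounds=(1, 80)):
    """Hypothetical sketch; thresholds and bounds are placeholders."""
    df = df.copy()

    # Standardize text, then map missing tokens to an explicit "unknown" label.
    for col in CATEGORICAL:
        cleaned = df[col].str.strip().str.lower()
        cleaned = cleaned.mask(cleaned.isin(MISSING_TOKENS))
        df[col] = cleaned.fillna("unknown")

    # Drop the features removed in this MR.
    df = df.drop(columns=DROPPED, errors="ignore")

    # Group rare native-country values into "other" to reduce sparsity.
    counts = df["native-country"].value_counts()
    rare = counts[counts < rare_threshold].index
    df["native-country"] = df["native-country"].mask(
        df["native-country"].isin(rare), "other"
    )

    # Binary-encode sex (1 = male, 0 = anything else).
    df["sex"] = (df["sex"] == "male").astype(int)

    # Clip extreme outliers in hours-per-week to the assumed bounds.
    low, high = hours_bounds
    df["hours-per-week"] = df["hours-per-week"].clip(lower=low, upper=high)

    # One-hot encode the remaining categorical features.
    return pd.get_dummies(df, columns=ONE_HOT, dtype=int)


# Tiny made-up sample, only to exercise the function.
raw = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": [" State-gov", "?", "Private", "Private"],
    "education-num": [13, 13, 9, 7],
    "marital-status": ["Never-married", "Married-civ-spouse",
                       "Divorced", "Married-civ-spouse"],
    "relationship": ["Not-in-family", "Husband", "Not-in-family", "Husband"],
    "race": ["White", "White", "White", "Black"],
    "sex": ["Male", "Male", "Female", "Male"],
    "capital-gain": [2174, 0, 0, 0],
    "capital-loss": [0, 0, 0, 0],
    "hours-per-week": [40, 13, 99, 40],
    "native-country": ["United-States", "United-States",
                       "Cuba", "United-States"],
})
processed = preprocess(raw)
print(processed.columns.tolist())
```

Filling missing categories with an explicit label before one-hot encoding keeps those rows in the dataset and gives the model a dedicated indicator column, which is one common way to handle the unknown categories mentioned above.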