Classification/preprocessing/ishwari

This MR adds the preprocessing pipeline for the census income dataset. It covers the following steps:

  1. Cleaned missing-value markers (?, NaN, Null, Na, etc.) and standardized text values.
  2. Dropped irrelevant or skewed features (capital-gain, capital-loss, education-num).
  3. Performed EDA: plotted the distributions of numerical and categorical features, a heatmap, a correlation matrix, and boxplots to check for outliers.
  4. Encoded categorical features using appropriate strategies:
    • Binary encoding for sex
    • One-hot encoding for race, relationship, marital-status, workclass, and grouped native-country
    • Grouped rare/unknown categories to reduce sparsity
  5. Cleaned numerical features by clipping extreme outliers in hours-per-week.
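
For reviewers who want a quick picture of steps 1, 2, 4, and 5 without opening the notebook, here is a minimal pandas sketch. It is not the notebook's code. The function name `preprocess`, the `rare_threshold` and `hours_bounds` parameters, the "unknown" and "other" labels, and the tiny `sample` frame are all hypothetical and exist only for illustration; the column names come from the list above.

```python
import pandas as pd

# Assumed spellings of missing-value markers after lowercasing.
MISSING = ["?", "nan", "null", "na", ""]
DROP_COLS = ["capital-gain", "capital-loss", "education-num"]
ONE_HOT_COLS = ["race", "relationship", "marital-status", "workclass", "native-country"]


def preprocess(df, rare_threshold=0.01, hours_bounds=(1, 80)):
    """Hypothetical sketch of the cleaning, encoding, and clipping steps."""
    df = df.copy()

    # Step 1: standardize text and turn missing-value markers into real NaN.
    text_cols = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]
    for col in text_cols:
        cleaned = df[col].str.strip().str.lower()
        df[col] = cleaned.where(~cleaned.isin(MISSING))

    # Step 2: drop the features listed in the description, if present.
    df = df.drop(columns=DROP_COLS, errors="ignore")

    # Step 4a: binary-encode sex (assumption: male -> 1, anything else -> 0).
    df["sex"] = (df["sex"] == "male").astype(int)

    # Step 4b: label unknown categories explicitly instead of leaving NaN.
    cat_cols = [c for c in ONE_HOT_COLS if c in df.columns]
    df[cat_cols] = df[cat_cols].fillna("unknown")

    # Step 4c: group rare native-country values into a single "other" bucket.
    if "native-country" in df.columns:
        share = df["native-country"].value_counts(normalize=True)
        rare = share[share < rare_threshold].index
        is_rare = df["native-country"].isin(rare)
        df["native-country"] = df["native-country"].where(~is_rare, "other")

    # Step 4d: one-hot encode the remaining categorical features.
    df = pd.get_dummies(df, columns=cat_cols, dtype=int)

    # Step 5: clip extreme hours-per-week values (bounds are an assumption).
    low, high = hours_bounds
    df["hours-per-week"] = df["hours-per-week"].clip(lower=low, upper=high)
    return df


# Made-up rows, only to exercise the function.
sample = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "sex": [" Male", "Female", "Male ", "Female"],
    "race": ["White", "Black", "?", "White"],
    "native-country": ["United-States", "United-States", "Cuba", "United-States"],
    "hours-per-week": [40, 99, 20, 40],
    "capital-gain": [0, 0, 0, 0],
})
processed = preprocess(sample, rare_threshold=0.3)
print(processed.columns.tolist())
```

The clipping bounds and rarity threshold are passed as parameters so that the choices stay visible in review; the values actually used are in the notebook.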

Finalized processed dataset and notebook

  • Dataset saved under: data/processed/
  • Notebook: notebooks/classification/preprocessing_ishwari.ipynb