Complete preprocessing, added df.info()

7e827533 · Ishwari Niphade · 373aa834 · 7e827533
Commit 7e827533 authored 2 months ago by Ishwari Niphade
--- a/notebooks/induction logic programming (ILP)/aleph_preprocessing_ishwari.ipynb
+++ b/notebooks/induction logic programming (ILP)/aleph_preprocessing_ishwari.ipynb
@@ -2,7 +2,7 @@
 "cells": [
  {
   "cell_type": "code",
-   "execution_count": 229,
+   "execution_count": 17,
   "id": "5ce379e5",
   "metadata": {},
   "outputs": [
@@ -162,7 +162,7 @@
       "4             0             0              40           Cuba  <=50K  "
      ]
     },
-     "execution_count": 229,
+     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
@@ -174,6 +174,45 @@
    "df.head()\n"
   ]
  },
+  {
+   "cell_type": "code",
+   "execution_count": 18,
+   "id": "ebc33264",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<class 'pandas.core.frame.DataFrame'>\n",
+      "RangeIndex: 48842 entries, 0 to 48841\n",
+      "Data columns (total 15 columns):\n",
+      " #   Column          Non-Null Count  Dtype \n",
+      "---  ------          --------------  ----- \n",
+      " 0   age             48842 non-null  int64 \n",
+      " 1   workclass       47879 non-null  object\n",
+      " 2   fnlwgt          48842 non-null  int64 \n",
+      " 3   education       48842 non-null  object\n",
+      " 4   education-num   48842 non-null  int64 \n",
+      " 5   marital-status  48842 non-null  object\n",
+      " 6   occupation      47876 non-null  object\n",
+      " 7   relationship    48842 non-null  object\n",
+      " 8   race            48842 non-null  object\n",
+      " 9   sex             48842 non-null  object\n",
+      " 10  capital-gain    48842 non-null  int64 \n",
+      " 11  capital-loss    48842 non-null  int64 \n",
+      " 12  hours-per-week  48842 non-null  int64 \n",
+      " 13  native-country  48568 non-null  object\n",
+      " 14  Income          48842 non-null  object\n",
+      "dtypes: int64(6), object(9)\n",
+      "memory usage: 5.6+ MB\n"
+     ]
+    }
+   ],
+   "source": [
+    "df.info()"
+   ]
+  },
  {
   "cell_type": "markdown",
   "id": "d38ce2f8",
@@ -189,7 +228,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 230,
+   "execution_count": 19,
   "id": "0c0ed369",
   "metadata": {},
   "outputs": [],
@@ -199,7 +238,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 231,
+   "execution_count": 20,
   "id": "2f6717e5",
   "metadata": {},
   "outputs": [
@@ -238,7 +277,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 232,
+   "execution_count": 21,
   "id": "f5c9a812",
   "metadata": {},
   "outputs": [
@@ -367,7 +406,7 @@
       "4           wife  black  female              40           cuba  <=50k  "
      ]
     },
-     "execution_count": 232,
+     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
@@ -397,7 +436,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 233,
+   "execution_count": 22,
   "id": "656aac7c",
   "metadata": {},
   "outputs": [
@@ -412,7 +451,7 @@
       "Name: income, dtype: object"
      ]
     },
-     "execution_count": 233,
+     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
@@ -439,7 +478,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 234,
+   "execution_count": 23,
   "id": "1fb851cd",
   "metadata": {},
   "outputs": [
@@ -568,7 +607,7 @@
       "4           wife  black  female        average           cuba  less_equal_50k  "
      ]
     },
-     "execution_count": 234,
+     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
@@ -614,7 +653,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 235,
+   "execution_count": 24,
   "id": "4a920d83",
   "metadata": {},
   "outputs": [
@@ -640,7 +679,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 236,
+   "execution_count": 25,
   "id": "05ecbbcb",
   "metadata": {},
   "outputs": [],
@@ -651,7 +690,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 237,
+   "execution_count": 26,
   "id": "eb25b6e3",
   "metadata": {},
   "outputs": [],
@@ -701,7 +740,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 238,
+   "execution_count": 27,
   "id": "e9d91ea8",
   "metadata": {},
   "outputs": [
@@ -713,7 +752,7 @@
       "Name: income, dtype: int64"
      ]
     },
-     "execution_count": 238,
+     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
@@ -744,7 +783,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 239,
+   "execution_count": 28,
   "id": "972750dc",
   "metadata": {},
   "outputs": [
@@ -792,7 +831,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 240,
+   "execution_count": 29,
   "id": "1984d074",
   "metadata": {},
   "outputs": [
@@ -862,7 +901,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 241,
+   "execution_count": 30,
   "id": "8976173c",
   "metadata": {},
   "outputs": [
@@ -912,7 +951,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 242,
+   "execution_count": 31,
   "id": "10296bac",
   "metadata": {},
   "outputs": [],

 %% Cell type:code id:5ce379e5 tags:
 ``` python
 import pandas as pd
 df = pd.read_csv('../../data/raw/census_income_dataset_original.csv')  # adjust path as needed
 df.head()
 ```
 %% Output
       age         workclass  fnlwgt  education  education-num  \
    0   39         State-gov   77516  Bachelors             13
    1   50  Self-emp-not-inc   83311  Bachelors             13
    2   38           Private  215646    HS-grad              9
    3   53           Private  234721       11th              7
    4   28           Private  338409  Bachelors             13
           marital-status         occupation   relationship   race     sex  \
    0       Never-married       Adm-clerical  Not-in-family  White    Male
    1  Married-civ-spouse    Exec-managerial        Husband  White    Male
    2            Divorced  Handlers-cleaners  Not-in-family  White    Male
    3  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male
    4  Married-civ-spouse     Prof-specialty           Wife  Black  Female
       capital-gain  capital-loss  hours-per-week native-country Income
    0          2174             0              40  United-States  <=50K
    1             0             0              13  United-States  <=50K
    2             0             0              40  United-States  <=50K
    3             0             0              40  United-States  <=50K
    4             0             0              40           Cuba  <=50K
+%% Cell type:code id:ebc33264 tags:
+``` python
+df.info()
+```
+%% Output
+    <class 'pandas.core.frame.DataFrame'>
+    RangeIndex: 48842 entries, 0 to 48841
+    Data columns (total 15 columns):
+     #   Column          Non-Null Count  Dtype
+    ---  ------          --------------  -----
+     0   age             48842 non-null  int64
+     1   workclass       47879 non-null  object
+     2   fnlwgt          48842 non-null  int64
+     3   education       48842 non-null  object
+     4   education-num   48842 non-null  int64
+     5   marital-status  48842 non-null  object
+     6   occupation      47876 non-null  object
+     7   relationship    48842 non-null  object
+     8   race            48842 non-null  object
+     9   sex             48842 non-null  object
+     10  capital-gain    48842 non-null  int64
+     11  capital-loss    48842 non-null  int64
+     12  hours-per-week  48842 non-null  int64
+     13  native-country  48568 non-null  object
+     14  Income          48842 non-null  object
+    dtypes: int64(6), object(9)
+    memory usage: 5.6+ MB
 %% Cell type:markdown id:d38ce2f8 tags:
 ### Dropping Irrelevant Columns
 The following columns are not required for ILP-based learning:
 - `fnlwgt`: weighting column used for sampling, not predictive
 - `education-num`: numeric encoding of `education`, so it carries no additional information
 - `capital-gain` and `capital-loss`: sparse, skewed, and mostly zero
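The redundancy of `education-num` can be checked on the raw frame before dropping it. The following is a minimal sketch, not part of the notebook: `is_one_to_one` is a hypothetical helper and `toy` is a small stand-in frame built from the values visible in `df.head()`.

```python
import pandas as pd

def is_one_to_one(frame: pd.DataFrame, a: str, b: str) -> bool:
    """True if each value of column a pairs with exactly one value of b, and vice versa."""
    a_to_b = frame.groupby(a)[b].nunique().max() == 1
    b_to_a = frame.groupby(b)[a].nunique().max() == 1
    return bool(a_to_b and b_to_a)

# Hypothetical stand-in for the raw data (values taken from the head() output).
toy = pd.DataFrame({
    'education': ['Bachelors', 'HS-grad', '11th', 'Bachelors'],
    'education-num': [13, 9, 7, 13],
})

redundant = is_one_to_one(toy, 'education', 'education-num')
print(redundant)
```

If the check returns `True` on the full dataset, one of the two columns can be dropped without losing information.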
 %% Cell type:code id:0c0ed369 tags:
 ``` python
 df.drop(['fnlwgt', 'education-num', 'capital-gain', 'capital-loss'], axis=1, inplace=True)
 ```
 %% Cell type:code id:2f6717e5 tags:
 ``` python
 # Checking duplicates
 print(f"Duplicates before: {df.duplicated().sum()}")
 # Dropping duplicates
 df = df.drop_duplicates()
 # Confirming removal
 print(f"Duplicates Remaining after drop: {df.duplicated().sum()}")
 ```
 %% Output
    Duplicates before: 5513
    Duplicates Remaining after drop: 0
 %% Cell type:markdown id:a9497646 tags:
 ### Cleaning Text and Handling Missing Values
 We:
 - Convert all text to lowercase
 - Strip whitespace
 - Replace missing-value markers (`?`, `NaN`, `NA`, and `None`) with a uniform `'unknown'`
 - Replace hyphens with underscores in column names and values
 %% Cell type:code id:f5c9a812 tags:
 ``` python
 # Standardizing column names
 df.columns = df.columns.str.strip().str.lower().str.replace('-', '_')
 # Cleaning string columns
 for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].astype(str).str.strip().str.lower()
    df[col] = df[col].replace(['?', 'nan', 'na', 'none'], 'unknown')
    df[col] = df[col].str.replace('-', '_')
 # Removing trailing period from income
 df['income'] = df['income'].str.replace('.', '', regex=False)
 df.head()
 ```
 %% Output
       age         workclass  education      marital_status         occupation  \
    0   39         state_gov  bachelors       never_married       adm_clerical
    1   50  self_emp_not_inc  bachelors  married_civ_spouse    exec_managerial
    2   38           private    hs_grad            divorced  handlers_cleaners
    3   53           private       11th  married_civ_spouse  handlers_cleaners
    4   28           private  bachelors  married_civ_spouse     prof_specialty
        relationship   race     sex  hours_per_week native_country income
    0  not_in_family  white    male              40  united_states  <=50k
    1        husband  white    male              13  united_states  <=50k
    2  not_in_family  white    male              40  united_states  <=50k
    3        husband  black    male              40  united_states  <=50k
    4           wife  black  female              40           cuba  <=50k
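The cleaning chain depends on `astype(str)` turning real missing values into the literal string `'nan'`, which is then caught by the replacement list. A minimal sketch on a hypothetical series `raw` (not the notebook's data) shows each step acting on typical inputs:

```python
import numpy as np
import pandas as pd

# Hypothetical raw values: padded text, a '?' marker, and a true missing value.
raw = pd.Series([' State-gov', '?', np.nan, 'Private '])

clean = (
    raw.astype(str)                                      # NaN becomes the string 'nan'
       .str.strip()
       .str.lower()
       .replace(['?', 'nan', 'na', 'none'], 'unknown')   # whole-value replacement
       .str.replace('-', '_', regex=False)               # literal hyphen, not a regex
)
print(clean.tolist())
```

Note that `Series.replace` with a list matches whole values only, so a category that merely contains `na` (for example `china`) is left untouched.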
 %% Cell type:markdown id:43c1e695 tags:
 ### Updating Income labels
 %% Cell type:code id:656aac7c tags:
 ``` python
 df['income'] = df['income'].replace({
    '<=50k': 'less_equal_50k',
    '>50k': 'greater_50k'
 })
 df['income'].head()
 ```
 %% Output
    0    less_equal_50k
    1    less_equal_50k
    2    less_equal_50k
    3    less_equal_50k
    4    less_equal_50k
    Name: income, dtype: object
 %% Cell type:markdown id:41f5401c tags:
 ### Discretizing `age` and `hours_per_week`
 We convert continuous values into discrete bins for easier rule learning:
 - `age`: young (≤30), middle (31–55), senior (>55)
 - `hours_per_week`: low (<30), average (30–50), high (>50)
 %% Cell type:code id:1fb851cd tags:
 ``` python
 def bin_age(age):
    if age <= 30:
        return 'young'
    elif age <= 55:
        return 'middle'
    else:
        return 'senior'
 def bin_hours(hours):
    if hours < 30:
        return 'low'
    elif hours <= 50:
        return 'average'
    else:
        return 'high'
 df['age'] = df['age'].apply(bin_age)
 df['hours_per_week'] = df['hours_per_week'].apply(bin_hours)
 df.head()
 ```
 %% Output
          age         workclass  education      marital_status         occupation  \
    0  middle         state_gov  bachelors       never_married       adm_clerical
    1  middle  self_emp_not_inc  bachelors  married_civ_spouse    exec_managerial
    2  middle           private    hs_grad            divorced  handlers_cleaners
    3  middle           private       11th  married_civ_spouse  handlers_cleaners
    4   young           private  bachelors  married_civ_spouse     prof_specialty
        relationship   race     sex hours_per_week native_country          income
    0  not_in_family  white    male        average  united_states  less_equal_50k
    1        husband  white    male            low  united_states  less_equal_50k
    2  not_in_family  white    male        average  united_states  less_equal_50k
    3        husband  black    male        average  united_states  less_equal_50k
    4           wife  black  female        average           cuba  less_equal_50k
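The same bins can also be expressed with `pd.cut`, which avoids a Python-level function call per row. This is a sketch of an alternative, not the notebook's method; the series `ages` and `hours` are hypothetical, and the `29` edge assumes integer hours (both columns are `int64` in `df.info()`).

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen to sit on each bin boundary.
ages = pd.Series([17, 30, 31, 55, 56, 90])
hours = pd.Series([1, 29, 30, 50, 51, 99])

# Bins are closed on the right by default: (-inf, 30], (30, 55], (55, inf)
age_bins = pd.cut(
    ages, bins=[-np.inf, 30, 55, np.inf], labels=['young', 'middle', 'senior']
).astype(str)

# (-inf, 29], (29, 50], (50, inf): equivalent to <30, 30-50, >50 for integer hours
hour_bins = pd.cut(
    hours, bins=[-np.inf, 29, 50, np.inf], labels=['low', 'average', 'high']
).astype(str)

print(age_bins.tolist())
print(hour_bins.tolist())
```

The trailing `astype(str)` converts the categorical result back to plain strings so the column matches what the `apply`-based version produces.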
 %% Cell type:markdown id:1c091702 tags:
 ### Grouping Native Countries into Regions
 To reduce sparsity and simplify ILP rule generation, we group native countries into broader regions such as:
 - North America
 - Latin America
 - Asia
 - Europe
 - Middle East
 - Africa
 - Unknown
 %% Cell type:code id:4a920d83 tags:
 ``` python
 print("🔍 Unique country names BEFORE mapping:")
 print(df['native_country'].str.strip().str.lower().unique())
 ```
 %% Output
    🔍 Unique country names BEFORE mapping:
    ['united_states' 'cuba' 'jamaica' 'india' 'unknown' 'mexico' 'south'
     'puerto_rico' 'honduras' 'england' 'canada' 'germany' 'iran'
     'philippines' 'italy' 'poland' 'columbia' 'cambodia' 'thailand' 'ecuador'
     'laos' 'taiwan' 'haiti' 'portugal' 'dominican_republic' 'el_salvador'
     'france' 'guatemala' 'china' 'japan' 'yugoslavia' 'peru'
     'outlying_us(guam_usvi_etc)' 'scotland' 'trinadad&tobago' 'greece'
     'nicaragua' 'vietnam' 'hong' 'ireland' 'hungary' 'holand_netherlands']
 %% Cell type:code id:05ecbbcb tags:
 ``` python
 # Drop 'south': the value looks suspicious and may be a partial or mis-entered country name.
 # Values were lowercased during cleaning, so the comparison must use lowercase 'south'.
 df = df[df['native_country'] != 'south']
 ```
 %% Cell type:code id:eb25b6e3 tags:
 ``` python
 # Defining region lists
 # Names use underscores to match the cleaned values (hyphens were replaced during cleaning)
 north_america = [c.strip().lower() for c in ['united_states', 'canada', 'puerto_rico', 'outlying_us(guam_usvi_etc)']]
 latin_america = [c.strip().lower() for c in ['mexico', 'cuba', 'jamaica', 'honduras', 'el_salvador',
                 'columbia', 'guatemala', 'nicaragua', 'dominican_republic',
                 'trinadad&tobago', 'ecuador', 'haiti', 'peru']]
 asia = [c.strip().lower() for c in ['india', 'china', 'japan', 'vietnam', 'philippines', 'thailand',
        'cambodia', 'laos', 'taiwan', 'hong']]
 europe = [c.strip().lower() for c in ['england', 'germany', 'italy', 'poland', 'portugal', 'france',
          'greece', 'ireland', 'hungary', 'scotland', 'yugoslavia', 'holand_netherlands']]
 middle_east = [c.strip().lower() for c in ['iran']]
 africa = [c.strip().lower() for c in ['south_africa', 'egypt']]
 # Defining mapping function
 def map_country_to_region(country):
    if country in north_america:
        return 'north_america'
    elif country in latin_america:
        return 'latin_america'
    elif country in asia:
        return 'asia'
    elif country in europe:
        return 'europe'
    elif country in middle_east:
        return 'middle_east'
    elif country in africa:
        return 'africa'
    else:
        return 'unknown'
 # Applying mapping
 df['native_country'] = df['native_country'].apply(map_country_to_region)
 ```
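Because any unlisted value falls through to `'unknown'`, a spelling mismatch between the region lists and the cleaned values (for example hyphens versus underscores) goes unnoticed. A minimal sketch of a dictionary-based mapping with a coverage check follows; the shortened `regions` dictionary and the `countries` series are hypothetical stand-ins for the notebook's full lists and column.

```python
import pandas as pd

# Hypothetical, shortened region lists; names use underscores like the cleaned data.
regions = {
    'north_america': ['united_states', 'canada'],
    'latin_america': ['mexico', 'cuba'],
    'asia': ['india', 'hong'],
}

# Invert to a flat country -> region lookup.
country_to_region = {
    country: region
    for region, members in regions.items()
    for country in members
}

countries = pd.Series(['united_states', 'cuba', 'hong', 'unknown', 'atlantis'])

# Series.map yields NaN for keys missing from the dict; those become 'unknown'.
mapped = countries.map(country_to_region).fillna('unknown')

# Values that ended up 'unknown' without being 'unknown' in the input.
unmapped = sorted(set(countries[mapped.eq('unknown')]) - {'unknown'})

print(mapped.tolist())
print(unmapped)
```

Inspecting `unmapped` before overwriting the column makes it easy to spot a country that was meant to belong to a region but is spelled differently in the data.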
 %% Cell type:markdown id:523d5485 tags:
 ### Stratified Sampling
 To ensure balanced examples for ILP learning, we take an equal number of rows from each income class (`less_equal_50k` and `greater_50k`): 500 rows in total, 250 from each class.
 %% Cell type:code id:e9d91ea8 tags:
 ``` python
 df_sample = df.groupby('income', group_keys=False).apply(lambda x: x.sample(min(250, len(x)), random_state=42))
 df_sample = df_sample.reset_index(drop=True)
 df_sample['income'].value_counts()
 ```
 %% Output
    greater_50k       250
    less_equal_50k    250
    Name: income, dtype: int64
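Per-class sampling can also be written with `DataFrameGroupBy.sample` (available since pandas 1.1). This is a sketch of an alternative, assuming every class has at least `n_per_class` rows; `toy`, `row_id`, and `n_per_class` are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical imbalanced frame: 6 rows of one class, 4 of the other.
toy = pd.DataFrame({
    'income': ['less_equal_50k'] * 6 + ['greater_50k'] * 4,
    'row_id': range(10),
})

n_per_class = 3  # assumption: no class has fewer rows than this

balanced = (
    toy.groupby('income')
       .sample(n=n_per_class, random_state=42)
       .reset_index(drop=True)
)

counts = balanced['income'].value_counts().to_dict()
print(counts)
```

Unlike the `min(250, len(x))` guard in the `apply` version, this form raises a `ValueError` when a class is smaller than `n_per_class`, which may or may not be the desired behaviour.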
 %% Cell type:markdown id:09072a9d tags:
 ### Exploratory Data Analysis (EDA)
 %% Cell type:markdown id:61c8c808 tags:
 ##### 1. Target variable distribution
 %% Cell type:code id:972750dc tags:
 ``` python
 import seaborn as sns
 import matplotlib.pyplot as plt
 # Visualizing income distribution
 sns.countplot(x='income', data=df_sample)
 plt.title('Distribution of Income Classes')
 plt.xlabel('Income')
 plt.ylabel('Count')
 plt.show()
 ```
 %% Output
 %% Cell type:markdown id:f85690bf tags:
 Observation:
 The income classes are evenly distributed in the sampled dataset, with 250 rows each for `less_equal_50k` and `greater_50k`. This balanced sampling ensures that the ILP system receives an equal number of positive and negative examples.
 %% Cell type:markdown id:804e67a6 tags:
 ##### 2. Distribution of categorical features
 %% Cell type:code id:1984d074 tags:
 ``` python
 import matplotlib.pyplot as plt
 import seaborn as sns
 cat_cols = ['education', 'workclass', 'occupation', 'marital_status', 'sex', 'race', 'native_country', 'relationship']
 fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(18, 12))
 axes = axes.flatten()
 for i, col in enumerate(cat_cols):
    sns.countplot(x=col, data=df_sample, ax=axes[i], order=df_sample[col].value_counts().index)
    axes[i].set_title(f'{col.title()} Distribution')
    axes[i].tick_params(axis='x', rotation=45)
 fig.delaxes(axes[-1])
 plt.tight_layout()
 plt.show()
 ```
 %% Output
 %% Cell type:markdown id:82f3ca0f tags:
 Observation:
 1. Education is dominated by levels like 'hs_grad', 'some_college', and 'bachelors', while very few individuals have 'preschool' or '1st_4th' education.
 2. Workclass shows 'private' as the most common category, followed by the self-employed and government categories.
 3. In occupation, technical, administrative, and managerial jobs appear most frequently, while categories like 'armed_forces' are extremely rare.
 4. Marital status is primarily married and 'never_married', aligning with a typical working-age population.
 5. Sex distribution shows more males than females in this sample.
 6. Race is skewed towards 'white', with other races making up a small portion.
 7. Native country is heavily concentrated in the United States, which leaves other countries sparse and motivates grouping them into regions.
 8. Relationship status is mostly 'husband' and 'not_in_family', likely reflecting primary household earners or individuals living alone.
 %% Cell type:markdown id:d90f52be tags:
 ##### 3. Distribution of age and hours_per_week
 %% Cell type:code id:8976173c tags:
 ``` python
 # Plotting distributions of binned numerical features
 fig, axes = plt.subplots(1, 2, figsize=(12, 5))
 # Age bin distribution
 sns.countplot(x='age', data=df_sample, order=['young', 'middle', 'senior'], ax=axes[0])
 axes[0].set_title('Discretized Age Distribution')
 axes[0].set_xlabel('Age Group')
 axes[0].set_ylabel('Count')
 # Hours-per-week bin distribution
 sns.countplot(x='hours_per_week', data=df_sample, order=['low', 'average', 'high'], ax=axes[1])
 axes[1].set_title('Discretized Hours-per-Week Distribution')
 axes[1].set_xlabel('Workload Category')
 axes[1].set_ylabel('Count')
 plt.tight_layout()
 plt.show()
 ```
 %% Output
 %% Cell type:markdown id:a6cfde4e tags:
 Observation:
 1. The `age` variable shows a majority of individuals in the **middle** age range (31–55), followed by **young** (≤30) and fewer in the **senior** category (>55).
 2. The `hours_per_week` variable is concentrated in the **average** bin (30–50 hours). Very few individuals fall in the **high** or **low** workload groups.
 3. This confirms that binning has worked as expected and the categories are reasonably balanced for rule learning in ILP.
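Balance across bins can be stated numerically instead of being read off the plot. A minimal sketch, using a hypothetical series `toy` in place of `df_sample['age']`:

```python
import pandas as pd

# Hypothetical binned values standing in for a discretized column.
toy = pd.Series(['middle', 'middle', 'young', 'senior'], name='age')

# normalize=True returns each category's share of the rows instead of raw counts.
shares = toy.value_counts(normalize=True)
print(shares)
```

Applying the same call to `age` and `hours_per_week` in the sampled frame would give the exact share of each bin.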
 %% Cell type:code id:10296bac tags:
 ``` python
 df.to_csv('../../data/processed/processed_ilp_aleph_dataset.csv', index=False)
 ```