Commit 7e827533 authored by Ishwari Niphade

Complete preprocessing, added df.info()

parent 373aa834
Merge request !6: ilp-classification/preprocessing/ishwari
Pipeline #104188 passed
%% Cell type:code id:5ce379e5 tags:
``` python
import pandas as pd
df = pd.read_csv('../../data/raw/census_income_dataset_original.csv')  # adjust path as needed
df.head()
```
%% Output
age workclass fnlwgt education education-num \
0 39 State-gov 77516 Bachelors 13
1 50 Self-emp-not-inc 83311 Bachelors 13
2 38 Private 215646 HS-grad 9
3 53 Private 234721 11th 7
4 28 Private 338409 Bachelors 13
marital-status occupation relationship race sex \
0 Never-married Adm-clerical Not-in-family White Male
1 Married-civ-spouse Exec-managerial Husband White Male
2 Divorced Handlers-cleaners Not-in-family White Male
3 Married-civ-spouse Handlers-cleaners Husband Black Male
4 Married-civ-spouse Prof-specialty Wife Black Female
capital-gain capital-loss hours-per-week native-country Income
0 2174 0 40 United-States <=50K
1 0 0 13 United-States <=50K
2 0 0 40 United-States <=50K
3 0 0 40 United-States <=50K
4 0 0 40 Cuba <=50K
%% Cell type:code id:ebc33264 tags:
``` python
df.info()
```
%% Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 48842 non-null int64
1 workclass 47879 non-null object
2 fnlwgt 48842 non-null int64
3 education 48842 non-null object
4 education-num 48842 non-null int64
5 marital-status 48842 non-null object
6 occupation 47876 non-null object
7 relationship 48842 non-null object
8 race 48842 non-null object
9 sex 48842 non-null object
10 capital-gain 48842 non-null int64
11 capital-loss 48842 non-null int64
12 hours-per-week 48842 non-null int64
13 native-country 48568 non-null object
14 Income 48842 non-null object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB
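%% Cell type:markdown tags:
The non-null counts above already hint at missing data in `workclass`, `occupation`, and `native-country`. A quick optional check (a small sketch) lists the missing counts per column directly:
%% Cell type:code tags:
``` python
# Optional check: number of missing (NaN) entries per column
df.isnull().sum()
```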
%% Cell type:markdown id:d38ce2f8 tags:
### Dropping Irrelevant Columns
The following columns are not required for ILP-based learning:
- `fnlwgt`: census sampling weight, not predictive
- `education-num`: redundant numeric encoding of `education`
- `capital-gain` and `capital-loss`: sparse, skewed, and mostly zero
%% Cell type:code id:0c0ed369 tags:
``` python
df.drop(['fnlwgt', 'education-num', 'capital-gain', 'capital-loss'], axis=1, inplace=True)
```
%% Cell type:code id:2f6717e5 tags:
``` python
# Checking duplicates
print(f"Duplicates before: {df.duplicated().sum()}")
# Dropping duplicates
df = df.drop_duplicates()
# Confirming removal
print(f"Duplicates Remaining after drop: {df.duplicated().sum()}")
```
%% Output
Duplicates before: 5513
Duplicates Remaining after drop: 0
%% Cell type:markdown id:a9497646 tags:
### Cleaning Text and Handling Missing Values
We:
- Convert all text to lowercase
- Strip whitespace
- Replace placeholders such as `?`, `NaN`, and `None` with a uniform `'unknown'`
%% Cell type:code id:f5c9a812 tags:
``` python
# Standardizing column names
df.columns = df.columns.str.strip().str.lower().str.replace('-', '_')
# Cleaning string columns
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].astype(str).str.strip().str.lower()
    df[col] = df[col].replace(['?', 'nan', 'na', 'none'], 'unknown')
    df[col] = df[col].str.replace('-', '_')
# Removing trailing period from income
df['income'] = df['income'].str.replace('.', '', regex=False)
df.head()
```
%% Output
age workclass education marital_status occupation \
0 39 state_gov bachelors never_married adm_clerical
1 50 self_emp_not_inc bachelors married_civ_spouse exec_managerial
2 38 private hs_grad divorced handlers_cleaners
3 53 private 11th married_civ_spouse handlers_cleaners
4 28 private bachelors married_civ_spouse prof_specialty
relationship race sex hours_per_week native_country income
0 not_in_family white male 40 united_states <=50k
1 husband white male 13 united_states <=50k
2 not_in_family white male 40 united_states <=50k
3 husband black male 40 united_states <=50k
4 wife black female 40 cuba <=50k
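%% Cell type:markdown tags:
As an optional follow-up check (a small sketch), we can count how many `'unknown'` placeholders each text column now contains; these correspond to the `?`/NaN entries replaced above:
%% Cell type:code tags:
``` python
# Optional check: count of 'unknown' placeholders per text column
(df.select_dtypes(include='object') == 'unknown').sum()
```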
%% Cell type:markdown id:43c1e695 tags:
### Updating Income Labels
%% Cell type:code id:656aac7c tags:
``` python
df['income'] = df['income'].replace({
    '<=50k': 'less_equal_50k',
    '>50k': 'greater_50k'
})
df['income'].head()
```
%% Output
0 less_equal_50k
1 less_equal_50k
2 less_equal_50k
3 less_equal_50k
4 less_equal_50k
Name: income, dtype: object
%% Cell type:markdown id:41f5401c tags:
### Discretizing `age` and `hours-per-week`
We convert continuous values into discrete bins for easier rule learning:
- `age`: young (≤30), middle (31–55), senior (>55)
- `hours-per-week`: low (<30), average (30–50), high (>50)
%% Cell type:code id:1fb851cd tags:
``` python
def bin_age(age):
    if age <= 30:
        return 'young'
    elif age <= 55:
        return 'middle'
    else:
        return 'senior'

def bin_hours(hours):
    if hours < 30:
        return 'low'
    elif hours <= 50:
        return 'average'
    else:
        return 'high'

df['age'] = df['age'].apply(bin_age)
df['hours_per_week'] = df['hours_per_week'].apply(bin_hours)
df.head()
```
%% Output
age workclass education marital_status occupation \
0 middle state_gov bachelors never_married adm_clerical
1 middle self_emp_not_inc bachelors married_civ_spouse exec_managerial
2 middle private hs_grad divorced handlers_cleaners
3 middle private 11th married_civ_spouse handlers_cleaners
4 young private bachelors married_civ_spouse prof_specialty
relationship race sex hours_per_week native_country income
0 not_in_family white male average united_states less_equal_50k
1 husband white male low united_states less_equal_50k
2 not_in_family white male average united_states less_equal_50k
3 husband black male average united_states less_equal_50k
4 wife black female average cuba less_equal_50k
%% Cell type:markdown id:1c091702 tags:
### Grouping Native Countries into Regions
To reduce sparsity and simplify ILP rule generation, we group native countries into broader regions such as:
- North America
- Latin America
- Asia
- Europe
- Middle East
- Africa
- Unknown
%% Cell type:code id:4a920d83 tags:
``` python
print("🔍 Unique country names BEFORE mapping:")
print(df['native_country'].str.strip().str.lower().unique())
```
%% Output
🔍 Unique country names BEFORE mapping:
['united_states' 'cuba' 'jamaica' 'india' 'unknown' 'mexico' 'south'
 'puerto_rico' 'honduras' 'england' 'canada' 'germany' 'iran'
 'philippines' 'italy' 'poland' 'columbia' 'cambodia' 'thailand' 'ecuador'
 'laos' 'taiwan' 'haiti' 'portugal' 'dominican_republic' 'el_salvador'
 'france' 'guatemala' 'china' 'japan' 'yugoslavia' 'peru'
 'outlying_us(guam_usvi_etc)' 'scotland' 'trinadad&tobago' 'greece'
 'nicaragua' 'vietnam' 'hong' 'ireland' 'hungary' 'holand_netherlands']
%% Cell type:code id:05ecbbcb tags:
``` python
# Dropping 'south': this value looks suspicious and may be a partial/mis-entered country name
df = df[df['native_country'] != 'south']  # values were already lowercased above
```
%% Cell type:code id:eb25b6e3 tags:
``` python
# Defining region lists, normalized to lowercase with underscores so they
# match the cleaned values in native_country
north_america = [c.strip().lower().replace('-', '_') for c in ['united-states', 'canada', 'puerto-rico', 'outlying-us(guam-usvi-etc)']]
latin_america = [c.strip().lower().replace('-', '_') for c in ['mexico', 'cuba', 'jamaica', 'honduras', 'el-salvador',
                                                               'columbia', 'guatemala', 'nicaragua', 'dominican-republic',
                                                               'trinadad&tobago', 'ecuador', 'haiti', 'peru']]
asia = [c.strip().lower().replace('-', '_') for c in ['india', 'china', 'japan', 'vietnam', 'philippines', 'thailand',
                                                      'cambodia', 'laos', 'taiwan', 'hong']]
europe = [c.strip().lower().replace('-', '_') for c in ['england', 'germany', 'italy', 'poland', 'portugal', 'france',
                                                        'greece', 'ireland', 'hungary', 'scotland', 'yugoslavia', 'holand-netherlands']]
middle_east = [c.strip().lower().replace('-', '_') for c in ['iran']]
africa = [c.strip().lower().replace('-', '_') for c in ['south-africa', 'egypt']]

# Defining mapping function
def map_country_to_region(country):
    if country in north_america:
        return 'north_america'
    elif country in latin_america:
        return 'latin_america'
    elif country in asia:
        return 'asia'
    elif country in europe:
        return 'europe'
    elif country in middle_east:
        return 'middle_east'
    elif country in africa:
        return 'africa'
    else:
        return 'unknown'

# Applying mapping
df['native_country'] = df['native_country'].apply(map_country_to_region)
```
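%% Cell type:markdown tags:
A quick optional sanity check (sketch): every country should now fall into one of the defined regions, with anything unmatched collected under `unknown`:
%% Cell type:code tags:
``` python
# Optional check: region counts after mapping; a suspiciously large 'unknown'
# count would indicate country names that did not match any region list
df['native_country'].value_counts()
```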
%% Cell type:markdown id:523d5485 tags:
### Stratified Sampling
To give the ILP learner balanced examples, we take an equal number of rows from each class (`<=50K` and `>50K`): 500 rows in total, 250 per class.
%% Cell type:code id:e9d91ea8 tags:
``` python
df_sample = df.groupby('income', group_keys=False).apply(lambda x: x.sample(min(250, len(x)), random_state=42))
df_sample = df_sample.reset_index(drop=True)
df_sample['income'].value_counts()
```
%% Output
greater_50k 250
less_equal_50k 250
Name: income, dtype: int64
%% Cell type:markdown id:09072a9d tags:
### Exploratory Data Analysis (EDA)
%% Cell type:markdown id:61c8c808 tags:
##### 1. Target variable distribution
%% Cell type:code id:972750dc tags:
``` python
import seaborn as sns
import matplotlib.pyplot as plt
# Visualizing income distribution
sns.countplot(x='income', data=df_sample)
plt.title('Distribution of Income Classes')
plt.xlabel('Income')
plt.ylabel('Count')
plt.show()
```
%% Output
%% Cell type:markdown id:f85690bf tags:
Observation:
The income classes are evenly distributed in the sampled dataset, with 250 rows each for `<=50k` and `>50k`. This stratified sampling ensures that the ILP system receives a balanced set of positive and negative examples.
%% Cell type:markdown id:804e67a6 tags:
##### 2. Distribution of categorical features
%% Cell type:code id:1984d074 tags:
``` python
import matplotlib.pyplot as plt
import seaborn as sns

cat_cols = ['education', 'workclass', 'occupation', 'marital_status', 'sex', 'race', 'native_country', 'relationship']
fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(18, 12))
axes = axes.flatten()
for i, col in enumerate(cat_cols):
    sns.countplot(x=col, data=df_sample, ax=axes[i], order=df_sample[col].value_counts().index)
    axes[i].set_title(f'{col.title()} Distribution')
    axes[i].tick_params(axis='x', rotation=45)
fig.delaxes(axes[-1])
plt.tight_layout()
plt.show()
```
%% Output
%% Cell type:markdown id:82f3ca0f tags:
Observation:
1. Education is dominated by levels like 'hs_grad', 'some_college', and 'bachelors', while very few individuals have 'preschool' or '1st_4th' education.
2. Workclass shows 'private' as the most common category, followed by self-employed and government roles.
3. In occupation, technical, administrative, and managerial jobs appear most frequently, while categories like 'armed_forces' are extremely rare.
4. Marital status is primarily 'married' and 'never_married', aligning with a typical working-age population.
5. Sex distribution shows more males than females in this sample.
6. Race is skewed towards 'white', with other races making up a small portion.
7. Native country is heavily concentrated in 'north_america' (dominated by the united_states); the remaining countries are sparse, which is why they were grouped into broader regions.
8. Relationship status is mostly 'husband' and 'not_in_family', likely reflecting primary household earners or individuals living alone.
%% Cell type:markdown id:d90f52be tags:
##### 3. Distribution of age and hours_per_week
%% Cell type:code id:8976173c tags:
``` python
# Plotting distributions of binned numerical features
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Age bin distribution
sns.countplot(x='age', data=df_sample, order=['young', 'middle', 'senior'], ax=axes[0])
axes[0].set_title('Discretized Age Distribution')
axes[0].set_xlabel('Age Group')
axes[0].set_ylabel('Count')
# Hours-per-week bin distribution
sns.countplot(x='hours_per_week', data=df_sample, order=['low', 'average', 'high'], ax=axes[1])
axes[1].set_title('Discretized Hours-per-Week Distribution')
axes[1].set_xlabel('Workload Category')
axes[1].set_ylabel('Count')
plt.tight_layout()
plt.show()
```
%% Output
%% Cell type:markdown id:a6cfde4e tags:
Observation:
1. The `age` variable shows a majority of individuals in the **middle** age range (31–55), followed by **young** (≤30) and fewer in the **senior** category (>55).
2. The `hours-per-week` variable is dominated by the **average** band (30–50 hours); relatively few individuals fall in the **high** or **low** workload groups.
3. This confirms that binning has worked as expected and the categories are reasonably balanced for rule learning in ILP.
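%% Cell type:markdown tags:
The counts behind the two bar charts can also be read off numerically (an optional cross-check, sketched below):
%% Cell type:code tags:
``` python
# Optional cross-check: numeric counts for the binned features in the sample
print(df_sample['age'].value_counts())
print(df_sample['hours_per_week'].value_counts())
```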
%% Cell type:code id:10296bac tags:
``` python
# Saving the full processed dataset
df.to_csv('../../data/processed/processed_ilp_aleph_dataset.csv', index=False)
```