To ensure balanced examples for ILP learning, we take an equal number of rows from each class (`<=50K` and `>50K`): 500 rows in total, 250 from each class.
The income classes are evenly distributed in the sampled dataset, with 250 rows each for `<=50k` and `>50k`. This stratified sampling ensures that the ILP system receives a balanced set of positive and negative examples.
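The balanced draw described above can be sketched with pandas as follows. This is a minimal illustration, not the notebook's actual code: the helper name `balanced_sample`, the column name `income`, the seed, and the toy data are all assumptions made for the example.

```python
import pandas as pd


def balanced_sample(df, label_col, n_per_class, seed=42):
    """Draw the same number of rows from every class in `label_col`.

    Hypothetical helper: the name, signature and seed are illustrative.
    """
    return (
        df.groupby(label_col)
        .sample(n=n_per_class, random_state=seed)
        .reset_index(drop=True)
    )


# Toy stand-in for the real data (assumption): 700 rows of `<=50k`, 300 of `>50k`.
toy = pd.DataFrame({
    "age": list(range(20, 70)) * 20,
    "income": ["<=50k"] * 700 + [">50k"] * 300,
})

sample = balanced_sample(toy, "income", n_per_class=250)

# Check that both classes are equally represented in the sample.
counts = sample["income"].value_counts()
print(counts)
```

Fixing `random_state` keeps the draw reproducible, which matters when the learned rules are later compared across runs.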
1. Education is dominated by levels like 'hs-grad', 'some-college', and 'bachelors', while very few individuals have 'preschool' or '1st-4th' education.
2. Workclass shows 'private' as the most common category, followed by 'self-employed' and 'government' roles.
3. In occupation, technical, administrative, and managerial jobs appear most frequently, while categories like 'armed-forces' are extremely rare.
4. Marital status is primarily 'married' and 'never-married', aligning with a typical working-age population.
5. Sex distribution shows more males than females in this sample.
6. Race is skewed towards 'white', with other races making up a small portion.
7. Native country is heavily concentrated in the 'united-states', which leaves other countries sparse and makes them a good case for grouping into regions or using 'other'.
8. Relationship status is mostly 'husband' and 'not-in-family', likely reflecting primary household earners or individuals living alone.
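The suggestion in point 7, collapsing sparse categories into 'other', could look roughly like the sketch below. The helper `group_rare`, the 2% threshold, and the toy country counts are assumptions for illustration only; they are not taken from the notebook.

```python
import pandas as pd


def group_rare(series, min_share=0.02, other_label="other"):
    """Replace categories whose relative frequency is below `min_share`.

    Hypothetical helper: the name and the 2% default are illustrative.
    """
    shares = series.value_counts(normalize=True)
    rare = shares[shares < min_share].index
    # Keep frequent values unchanged and overwrite rare ones with `other_label`.
    return series.where(~series.isin(rare), other_label)


# Toy data (assumption): one dominant country and a few sparse ones.
countries = pd.Series(
    ["united-states"] * 95 + ["mexico"] * 3 + ["india"] + ["germany"],
    name="native_country",
)

grouped = group_rare(countries, min_share=0.02)
print(grouped.value_counts())
```

Fewer distinct constants per attribute tends to keep the ILP hypothesis space smaller, which is the motivation for grouping here.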
%% Cell type:markdown id:d90f52be tags:
##### 2. Distribution of age and hours_per_week
%% Cell type:code id:8976173c tags:
``` python
# Plotting distributions of binned numerical features
```
1. The `age` variable shows a majority of individuals in the **middle** age range (31–55), followed by **young** (≤30) and fewer in the **senior** category (>55).
2. The `hours-per-week` variable is concentrated in the **average** bin (30–50 hours); very few individuals fall in the **high** or **low** workload groups.
3. This confirms that binning has worked as expected. The `age` categories are reasonably balanced for rule learning in ILP, while `hours-per-week` is dominated by the **average** bin.
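For reference, the bins discussed above could be produced with `pd.cut` as in the sketch below. The `age` edges follow the ranges stated in the text; for hours, only the **average** range (30–50) is stated, so treating **low** as under 30 and **high** as over 50 is an assumption, as are the helper names, the column names, and the toy values. The hour edges also assume whole-number hours.

```python
import numpy as np
import pandas as pd


def bin_age(age):
    """young: <=30, middle: 31-55, senior: >55 (ranges from the text)."""
    return pd.cut(
        age,
        bins=[-np.inf, 30, 55, np.inf],
        labels=["young", "middle", "senior"],
    )


def bin_hours(hours):
    """average: 30-50 (from the text); low (<30) and high (>50) are assumed."""
    return pd.cut(
        hours,
        bins=[-np.inf, 29, 50, np.inf],  # intervals are right-closed
        labels=["low", "average", "high"],
    )


# Toy values (assumption) chosen to sit on the bin boundaries.
toy = pd.DataFrame({
    "age": [22, 30, 31, 55, 56, 70],
    "hours_per_week": [10, 29, 30, 40, 50, 60],
})
toy["age_bin"] = bin_age(toy["age"])
toy["hours_bin"] = bin_hours(toy["hours_per_week"])

# Relative frequency per bin, the quantity inspected in the plots above.
print(toy["age_bin"].value_counts(normalize=True))
print(toy["hours_bin"].value_counts(normalize=True))
```

Checking a few boundary values like this is a quick way to confirm that each edge falls in the intended bin before plotting.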