Adult Income Data Set Analysis with IPython
In this blog post I will show you how to slice and dice the Adult Data Set from the UCI Machine Learning Repository, which contains income data for about 32,000 people. We will look at the data and build a machine learning model (a logistic regression) that tries to predict whether a person will make more than $50K a year, given data like education, gender and marital status.
Let’s first import some libraries that we are going to need for our analysis.
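Something along these lines should do; the exact set of imports depends on the notebook, but these are the libraries the examples below rely on:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
```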
Load the data

First we need to read the data from the file, which contains comma-separated columns. With the command below we will read the data, skipping any spaces before/after the commas, and mark the values ‘?’ as missing data points.
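A sketch of the loading step; the file name adult.data matches the UCI download, and the column names mirror the list below:

```python
data = pd.read_csv(
    "adult.data",
    names=[
        "Age", "Workclass", "fnlwgt", "Education", "Education-Num",
        "Marital Status", "Occupation", "Relationship", "Race", "Sex",
        "Capital Gain", "Capital Loss", "Hours per week", "Country", "Target",
    ],
    skipinitialspace=True,  # drop the spaces around the commas
    na_values="?",          # mark the values '?' as missing data points
)
data.head()
```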
The data frame contains the following columns: Age, Workclass, fnlwgt, Education, Education-Num, Marital Status, Occupation, Relationship, Race, Sex, Capital Gain, Capital Loss, Hours per week, Country and Target.
Analyze the data
Let’s plot the distribution of each feature, so that we have a better understanding of what we have in our data. We draw a bar chart of the value counts for each categorical feature and a histogram of the values for each continuous feature.
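One way to produce those plots (a sketch, assuming the data frame from above):

```python
fig = plt.figure(figsize=(20, 15))
cols = 5
rows = int(np.ceil(float(data.shape[1]) / cols))
for i, column in enumerate(data.columns):
    ax = fig.add_subplot(rows, cols, i + 1)
    ax.set_title(column)
    if data.dtypes[column] == object:
        # Categorical feature: bar chart of the value counts
        data[column].value_counts().plot(kind="bar", ax=ax)
    else:
        # Continuous feature: histogram of the values
        data[column].hist(ax=ax)
plt.subplots_adjust(hspace=0.7, wspace=0.2)
plt.show()
```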
As you can see from the plots above, our data is concentrated mostly in the USA, and the people in it are mostly white and male. This is a good thing to notice, as it may impact the conclusions we come to later.
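We can check the exact proportions with a quick value count (a sketch, reusing the data frame from above):

```python
# What fraction of the samples comes from each country?
data["Country"].value_counts(normalize=True).head()
```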
United-States    0.895857
Mexico           0.019748
Philippines      0.006081
Germany          0.004207
Canada           0.003716
dtype: float64
Indeed! 89% of the samples are for people from the US. Mexico comes next with less than 2%.
Now, let’s explore something else: the correlation between the different features. Generally it is not a good idea to have many correlated features, as it might be a sign that your data is not very good. For this purpose we will need to encode the categorical features as numbers, which can be done with the LabelEncoder from scikit-learn.
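A sketch of such an encoding helper (the name number_encode_features is mine, not necessarily the one from the original notebook), followed by a correlation heatmap:

```python
def number_encode_features(df):
    """Label-encode every categorical column and return the encoded copy
    together with the fitted encoders, so the mapping can be inverted later."""
    result = df.copy()
    encoders = {}
    for column in result.columns:
        if result.dtypes[column] == object:
            encoders[column] = LabelEncoder()
            # astype(str) turns missing values into their own 'nan' category
            result[column] = encoders[column].fit_transform(result[column].astype(str))
    return result, encoders

encoded_data, _ = number_encode_features(data)
sns.heatmap(encoded_data.corr(), square=True)
plt.show()
```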
We see there is a high correlation between Education and Education-Num. Let’s look at these columns.
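For example (a sketch):

```python
# Compare the two representations side by side
data[["Education", "Education-Num"]].head(15)
```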
As you can see, these two columns actually represent the same feature, once encoded as strings and once as numbers. We don’t need the string representation, so we can just delete that column. Note that it is a much better option to delete the Education column, as Education-Num has the important property that its values are ordered: the higher the number, the higher the education that person has. This is valuable information a machine learning algorithm can use.
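Dropping the redundant column is a one-liner:

```python
# Keep the ordered numeric representation, drop the string one
del data["Education"]
```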
So it seems that the data is mostly OK, with the exception of Sex and Relationship, which seem to be negatively correlated. Let’s explore that for a bit.
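A quick cross-tabulation makes the relationship visible (a sketch):

```python
# How do the values of Sex and Relationship co-occur?
pd.crosstab(data["Sex"], data["Relationship"])
```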
Yes, the data looks correlated because, for example, Male and Husband are highly correlated values, as are Female and Wife. There is no easy way to tackle this problem, so let’s carry on.
Build a classifier
Now that we have explored our data, let’s try to build a classifier that predicts whether a given person makes more than $50K a year, based on the features we have in our dataset.
First we need to encode the features as numbers, as the classifiers cannot work with string features. As we saw a while ago, this can be achieved easily with the function we defined earlier. Let’s encode the data and show the histograms of the values again.
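Reusing the helper sketched earlier:

```python
encoded_data, encoders = number_encode_features(data)
encoded_data.hist(figsize=(14, 12))
plt.show()
```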
As you can see, we have our data properly encoded and it seems to make sense. Now, let’s try to build a classifier for it. Before we do that, let’s split the data into a train and a test set. This is a common approach to avoid overfitting: if we train and test the classifier on the same data, we will always get awesome results and will most probably overfit the model, but if we test the classifier on data it has never seen, we can be more confident it will perform well when run on new data.
Split and scale the features
Most machine learning algorithms like the features to be scaled to mean 0 and variance 1. This is called “removing the mean and scaling to unit variance”. It can be done easily with the StandardScaler from scikit-learn. Let’s scale the features and look at them again.
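A sketch of the split, the scaling and a first logistic regression; the test size and random_state are my choices, so the exact score you get may differ slightly from the one reported below:

```python
X = encoded_data.drop("Target", axis=1)
y = encoded_data["Target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = StandardScaler()
# Fit the scaler on the training data only and reuse its statistics on the
# test data, so no information leaks from the test set into the model.
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X.columns)

X_train.hist(figsize=(14, 12))  # the scaled features, for a second look
plt.show()

classifier = LogisticRegression()
classifier.fit(X_train, y_train)
print("F1 score: %f" % f1_score(y_test, classifier.predict(X_test)))
```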
F1 score: 0.573306
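To see which feature pushes the prediction in which direction, we can look at the learned coefficients (a sketch; positive weights push towards an income above $50K):

```python
coefs = pd.Series(classifier.coef_[0], index=X.columns)
coefs.sort_values().plot(kind="bar", figsize=(12, 6))
plt.show()
```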
As you can see, we managed to achieve an F1 score of 0.57, and the features that seem to contribute most positively to having an income of more than $50K include Sex, while the features that contribute most negatively include Marital Status and Relationship. There is a problem here, though. A feature like Marital Status has values ranging from 0 to 6, and the model treats that order as meaningful, while in practice there is no particular order among its categories (unlike Education-Num, for which the higher the number, the higher the education). We can fix this using binary features.
Classify using binary features
As a last step we can try to improve our classifier using binary attributes. Our current approach for encoding the data has the drawback that it puts an arbitrary order on our classes. For example, we encode Relationship with a number between 0 and 5, and the logistic regression interprets these values as continuous variables and plugs them into its optimization function. This will cause different classes to have different weights in our model, which is not correct: each class should in theory be weighted equally compared to the rest of the classes. In order to fix this we can use dummy variables.
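Pandas can produce the dummy variables directly (a sketch; the exact label strings “>50K”/“<=50K” in the raw file are an assumption here):

```python
# One-hot encode the categorical features; the target stays a single
# binary column instead of being split into two dummy columns.
binary_data = pd.get_dummies(data.drop("Target", axis=1))
binary_data["Target"] = (data["Target"] == ">50K").astype(int)
binary_data.head()
```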
Now we have a bunch of features that have only the values 0 and 1. There is a lot of correlation between some of them, but let’s not look at this for now (for example, the Male and Female columns are perfectly negatively correlated).
Logistic Regression with dummy variables
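The same split-scale-fit procedure as before, now on the dummy-encoded data (again a sketch with my choice of parameters):

```python
X = binary_data.drop("Target", axis=1)
y = binary_data["Target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# With many more (dummy) features the solver may need extra iterations
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_train, y_train)
print("F1 score: %f" % f1_score(y_test, classifier.predict(X_test)))
```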
F1 score: 0.651455
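And the same coefficient inspection as before, now over the dummy columns:

```python
coefs = pd.Series(classifier.coef_[0], index=X.columns)
coefs.sort_values().plot(kind="bar", figsize=(16, 6))
plt.show()
```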
We managed to improve the F1 score significantly by converting the data to dummy variables. It also seems that we managed to uncover some interesting insights from our model: among the features that impact the income of a person most positively is Exec-managerial, while the ones that impact it most negatively include Divorced and, unfortunately, Female. One more piece of evidence of the gender inequality in our society.
As you can see, we not only managed to build a machine learning model that we can use to classify new data, but we also managed to uncover some interesting insights along the way. This is one of the nice features of linear models: they are not “black boxes” like, for example, neural networks, and they let us see exactly what the model is doing.
You can download the IPython notebook that was used to generate the above analysis from here.