Correlation and Feature Engineering Activities
Ref:- http://blog.datadive.net/selecting-good-features-part-i-univariate-selection/
We start by looking at Pearson's correlation.
Pearson's correlation measures the linear association between two continuous variables.
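To make the definition concrete, here is a minimal sketch that computes Pearson's r directly from its formula: the covariance of the two variables divided by the product of their standard deviations. The function name pearson_r and the sample numbers are made up for illustration; the rest of this post uses scipy.stats.pearsonr instead.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r for two equal-length sequences of numbers (illustrative helper)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Deviations of each value from its own mean.
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    # The numerator is proportional to the covariance, the denominator to the
    # product of the standard deviations; the 1/n factors cancel out.
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denominator == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return numerator / denominator


# Made-up values, only to show the call.
ages = [22, 38, 26, 35, 54]
fares = [7.25, 71.28, 7.92, 53.10, 51.86]
print(pearson_r(ages, fares))
```

A value near 1 or -1 means the points lie close to a straight line; a value near 0 means there is no linear trend, which is not the same as no relationship at all.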
The dataset I used is the Titanic survival dataset from https://www.kaggle.com/c/titanic/data
The following code reads the CSV file:
import pandas as pd

def read_csv(file_name, delimiter):
    # Read a delimited text file into a pandas DataFrame.
    return pd.read_csv(file_name, quotechar='"', delimiter=delimiter, skipinitialspace=True)

def read_csv_as_numpy(file_name, delimiter):
    # Same as read_csv, but return the values as a NumPy array.
    data_frame = pd.read_csv(file_name, quotechar='"', delimiter=delimiter, skipinitialspace=True)
    return data_frame.values
The following code computes Pearson's correlation:
import sys

from scipy.stats import pearsonr
from python_utilities.read_input_pandas import read_csv

file_name = sys.argv[1]
data_frame = read_csv(file_name, ',')

# For Sex vs Survived comparison
#x1 = data_frame['Sex'].replace('male', 0)
#x1 = x1.replace('female', 1)
#x1 = x1.values
#x1_name = 'Sex'

# For Age vs Survived
# Note: dropna() with no arguments drops every row that has a missing value
# in any column, not only in Age.
data_frame = data_frame.dropna()
x1 = data_frame['Age'].values
x1_name = 'Age'
x2 = data_frame['Survived'].values

# Print the column names rather than the full arrays.
print("Comparing between {} and {}".format(x1_name, 'Survived'))
print(pearsonr(x2, x1))
Comparing the results (each tuple is the correlation coefficient followed by the p-value)
Sex vs Survived correlation
(0.54335138065775535, 1.4060661308794371e-69)
Age vs Survived correlation
(-0.2540847542030531, 0.00051895033078816846)
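One thing worth checking for the Age comparison: the code calls dropna() with no arguments, which removes a row if any column is missing, not just Age. The sketch below uses a tiny made-up frame (not the real Titanic data) to show how that differs from dropna(subset=["Age"]); the column names mirror the dataset, everything else is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame with Titanic-like column names; not real data.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Age": [22.0, 38.0, np.nan, 35.0, 54.0, 2.0],
    "Cabin": [np.nan, "C85", np.nan, "C123", "E46", np.nan],
})

# Drops a row if ANY column is missing (Age or Cabin here).
all_columns = toy.dropna()

# Drops a row only if Age is missing.
age_only = toy.dropna(subset=["Age"])

print("rows kept by dropna():", len(all_columns))
print("rows kept by dropna(subset=['Age']):", len(age_only))

# The correlation can differ because it is computed on different rows.
print(all_columns["Age"].corr(all_columns["Survived"]))
print(age_only["Age"].corr(age_only["Survived"]))
```

If a column such as Cabin is mostly empty, the first form can silently shrink the sample, so the correlation and its p-value describe only the passengers who happen to have every field filled in.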
I have not implemented Distance Correlation (available in Python as a gist) or the Maximal Information Coefficient (available in Python through the minepy package). Both are improvements over Pearson's correlation: Pearson's does not capture a dependency when that dependency is non-linear, whereas these two methods do.
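To illustrate the non-linear case, here is a rough sketch of sample distance correlation written with NumPy only. It is not the gist or the minepy code mentioned above; the function name distance_correlation and the test data (y = x squared on a symmetric range) are assumptions chosen for the example.

```python
import numpy as np


def distance_correlation(x, y):
    """Sample distance correlation of two 1-D arrays (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 1 or x.shape != y.shape:
        raise ValueError("x and y must be 1-D arrays of the same length")

    def centered_distances(v):
        # Pairwise absolute differences, then double-centering:
        # subtract row and column means, add back the grand mean.
        d = np.abs(v[:, None] - v[None, :])
        return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()

    a = centered_distances(x)
    b = centered_distances(y)
    dcov2 = (a * b).mean()                        # squared distance covariance
    denom = np.sqrt((a * a).mean() * (b * b).mean())
    if denom == 0:
        return 0.0                                # one input is constant
    return float(np.sqrt(max(dcov2, 0.0) / denom))


# A purely non-linear dependency: y is fully determined by x.
x = np.linspace(-1.0, 1.0, 201)
y = x ** 2

pearson = float(np.corrcoef(x, y)[0, 1])
dcor = distance_correlation(x, y)
print("Pearson:", pearson)
print("Distance correlation:", dcor)
```

On this data Pearson's r is essentially zero because the relationship is symmetric around x = 0, while the distance correlation is clearly above zero.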
I also implemented a Random Forest Classifier.
The code is on GitHub.
The output is:
[(-0.371, 'Pclass'), (-0.42, 'Fare'), (-0.565, 'Parch'), (-0.735, 'SibSp'), (-0.829, 'Age')]
I am not sure this is correct, because when I check Fare vs Survived with Pearson's correlation I get the following:
(0.25730652238496238, 6.1201893419215696e-15)
This looks more logical, as I can see from the dataset that as the fare decreases, the number of people who survived also decreases.
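The random forest code itself is not shown here, so the following is only a guess at how a list shaped like the one above could be produced. It assumes the approach from the referenced article: fit a small random forest on one feature at a time and report a cross-validated score. The R² scoring, the synthetic data, and the feature names are all assumptions. Scores like -0.42 cannot be accuracies, but they would be consistent with R², which goes negative when a model predicts worse than always guessing the mean of the target.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic stand-in data: one feature that determines the label, one pure noise.
rng = np.random.default_rng(0)
n = 300
informative = rng.normal(size=n)
noise = rng.normal(size=n)
y = (informative > 0).astype(int)

X = np.column_stack([informative, noise])
names = ["informative", "noise"]

model = RandomForestClassifier(n_estimators=20, max_depth=4, random_state=0)
cv = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0)

scores = []
for i, name in enumerate(names):
    # Score each feature on its own; X[:, i:i + 1] keeps the 2-D shape.
    fold_scores = cross_val_score(model, X[:, i:i + 1], y, scoring="r2", cv=cv)
    scores.append((round(float(np.mean(fold_scores)), 3), name))

# Highest score first, same layout as the output above.
scores.sort(reverse=True)
print(scores)
```

Under these assumptions the score is on a different scale from Pearson's r, so a negative value for Fare would not mean that Fare and Survived are negatively related; it would only mean the single-feature model predicted poorly.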