Correlation and Feature Engineering Activities

Ref: http://blog.datadive.net/selecting-good-features-part-i-univariate-selection/

We start by looking at Pearson's correlation.

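Pearson's correlation measures the strength and direction of the linear association between two continuous variables. It ranges from -1 (perfect negative linear relationship) to +1 (perfect positive linear relationship), with 0 meaning no linear relationship.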
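To make the definition concrete, here is a minimal sketch that computes the coefficient directly from its formula (covariance divided by the product of the standard deviations). The function name pearson_r and the toy lists are made up for illustration; the scripts in this post use scipy.stats.pearsonr instead.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy data (hypothetical): one perfectly rising and one perfectly falling line.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
rising = [2 * x + 1 for x in xs]
falling = [10 - 3 * x for x in xs]

print(pearson_r(xs, rising))   # close to +1
print(pearson_r(xs, falling))  # close to -1
```

The normalising factors of 1/n cancel between numerator and denominator, which is why the sketch works with plain sums.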

The dataset I used is the Titanic survival dataset, available at https://www.kaggle.com/c/titanic/data

The following code reads the CSV file:

import pandas as pd

def read_csv(file_name, delimiter):
    # Read a delimited text file into a pandas DataFrame.
    return pd.read_csv(file_name, quotechar='"', delimiter=delimiter, skipinitialspace=True)

def read_csv_as_numpy(file_name, delimiter):
    # Same as read_csv, but return the underlying NumPy array.
    data_frame = pd.read_csv(file_name, quotechar='"', delimiter=delimiter, skipinitialspace=True)
    return data_frame.values

The following code computes Pearson's correlation:

from scipy.stats import pearsonr
from python_utilities.read_input_pandas import read_csv
import sys

file_name = sys.argv[1]

data_frame = read_csv(file_name,',')

# For Sex vs Survived comparison
#x1=data_frame['Sex'].replace('male',0)
#x1=x1.replace('female',1)
#x1=x1.values

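# For Age vs Survived
# Note: dropna() with no arguments drops rows that have a missing value in any column, not only Age.
data_frame = data_frame.dropna()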
x1 = data_frame['Age'].values
x2 = data_frame['Survived'].values

# Print the column names rather than dumping both full arrays.
print("Comparing between {} and {}".format('Age', 'Survived'))
print(pearsonr(x2,x1))
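The script above calls dropna() with no arguments, so a row is discarded when any column is empty, even a column that is not part of the comparison. A minimal sketch of restricting the drop to the two columns being compared, assuming pandas is installed; the rows below are toy values invented for illustration, not taken from the Titanic file:

```python
import pandas as pd

# Toy rows (hypothetical values): Age and Cabin both contain gaps.
toy = pd.DataFrame({
    "Age": [22, 38, None, 35, 54, 2],
    "Cabin": [None, "C85", None, "C123", "E46", None],
    "Survived": [0, 1, 1, 1, 0, 0],
})

# Drops every row with a gap in any column, including Cabin.
all_complete = toy.dropna()

# Drops only rows with a gap in the columns we actually compare.
age_complete = toy.dropna(subset=["Age", "Survived"])

print(len(all_complete), len(age_complete))

# Series.corr uses Pearson's method by default.
r = age_complete["Age"].corr(age_complete["Survived"])
print(r)
```

Which variant is appropriate depends on the analysis; the point is only that the two choices can leave very different numbers of rows, which affects both the coefficient and its p-value.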

Comparing the results (each pair is the correlation coefficient followed by the two-sided p-value)

Sex vs Survived correlation

(0.54335138065775535, 1.4060661308794371e-69)

Age vs Survived correlation

(-0.2540847542030531, 0.00051895033078816846)

I am not implementing Distance Correlation (a Python implementation is available as a gist) or the Maximal Information Coefficient (available in Python through the minepy package) here. Both are improvements over Pearson's correlation: Pearson's does not pick up a dependency when it is non-linear, whereas these two methods do.
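To illustrate the limitation, here is a small sketch with y = x^2 on a symmetric range, where y depends completely on x but Pearson's coefficient comes out near zero. The distance correlation function is a plain-Python illustration of the standard sample formula (double-centred pairwise distance matrices); it is not the gist mentioned above, and all function names are made up for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def _double_centered(values):
    """Pairwise distance matrix with row, column and grand means removed."""
    n = len(values)
    dist = [[abs(a - b) for b in values] for a in values]
    # The matrix is symmetric, so row means equal column means.
    row_mean = [sum(row) / n for row in dist]
    grand_mean = sum(row_mean) / n
    return [[dist[i][j] - row_mean[i] - row_mean[j] + grand_mean
             for j in range(n)] for i in range(n)]

def _mean_product(a, b):
    """Mean of the element-wise product of two square matrices."""
    n = len(a)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)

def distance_correlation(xs, ys):
    """Sample distance correlation of two equal-length 1-D sequences."""
    a = _double_centered(xs)
    b = _double_centered(ys)
    dcov2 = _mean_product(a, b)
    dvar2 = math.sqrt(_mean_product(a, a) * _mean_product(b, b))
    if dvar2 == 0:
        return 0.0
    return math.sqrt(max(dcov2, 0.0) / dvar2)

# 101 evenly spaced points in [-1, 1] and their squares.
xs = [i / 50 - 1 for i in range(101)]
ys = [x * x for x in xs]

print(pearson_r(xs, ys))             # close to 0: no linear relationship
print(distance_correlation(xs, ys))  # clearly above 0: dependence detected
```

This version is quadratic in the number of points, so it is only meant for small illustrations like this one.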

Implemented Random Forest Classifier

The code is on GitHub.

The output is:

[(-0.371, 'Pclass'), (-0.42, 'Fare'), (-0.565, 'Parch'), (-0.735, 'SibSp'), (-0.829, 'Age')]

I am not sure this is correct, because when I check Fare vs Survived with Pearson's correlation I get:

(0.25730652238496238, 6.1201893419215696e-15)

This looks more logical, since I can see from the dataset that as the fare decreases, the number of people who survived also decreases.
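The random forest code itself is not reproduced in this post, so the following is only a sketch of one way to cross-check a ranking like this, not the code that produced the output above. It scores each feature on its own with a random forest classifier and a classification metric (ROC AUC), where 0.5 means "no better than chance". It assumes scikit-learn and NumPy are installed; the data is synthetic and the feature names "informative" and "noise" are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data (assumption for illustration): one feature drives the label,
# the other is pure noise.
rng = np.random.RandomState(0)
n = 400
informative = rng.normal(size=n)
noise = rng.normal(size=n)
y = (informative + 0.5 * rng.normal(size=n) > 0).astype(int)
X = np.column_stack([informative, noise])
names = ["informative", "noise"]

def univariate_scores(X, y, names):
    """Cross-validated ROC AUC of a random forest trained on each feature alone."""
    scores = []
    for i, name in enumerate(names):
        model = RandomForestClassifier(n_estimators=50, max_depth=4, random_state=0)
        auc = cross_val_score(model, X[:, i:i + 1], y, scoring="roc_auc", cv=5)
        scores.append((round(float(auc.mean()), 3), name))
    # Highest-scoring feature first.
    return sorted(scores, reverse=True)

scores = univariate_scores(X, y, names)
print(scores)
```

With a metric like this, a feature that carries signal should score well above 0.5 and a noise feature close to 0.5, which makes the ranking easier to compare against the Pearson results.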
