Correlation and Feature Engineering Activities

Ref: http://blog.datadive.net/selecting-good-features-part-i-univariate-selection/

We start by looking into Pearson's correlation.

Pearson's correlation measures the linear association between two continuous variables.
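As a quick illustration of what the coefficient measures, here is a minimal sketch that computes Pearson's r by hand (covariance divided by the product of the standard deviations) and compares it with `scipy.stats.pearsonr`. The helper name `pearson_r` and the sample values are made up for this example and do not come from the Titanic data.

```python
import numpy as np
from scipy.stats import pearsonr

def pearson_r(x, y):
    """Pearson's r: covariance of x and y over the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Made-up, nearly linear data (illustration only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]

r_manual = pearson_r(x, y)
r_scipy, p_value = pearsonr(x, y)  # (correlation coefficient, p-value)
print(r_manual, r_scipy, p_value)
```

The two coefficients should agree up to floating-point error; the second value returned by `pearsonr` is the p-value, which is the second number in the result tuples shown on this page.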

The dataset I used is the Titanic survival dataset, available at https://www.kaggle.com/c/titanic/data

The following code reads the CSV file:

import pandas as pd

def read_csv(file_name, delimiter):
    # Read a delimited file into a pandas DataFrame
    return pd.read_csv(file_name, quotechar='"', delimiter=delimiter, skipinitialspace=True)

def read_csv_as_numpy(file_name, delimiter):
    # Same as read_csv, but return the underlying NumPy array
    data_frame = read_csv(file_name, delimiter)
    return data_frame.values

The following code computes Pearson's correlation:

from scipy.stats import pearsonr
from python_utilities.read_input_pandas import read_csv
import sys

file_name = sys.argv[1]

data_frame = read_csv(file_name,',')

# For Sex vs Survived comparison
#x1=data_frame['Sex'].replace('male',0)
#x1=x1.replace('female',1)
#x1=x1.values

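# For Age vs Survived
# Note: dropna() with no arguments drops rows with a missing value in any column
data_frame = data_frame.dropna()
x1 = data_frame['Age'].values
x2 = data_frame['Survived'].values

# Print the column names being compared, not the raw arrays
print("Comparing between {} and {}".format('Age', 'Survived'))
# pearsonr returns (correlation coefficient, p-value)
print(pearsonr(x2, x1))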

Comparing the results

Sex vs Survived correlation

(0.54335138065775535, 1.4060661308794371e-69)

Age vs Survived correlation

(-0.2540847542030531, 0.00051895033078816846)
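One thing worth checking in the Age vs Survived run: `dropna()` with no arguments drops every row that has a missing value in any column, not only in Age. The sketch below uses a small made-up frame (the values are illustrative and not taken from the Titanic file) to show how that differs from restricting the check to the two columns being compared.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; not taken from the Titanic file
df = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, 54.0],
    "Survived": [0, 1, 1, 1, 0],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
})

# Drops any row that has a NaN in any column (including Cabin)
all_columns = df.dropna()

# Drops a row only when Age or Survived is missing
pair_only = df.dropna(subset=["Age", "Survived"])

print(len(all_columns), len(pair_only))
```

If other columns have many gaps, the two approaches can leave quite different numbers of rows, which in turn can change both the coefficient and the p-value.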

I have not implemented Distance Correlation (a Python implementation is available as a gist) or the Maximal Information Coefficient (available in Python through the minepy package). Both improve on Pearson's correlation: Pearson's only captures linear dependence, whereas these two methods also detect non-linear dependence.
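To make the linear vs non-linear point concrete, here is a rough sketch on synthetic data. It is not the gist or the minepy code mentioned above; `distance_correlation` is a hypothetical helper written for this example, using the standard sample formula with double-centred pairwise distance matrices.

```python
import numpy as np
from scipy.stats import pearsonr

def distance_correlation(x, y):
    """Sample distance correlation of two 1-D arrays (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Pairwise absolute differences within each variable
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    # Double-centre each distance matrix
    A = a - a.mean(axis=0, keepdims=True) - a.mean(axis=1, keepdims=True) + a.mean()
    B = b - b.mean(axis=0, keepdims=True) - b.mean(axis=1, keepdims=True) + b.mean()
    dcov2_xy = max((A * B).mean(), 0.0)  # guard against tiny negative rounding
    dvar2_x = (A * A).mean()
    dvar2_y = (B * B).mean()
    return float(np.sqrt(dcov2_xy / np.sqrt(dvar2_x * dvar2_y)))

# Synthetic data: y depends on x completely, but not linearly
x = np.linspace(-1, 1, 201)
y = x ** 2

r_quadratic = pearsonr(x, y)[0]
dcor_quadratic = distance_correlation(x, y)
dcor_linear = distance_correlation(x, 2 * x + 1)

print(r_quadratic, dcor_quadratic, dcor_linear)
```

On this symmetric example Pearson's r comes out at essentially zero even though y is fully determined by x, while the distance correlation is clearly above zero.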

Implemented Random Forest Classifier

The code is on GitHub.

The output is

[(-0.371, 'Pclass'), (-0.42, 'Fare'), (-0.565, 'Parch'), (-0.735, 'SibSp'), (-0.829, 'Age')]

I am not sure this is correct, because Pearson's correlation for Fare vs Survived gives the following:

(0.25730652238496238, 6.1201893419215696e-15)

This looks more logical, since the dataset shows that as the fare decreases, the number of people who survived also decreases.
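One way to sanity-check a feature ranking is to run the same procedure on data where the answer is known. The sketch below is not the code from the GitHub link; it assumes scikit-learn's `RandomForestClassifier` and its impurity-based `feature_importances_`, and the feature names `signal` and `noise` are hypothetical, synthetic columns.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
n = 500
signal = rng.normal(size=n)   # hypothetical informative feature
noise = rng.normal(size=n)    # hypothetical uninformative feature
X = np.column_stack([signal, noise])
y = (signal > 0).astype(int)  # label depends only on `signal`

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

names = ["signal", "noise"]
# Sort (importance, name) pairs from most to least important
ranking = sorted(zip(model.feature_importances_.round(3), names), reverse=True)
print(ranking)
```

Impurity-based importances from scikit-learn are non-negative and sum to 1, so negative scores like the ones in the output above would have to come from a different scoring scheme, for example one that compares accuracy before and after shuffling a feature. In that kind of scheme the sign convention matters when reading the ranking.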
