GitHub - jamjavad/Titanic-Project: Titanic data analysis project for survival prediction using Python

Titanic Survival Prediction

Machine Learning project to predict passenger survival on the Titanic

import pandas as pd
import numpy as np

train = pd.read_csv('../input/train.csv',index_col=0)
test  = pd.read_csv('../input/test.csv')

train.isnull().sum()
print('Train Shape:', train.shape)
test.isnull().sum()
print('Test Shape:', test.shape)

Train Shape: (891, 11)
Test Shape: (418, 11)

train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       714 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Ticket    891 non-null    object 
 8   Fare      891 non-null    float64
 9   Cabin     204 non-null    object 
 10  Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB

test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.1+ KB

Data Dictionary

Survived: 0 = No, 1 = Yes
Pclass: Ticket class, 1 = 1st, 2 = 2nd, 3 = 3rd
SibSp: number of siblings / spouses aboard the Titanic
Parch: number of parents / children aboard the Titanic
Ticket: Ticket number
Cabin: Cabin number
Embarked: Port of embarkation, C = Cherbourg, Q = Queenstown, S = Southampton

Total rows and columns

The training dataset has 891 rows and 11 columns. PassengerId is read as the index (index_col=0), so it is not counted as a column.
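
The column count reported by shape depends on how the file is read. The sketch below uses a made-up three-row CSV (not the real train.csv) to show how `index_col=0` moves `PassengerId` out of the columns:

```python
import io
import pandas as pd

# Tiny stand-in for train.csv; the rows are illustrative only.
csv_text = """PassengerId,Survived,Pclass
1,0,3
2,1,1
3,1,3
"""

as_column = pd.read_csv(io.StringIO(csv_text))
as_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(as_column.shape)  # PassengerId counted as a column
print(as_index.shape)   # PassengerId moved to the index
```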

train.head()

	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
PassengerId
1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S

train.describe()

	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

test.describe()

	PassengerId	Pclass	Age	SibSp	Parch	Fare
count	418.000000	418.000000	332.000000	418.000000	418.000000	417.000000
mean	1100.500000	2.265550	30.272590	0.447368	0.392344	35.627188
std	120.810458	0.841838	14.181209	0.896760	0.981429	55.907576
min	892.000000	1.000000	0.170000	0.000000	0.000000	0.000000
25%	996.250000	1.000000	21.000000	0.000000	0.000000	7.895800
50%	1100.500000	3.000000	27.000000	0.000000	0.000000	14.454200
75%	1204.750000	3.000000	39.000000	1.000000	0.000000	31.500000
max	1309.000000	3.000000	76.000000	8.000000	9.000000	512.329200

train.isnull().sum()

Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      2
dtype: int64
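
Raw counts are easier to judge as a share of the rows. A small sketch, using only the three non-zero counts printed above and the 891 training rows:

```python
import pandas as pd

# Missing-value counts copied from the train.isnull().sum() output above.
missing = pd.Series({"Age": 177, "Cabin": 687, "Embarked": 2})
n_rows = 891  # number of rows in the training set

# Percentage of rows with a missing value, rounded to one decimal.
missing_pct = (missing / n_rows * 100).round(1)
print(missing_pct)
```

On the real DataFrame, `train.isnull().mean() * 100` should give the same percentages for every column.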

test.isnull().sum()
test['Survived']=''
test.head()

	PassengerId	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	892	3	Kelly, Mr. James	male	34.5	0	0	330911	7.8292	NaN	Q
1	893	3	Wilkes, Mrs. James (Ellen Needs)	female	47.0	1	0	363272	7.0000	NaN	S
2	894	2	Myles, Mr. Thomas Francis	male	62.0	0	0	240276	9.6875	NaN	Q
3	895	3	Wirz, Mr. Albert	male	27.0	0	0	315154	8.6625	NaN	S
4	896	3	Hirvonen, Mrs. Alexander (Helga E Lindqvist)	female	22.0	1	1	3101298	12.2875	NaN	S

Data Visualization using Matplotlib and Seaborn packages

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

Bar Chart for Categorical Features

Pclass
Sex
SibSp
Parch
Embarked
Cabin

def bar_chart(feature):
    
    #calculate data 
    Survived = train[train['Survived'] ==1][feature].value_counts()
    dead = train[train['Survived']==0][feature].value_counts()
    
    #Display number
    print(f'Survived:\n{Survived}')
    print(f'Dead:\n{dead}')
    
    #Create and display chart
    df = pd.DataFrame([Survived,dead])
    df.index = ['Survived','Dead']
    df.plot(kind='bar' ,stacked=True , figsize=(10,5))
    plt.xticks(rotation=45)

bar_chart('Sex')

Survived:
Sex
female    233
male      109
Name: count, dtype: int64
Dead:
Sex
male      468
female     81
Name: count, dtype: int64

The chart confirms that women were more likely to survive than men.

bar_chart('Pclass')

Survived:
Pclass
1    136
3    119
2     87
Name: count, dtype: int64
Dead:
Pclass
3    372
2     97
1     80
Name: count, dtype: int64

The chart confirms that 1st class passengers were more likely to survive than those in other classes.

The chart confirms that 3rd class passengers were more likely to die than those in other classes.

bar_chart('SibSp')

Survived:
SibSp
0    210
1    112
2     13
3      4
4      3
Name: count, dtype: int64
Dead:
SibSp
0    398
1     97
4     15
2     15
3     12
8      7
5      5
Name: count, dtype: int64

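The chart shows that passengers travelling with one sibling or spouse were slightly more likely to survive than not (112 survived, 97 died), while most of those with three or more did not survive.

The chart confirms that passengers travelling without siblings or a spouse were more likely to die (210 survived, 398 died).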

bar_chart('Parch')

Survived:
Parch
0    233
1     65
2     40
3      3
5      1
Name: count, dtype: int64
Dead:
Parch
0    445
1     53
2     40
5      4
4      4
3      2
6      1
Name: count, dtype: int64

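The chart shows that passengers travelling with one parent or child were slightly more likely to survive than not (65 survived, 53 died). With two the split was even (40 and 40), and larger values are too rare to draw conclusions from.

The chart confirms that passengers travelling without parents or children were more likely to die (233 survived, 445 died).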

bar_chart('Embarked')

Survived:
Embarked
S    217
C     93
Q     30
Name: count, dtype: int64
Dead:
Embarked
S    427
C     75
Q     47
Name: count, dtype: int64

The chart confirms that passengers who embarked at C were slightly more likely to survive.

The chart confirms that passengers who embarked at Q were more likely to die.

The chart confirms that passengers who embarked at S were more likely to die.
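
Statements about who was "more likely" to survive are about rates rather than raw counts. A minimal sketch that turns the Embarked counts printed above into survival rates (the variable names are illustrative):

```python
import pandas as pd

# Counts copied from the bar_chart('Embarked') output above.
survived = pd.Series({"S": 217, "C": 93, "Q": 30})
dead = pd.Series({"S": 427, "C": 75, "Q": 47})

# Share of passengers from each port who survived.
survival_rate = (survived / (survived + dead)).round(3)
print(survival_rate)
```

On the real DataFrame, `train.groupby('Embarked')['Survived'].mean()` should give the same rates without building the two count series by hand.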

4. Feature engineering

Feature engineering is the process of using domain knowledge of the data to create features (feature vectors) that make machine learning algorithms work.

A feature vector is an n-dimensional vector of numerical features that represents some object. Many algorithms in machine learning require a numerical representation of objects, since such representations facilitate processing and statistical analysis.
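
As a small illustration of the idea, the sketch below turns one passenger record into a numeric vector. The record values and the two code tables are assumptions made for the example:

```python
import numpy as np

# One passenger described with text and numbers (values are illustrative).
passenger = {"Pclass": 3, "Sex": "male", "Embarked": "S", "Fare": 7.25}

# Assumed encodings for the text fields.
sex_codes = {"male": 0, "female": 1}
port_codes = {"S": 0, "C": 1, "Q": 2}

feature_vector = np.array([
    passenger["Pclass"],
    sex_codes[passenger["Sex"]],
    port_codes[passenger["Embarked"]],
    passenger["Fare"],
], dtype=float)

print(feature_vector)  # a 4-dimensional numeric vector
```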

train.head(10)

	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
PassengerId
1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
6	0	3	Moran, Mr. James	male	NaN	0	0	330877	8.4583	NaN	Q
7	0	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46	S
8	0	3	Palsson, Master. Gosta Leonard	male	2.0	3	1	349909	21.0750	NaN	S
9	1	3	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27.0	0	2	347742	11.1333	NaN	S
10	1	2	Nasser, Mrs. Nicholas (Adele Achem)	female	14.0	1	0	237736	30.0708	NaN	C

# Collect the training and test DataFrames in a list
# so that each transformation can be applied to both
train_test_data = [train,test]

# Extract Titles from Names 
for dataset in train_test_data:
    dataset['Title'] = dataset['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

# Calculate and display the count of each title in the 'Title' column of the train DataFrame
train['Title'].value_counts()

Title
Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Mlle          2
Major         2
Col           2
Countess      1
Capt          1
Ms            1
Sir           1
Lady          1
Mme           1
Don           1
Jonkheer      1
Name: count, dtype: int64
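
To see what the regular expression captures, here is a short sketch on three names taken from the training rows shown earlier. The pattern looks for a space, a run of letters, and a literal dot:

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
])

# A space, then a run of letters, then a literal dot: the letters are the title.
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)
print(titles.tolist())
```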

# Calculate and display the count of each title in the 'Title' column of the test DataFrame
test['Title'].value_counts()

Title
Mr        240
Miss       78
Mrs        72
Master     21
Col         2
Rev         2
Ms          1
Dr          1
Dona        1
Name: count, dtype: int64

Title Map

Mr: 0
Miss: 1
Mrs: 2
Others: 3

title_mapping = {"Mr": 0, "Miss": 1, "Mrs": 2, 
                 "Master": 3, "Dr": 3, "Rev": 3, "Col": 3, "Major": 3, "Mlle": 3,"Countess": 3,
                 "Ms": 3, "Lady": 3, "Jonkheer": 3, "Don": 3, "Dona" : 3, "Mme": 3,"Capt": 3,"Sir": 3 }

for dataset in train_test_data:
    dataset['Title'] = dataset["Title"].map(title_mapping)
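
One property of `Series.map` with a dictionary is worth keeping in mind: any value missing from the dictionary becomes NaN. The sketch below uses a shortened mapping and a made-up title, "Unseen", to show this, along with one possible fallback to the "Others" code:

```python
import pandas as pd

titles = pd.Series(["Mr", "Miss", "Mrs", "Master", "Dr", "Unseen"])

# Shortened, illustrative version of the mapping; "Unseen" is a made-up title.
title_mapping = {"Mr": 0, "Miss": 1, "Mrs": 2, "Master": 3, "Dr": 3}

mapped = titles.map(title_mapping)          # unknown keys become NaN
mapped_safe = mapped.fillna(3).astype(int)  # send unknowns to "Others"
print(mapped.isna().sum(), mapped_safe.tolist())
```

Every title listed in the value counts above appears in the full dictionary, so the fallback is only a safeguard for data with other titles.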

dataset.head()

	PassengerId	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Title
0	892	3	Kelly, Mr. James	male	34.5	0	0	330911	7.8292	NaN	Q	0
1	893	3	Wilkes, Mrs. James (Ellen Needs)	female	47.0	1	0	363272	7.0000	NaN	S	2
2	894	2	Myles, Mr. Thomas Francis	male	62.0	0	0	240276	9.6875	NaN	Q	0
3	895	3	Wirz, Mr. Albert	male	27.0	0	0	315154	8.6625	NaN	S	0
4	896	3	Hirvonen, Mrs. Alexander (Helga E Lindqvist)	female	22.0	1	1	3101298	12.2875	NaN	S	2

test.head()

	PassengerId	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Title
0	892	3	Kelly, Mr. James	male	34.5	0	0	330911	7.8292	NaN	Q	0
1	893	3	Wilkes, Mrs. James (Ellen Needs)	female	47.0	1	0	363272	7.0000	NaN	S	2
2	894	2	Myles, Mr. Thomas Francis	male	62.0	0	0	240276	9.6875	NaN	Q	0
3	895	3	Wirz, Mr. Albert	male	27.0	0	0	315154	8.6625	NaN	S	0
4	896	3	Hirvonen, Mrs. Alexander (Helga E Lindqvist)	female	22.0	1	1	3101298	12.2875	NaN	S	2

bar_chart('Title')

Survived:
Title
1    127
2     99
0     81
3     35
Name: count, dtype: int64
Dead:
Title
0    436
1     55
3     32
2     26
Name: count, dtype: int64

train.drop('Name', axis=1, inplace=True)
test.drop('Name', axis=1, inplace=True)

train.head()

	Survived	Pclass	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Title
PassengerId
1	0	3	male	22.0	1	0	A/5 21171	7.2500	NaN	S	0
2	1	1	female	38.0	1	0	PC 17599	71.2833	C85	C	2
3	1	3	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S	1
4	1	1	female	35.0	1	0	113803	53.1000	C123	S	2
5	0	3	male	35.0	0	0	373450	8.0500	NaN	S	0

sex_mapping = {"male": 0, "female": 1}
for dataset in train_test_data:
    dataset['Sex'] = dataset['Sex'].map(sex_mapping)

bar_chart('Sex')

Survived:
Sex
1    233
0    109
Name: count, dtype: int64
Dead:
Sex
0    468
1     81
Name: count, dtype: int64

test.head()

	PassengerId	Pclass	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Title
0	892	3	0	34.5	0	0	330911	7.8292	NaN	Q	0
1	893	3	1	47.0	1	0	363272	7.0000	NaN	S	2
2	894	2	0	62.0	0	0	240276	9.6875	NaN	Q	0
3	895	3	0	27.0	0	0	315154	8.6625	NaN	S	0
4	896	3	1	22.0	1	1	3101298	12.2875	NaN	S	2

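# Fill missing ages with the median age of the passenger's Title group.
# Assign back to the column: inplace=True on a column selection triggers a chained-assignment warning in recent pandas.
train["Age"] = train["Age"].fillna(train.groupby("Title")["Age"].transform("median"))
test["Age"] = test["Age"].fillna(test.groupby("Title")["Age"].transform("median"))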
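
How the group-wise fill works is easier to see on a toy frame. In this sketch (made-up ages, two Title groups) `transform("median")` returns one value per row, so it lines up with the column being filled. The result is assigned back to the column rather than filled in place:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Title": [0, 0, 0, 1, 1, 1],
    "Age": [20.0, 40.0, np.nan, 10.0, np.nan, 30.0],
})

# One value per row: the median age of that row's Title group (NaN is ignored).
group_median = df.groupby("Title")["Age"].transform("median")

# Only the missing entries are replaced; known ages are left untouched.
df["Age"] = df["Age"].fillna(group_median)
print(df["Age"].tolist())
```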

train.head(30)
#train.groupby("Title")["Age"].transform("median")

	Survived	Pclass	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Title
PassengerId
1	0	3	0	22.0	1	0	A/5 21171	7.2500	NaN	S	0
2	1	1	1	38.0	1	0	PC 17599	71.2833	C85	C	2
3	1	3	1	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S	1
4	1	1	1	35.0	1	0	113803	53.1000	C123	S	2
5	0	3	0	35.0	0	0	373450	8.0500	NaN	S	0
6	0	3	0	30.0	0	0	330877	8.4583	NaN	Q	0
7	0	1	0	54.0	0	0	17463	51.8625	E46	S	0
8	0	3	0	2.0	3	1	349909	21.0750	NaN	S	3
9	1	3	1	27.0	0	2	347742	11.1333	NaN	S	2
10	1	2	1	14.0	1	0	237736	30.0708	NaN	C	2
11	1	3	1	4.0	1	1	PP 9549	16.7000	G6	S	1
12	1	1	1	58.0	0	0	113783	26.5500	C103	S	1
13	0	3	0	20.0	0	0	A/5. 2151	8.0500	NaN	S	0
14	0	3	0	39.0	1	5	347082	31.2750	NaN	S	0
15	0	3	1	14.0	0	0	350406	7.8542	NaN	S	1
16	1	2	1	55.0	0	0	248706	16.0000	NaN	S	2
17	0	3	0	2.0	4	1	382652	29.1250	NaN	Q	3
18	1	2	0	30.0	0	0	244373	13.0000	NaN	S	0
19	0	3	1	31.0	1	0	345763	18.0000	NaN	S	2
20	1	3	1	35.0	0	0	2649	7.2250	NaN	C	2
21	0	2	0	35.0	0	0	239865	26.0000	NaN	S	0
22	1	2	0	34.0	0	0	248698	13.0000	D56	S	0
23	1	3	1	15.0	0	0	330923	8.0292	NaN	Q	1
24	1	1	0	28.0	0	0	113788	35.5000	A6	S	0
25	0	3	1	8.0	3	1	349909	21.0750	NaN	S	1
26	1	3	1	38.0	1	5	347077	31.3875	NaN	S	2
27	0	3	0	30.0	0	0	2631	7.2250	NaN	C	0
28	0	1	0	19.0	3	2	19950	263.0000	C23 C25 C27	S	0
29	1	3	1	21.0	0	0	330959	7.8792	NaN	Q	1
30	0	3	0	30.0	0	0	349216	7.8958	NaN	S	0

facet = sns.FacetGrid(train, hue="Survived",aspect=4)
facet.map(sns.kdeplot,'Age',fill= True)
facet.set(xlim=(0, train['Age'].max()))
facet.add_legend() 
plt.show()

facet = sns.FacetGrid(train, hue="Survived",aspect=4)
facet.map(sns.kdeplot,'Age',fill= True)
facet.set(xlim=(0, train['Age'].max()))
facet.add_legend() 
plt.xlim(10,50)

(10.0, 50.0)

Passengers aged 20 to 30 form the largest group among both those who died and those who survived.

train.info()
test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Sex       891 non-null    int64  
 3   Age       891 non-null    float64
 4   SibSp     891 non-null    int64  
 5   Parch     891 non-null    int64  
 6   Ticket    891 non-null    object 
 7   Fare      891 non-null    float64
 8   Cabin     204 non-null    object 
 9   Embarked  889 non-null    object 
 10  Title     891 non-null    int64  
dtypes: float64(2), int64(6), object(3)
memory usage: 83.5+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Sex          418 non-null    int64  
 3   Age          418 non-null    float64
 4   SibSp        418 non-null    int64  
 5   Parch        418 non-null    int64  
 6   Ticket       418 non-null    object 
 7   Fare         417 non-null    float64
 8   Cabin        91 non-null     object 
 9   Embarked     418 non-null    object 
 10  Survived     418 non-null    object 
 11  Title        418 non-null    int64  
dtypes: float64(2), int64(6), object(4)
memory usage: 39.3+ KB

Binning

Binning/Converting Numerical Age to Categorical Variable

Feature vector map:

child (up to 16): 0
young (over 16, up to 26): 1
adult (over 26, up to 36): 2
mid-age (over 36, up to 62): 3
senior (over 62): 4

train.head()

	Survived	Pclass	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Title
PassengerId
1	0	3	0	22.0	1	0	A/5 21171	7.2500	NaN	S	0
2	1	1	1	38.0	1	0	PC 17599	71.2833	C85	C	2
3	1	3	1	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S	1
4	1	1	1	35.0	1	0	113803	53.1000	C123	S	2
5	0	3	0	35.0	0	0	373450	8.0500	NaN	S	0

for dataset in train_test_data:
    dataset['Age'] = pd.cut(dataset['Age'],
                           bins=[0,16,26,36,62, float('inf')],
                           labels=[0,1,2,3,4],
                           include_lowest=True)
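
The bin edges are right-inclusive, which matters for ages that sit exactly on an edge. A short sketch with a handful of sample ages (chosen for illustration) and the same bins and labels:

```python
import pandas as pd

ages = pd.Series([2.0, 16.0, 22.0, 26.0, 35.0, 54.0, 62.0, 70.0])

age_group = pd.cut(
    ages,
    bins=[0, 16, 26, 36, 62, float("inf")],
    labels=[0, 1, 2, 3, 4],
    include_lowest=True,  # so an age of exactly 0 would fall in the first bin
)

# Ages equal to an edge (16, 26, 62) land in the lower group.
print(age_group.astype(int).tolist())
```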

train.head()
bar_chart('Age')

Survived:
Age
2    116
1     97
3     69
0     57
4      3
Name: count, dtype: int64
Dead:
Age
2    220
1    158
3    111
0     48
4     12
Name: count, dtype: int64

Pclass1 = train[train['Pclass'] == 1]['Embarked'].value_counts()
Pclass2 = train[train['Pclass'] == 2]['Embarked'].value_counts()
Pclass3 = train[train['Pclass'] == 3]['Embarked'].value_counts()
df = pd.DataFrame([Pclass1,Pclass2,Pclass3])
df.index = ['1st Class','2nd Class','3rd Class']
df.plot(kind = 'bar', stacked =  True, figsize=(10,5))
plt.show()
print("Pclass1:\n",Pclass1)
print("Pclass2:\n",Pclass2)
print("Pclass3:\n",Pclass3)

Pclass1:
 Embarked
S    127
C     85
Q      2
Name: count, dtype: int64
Pclass2:
 Embarked
S    164
C     17
Q      3
Name: count, dtype: int64
Pclass3:
 Embarked
S    353
Q     72
C     66
Name: count, dtype: int64

More than 50% of 1st class passengers embarked at S.

More than 50% of 2nd class passengers embarked at S.

More than 50% of 3rd class passengers embarked at S.

The missing Embarked values are therefore filled with S.

for dataset in train_test_data:
    dataset['Embarked'] =  dataset['Embarked'].fillna('S')
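
The value 'S' is hard-coded here because it is the most frequent port. As a hedged alternative, the most frequent value can be looked up with `Series.mode`. The sketch uses a made-up series with two missing entries:

```python
import pandas as pd

embarked = pd.Series(["S", "C", None, "S", "Q", "S", None])

# mode() ignores missing values by default; take the first (most frequent) entry.
most_common = embarked.mode()[0]
filled = embarked.fillna(most_common)
print(most_common, filled.tolist())
```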

train.head()

	Survived	Pclass	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Title
PassengerId
1	0	3	0	1	1	0	A/5 21171	7.2500	NaN	S	0
2	1	1	1	3	1	0	PC 17599	71.2833	C85	C	2
3	1	3	1	1	0	0	STON/O2. 3101282	7.9250	NaN	S	1
4	1	1	1	2	1	0	113803	53.1000	C123	S	2
5	0	3	0	2	0	0	373450	8.0500	NaN	S	0

embarked_mapping = {'S':0,'C':1,'Q':2}
for dataset in train_test_data:
    dataset['Embarked'] = dataset['Embarked'].map(embarked_mapping)

# train["Fare"].fillna(train.groupby("Pclass")["Fare"])
# train["Fare"].fillna(train.groupby("Pclass")["Fare"].transform("median"), inplace = True)
# test["Fare"].fillna(test.groupby("Pclass")["Fare"].transform("median"), inplace = True)
# train.head(50)


# fill missing Fare with the median fare for each Pclass (assigned back rather than filled in place)
train["Fare"] = train["Fare"].fillna(train.groupby("Pclass")["Fare"].transform("median"))
test["Fare"] = test["Fare"].fillna(test.groupby("Pclass")["Fare"].transform("median"))
train.head(50)

	Survived	Pclass	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Title
PassengerId
1	0	3	0	1	1	0	A/5 21171	7.2500	NaN	0	0
2	1	1	1	3	1	0	PC 17599	71.2833	C85	1	2
3	1	3	1	1	0	0	STON/O2. 3101282	7.9250	NaN	0	1
4	1	1	1	2	1	0	113803	53.1000	C123	0	2
5	0	3	0	2	0	0	373450	8.0500	NaN	0	0
6	0	3	0	2	0	0	330877	8.4583	NaN	2	0
7	0	1	0	3	0	0	17463	51.8625	E46	0	0
8	0	3	0	0	3	1	349909	21.0750	NaN	0	3
9	1	3	1	2	0	2	347742	11.1333	NaN	0	2
10	1	2	1	0	1	0	237736	30.0708	NaN	1	2
11	1	3	1	0	1	1	PP 9549	16.7000	G6	0	1
12	1	1	1	3	0	0	113783	26.5500	C103	0	1
13	0	3	0	1	0	0	A/5. 2151	8.0500	NaN	0	0
14	0	3	0	3	1	5	347082	31.2750	NaN	0	0
15	0	3	1	0	0	0	350406	7.8542	NaN	0	1
16	1	2	1	3	0	0	248706	16.0000	NaN	0	2
17	0	3	0	0	4	1	382652	29.1250	NaN	2	3
18	1	2	0	2	0	0	244373	13.0000	NaN	0	0
19	0	3	1	2	1	0	345763	18.0000	NaN	0	2
20	1	3	1	2	0	0	2649	7.2250	NaN	1	2
21	0	2	0	2	0	0	239865	26.0000	NaN	0	0
22	1	2	0	2	0	0	248698	13.0000	D56	0	0
23	1	3	1	0	0	0	330923	8.0292	NaN	2	1
24	1	1	0	2	0	0	113788	35.5000	A6	0	0
25	0	3	1	0	3	1	349909	21.0750	NaN	0	1
26	1	3	1	3	1	5	347077	31.3875	NaN	0	2
27	0	3	0	2	0	0	2631	7.2250	NaN	1	0
28	0	1	0	1	3	2	19950	263.0000	C23 C25 C27	0	0
29	1	3	1	1	0	0	330959	7.8792	NaN	2	1
30	0	3	0	2	0	0	349216	7.8958	NaN	0	0
31	0	1	0	3	0	0	PC 17601	27.7208	NaN	1	3
32	1	1	1	2	1	0	PC 17569	146.5208	B78	1	2
33	1	3	1	1	0	0	335677	7.7500	NaN	2	1
34	0	2	0	4	0	0	C.A. 24579	10.5000	NaN	0	0
35	0	1	0	2	1	0	PC 17604	82.1708	NaN	1	0
36	0	1	0	3	1	0	113789	52.0000	NaN	0	0
37	1	3	0	2	0	0	2677	7.2292	NaN	1	0
38	0	3	0	1	0	0	A./5. 2152	8.0500	NaN	0	0
39	0	3	1	1	2	0	345764	18.0000	NaN	0	1
40	1	3	1	0	1	0	2651	11.2417	NaN	1	1
41	0	3	1	3	1	0	7546	9.4750	NaN	0	2
42	0	2	1	2	1	0	11668	21.0000	NaN	0	2
43	0	3	0	2	0	0	349253	7.8958	NaN	1	0
44	1	2	1	0	1	2	SC/Paris 2123	41.5792	NaN	1	1
45	1	3	1	1	0	0	330958	7.8792	NaN	2	1
46	0	3	0	2	0	0	S.C./A.4. 23567	8.0500	NaN	0	0
47	0	3	0	2	1	0	370371	15.5000	NaN	2	0
48	1	3	1	1	0	0	14311	7.7500	NaN	2	1
49	0	3	0	2	2	0	2662	21.6792	NaN	1	0
50	0	3	1	1	1	0	349237	17.8000	NaN	0	2

facet = sns.FacetGrid(train, hue="Survived",aspect=4 )
facet.map(sns.kdeplot, 'Fare', fill = True)
facet.set(xlim = (0, train['Fare'].max()))
facet.add_legend()
plt.show()

facet = sns.FacetGrid(train, hue="Survived",aspect=4)
facet.map(sns.kdeplot,'Fare',fill= True)
facet.set(xlim=(0, train['Fare'].max()))
facet.add_legend()
plt.xlim(0, 20)

(0.0, 20.0)

train.head()

	Survived	Pclass	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Title
PassengerId
1	0	3	0	1	1	0	A/5 21171	7.2500	NaN	0	0
2	1	1	1	3	1	0	PC 17599	71.2833	C85	1	2
3	1	3	1	1	0	0	STON/O2. 3101282	7.9250	NaN	0	1
4	1	1	1	2	1	0	113803	53.1000	C123	0	2
5	0	3	0	2	0	0	373450	8.0500	NaN	0	0

train.Cabin.value_counts()

Cabin
B96 B98        4
G6             4
C23 C25 C27    4
C22 C26        3
F33            3
              ..
E34            1
C7             1
C54            1
E36            1
C148           1
Name: count, Length: 147, dtype: int64

for dataset in train_test_data:
    dataset['Cabin'] =  dataset['Cabin'].str[:1]
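
Slicing with `.str[:1]` keeps only the first character of each cabin string, which is the deck letter, and leaves missing values missing. A small sketch using cabin values that appear in the tables above:

```python
import numpy as np
import pandas as pd

cabins = pd.Series(["C85", "C23 C25 C27", "E46", np.nan])

# First character of each string; NaN stays NaN.
deck = cabins.str[:1]
print(deck.tolist())
```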

Pclass1 = train[train['Pclass']==1]['Cabin'].value_counts()
Pclass2 = train[train['Pclass']==2]['Cabin'].value_counts()
Pclass3 = train[train['Pclass']==3]['Cabin'].value_counts()
df = pd.DataFrame([Pclass1, Pclass2, Pclass3])
df.index = ['1st class','2nd class', '3rd class']
df.plot(kind='bar',stacked=True, figsize=(10,5))

<AxesSubplot: >

cabin_mapping = {"A": 0, "B": 0.4, "C": 0.8, "D": 1.2, "E": 1.6, "F": 2, "G": 2.4, "T": 2.8}
for dataset in train_test_data:
    dataset['Cabin'] = dataset['Cabin'].map(cabin_mapping)

# fill missing Cabin with the median Cabin value for each Pclass (assigned back rather than filled in place)
train["Cabin"] = train["Cabin"].fillna(train.groupby("Pclass")["Cabin"].transform("median"))
test["Cabin"] = test["Cabin"].fillna(test.groupby("Pclass")["Cabin"].transform("median"))

Family Size

train["FamilySize"] = train["SibSp"] + train["Parch"] + 1
test["FamilySize"] = test["SibSp"] + test["Parch"] + 1

facet = sns.FacetGrid(train, hue="Survived",aspect=4)
facet.map(sns.kdeplot,'FamilySize',fill= True)
facet.set(xlim=(0, train['FamilySize'].max()))
facet.add_legend()
plt.xlim(0)

(0.0, 11.0)

family_mapping = {1: 0, 2: 0.4, 3: 0.8, 4: 1.2, 5: 1.6, 6: 2, 7: 2.4, 8: 2.8, 9: 3.2, 10: 3.6, 11: 4}
for dataset in train_test_data:
    dataset['FamilySize'] = dataset['FamilySize'].map(family_mapping)
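
The dictionary is an evenly spaced scale: each extra family member adds 0.4. The sketch below checks that the lookup matches the formula (size - 1) * 0.4 for sizes 1 to 11. The formula is an observation about the numbers, not something the notebook itself uses:

```python
import numpy as np
import pandas as pd

family_mapping = {1: 0, 2: 0.4, 3: 0.8, 4: 1.2, 5: 1.6, 6: 2,
                  7: 2.4, 8: 2.8, 9: 3.2, 10: 3.6, 11: 4}

sizes = pd.Series(range(1, 12))
by_lookup = sizes.map(family_mapping)
by_formula = (sizes - 1) * 0.4

# Compare with a tolerance because 0.4 is not exactly representable in binary.
print(np.allclose(by_lookup, by_formula))
```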

train.head()

	Survived	Pclass	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Title	FamilySize
PassengerId
1	0	3	0	1	1	0	A/5 21171	7.2500	2.0	0	0	0.4
2	1	1	1	3	1	0	PC 17599	71.2833	0.8	1	2	0.4
3	1	3	1	1	0	0	STON/O2. 3101282	7.9250	2.0	0	1	0.0
4	1	1	1	2	1	0	113803	53.1000	0.8	0	2	0.4
5	0	3	0	2	0	0	373450	8.0500	2.0	0	0	0.0

features_drop = ['Ticket','SibSp','Parch']
train = train.drop(features_drop, axis = 1)
test = test.drop(features_drop,axis=1)

train_data = train.drop('Survived', axis = 1)
target = train['Survived']
train_data.shape, target.shape

((891, 8), (891,))

train_data.head(10)

	Pclass	Sex	Age	Fare	Cabin	Embarked	Title	FamilySize
PassengerId
1	3	0	1	7.2500	2.0	0	0	0.4
2	1	1	3	71.2833	0.8	1	2	0.4
3	3	1	1	7.9250	2.0	0	1	0.0
4	1	1	2	53.1000	0.8	0	2	0.4
5	3	0	2	8.0500	2.0	0	0	0.0
6	3	0	2	8.4583	2.0	2	0	0.0
7	1	0	3	51.8625	1.6	0	0	0.0
8	3	0	0	21.0750	2.0	0	3	1.6
9	3	1	2	11.1333	2.0	0	2	0.8
10	2	1	0	30.0708	1.8	1	2	0.4

5. Modelling

# Importing Classifier Modules
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier,ExtraTreeClassifier
from sklearn.ensemble import RandomForestClassifier,ExtraTreesClassifier,BaggingClassifier,AdaBoostClassifier,GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

import numpy as np

train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 891 entries, 1 to 891
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   Survived    891 non-null    int64   
 1   Pclass      891 non-null    int64   
 2   Sex         891 non-null    int64   
 3   Age         891 non-null    category
 4   Fare        891 non-null    float64 
 5   Cabin       891 non-null    float64 
 6   Embarked    891 non-null    int64   
 7   Title       891 non-null    int64   
 8   FamilySize  891 non-null    float64 
dtypes: category(1), float64(3), int64(5)
memory usage: 63.7 KB

6. Cross Validation (k-fold)

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
k_fold = KFold(n_splits=10, shuffle=True, random_state=0)
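
With 891 rows and 10 splits the folds cannot all be the same size. A sketch that counts the validation rows in each fold, using a placeholder array with one row per training passenger instead of the real features:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(891).reshape(-1, 1)  # placeholder rows, one per training passenger
k_fold = KFold(n_splits=10, shuffle=True, random_state=0)

# Each split yields (train indices, validation indices); keep the validation size.
fold_sizes = [len(test_idx) for _, test_idx in k_fold.split(X)]
print(fold_sizes)
```

The first fold holds 90 rows and the other nine hold 89, so each accuracy score is computed on roughly a tenth of the training data.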

clf = KNeighborsClassifier(n_neighbors = 13)
scoring = 'accuracy'
score = cross_val_score(clf, train_data, target, cv=k_fold, n_jobs=1, scoring=scoring)
print(score)

[0.7        0.75280899 0.7752809  0.70786517 0.76404494 0.74157303
 0.76404494 0.74157303 0.74157303 0.78651685]

#learning_rates = [0.05, 0.1, 0.25, 0.5, 0.75, 1]
clf = [KNeighborsClassifier(n_neighbors = 13),DecisionTreeClassifier(),
       RandomForestClassifier(n_estimators=13),GaussianNB(),SVC(),ExtraTreeClassifier(),
      GradientBoostingClassifier(n_estimators=10, learning_rate=1,max_features=3, max_depth =3, random_state = 10),AdaBoostClassifier(algorithm='SAMME'),ExtraTreesClassifier()]
def model_fit():
    scoring = 'accuracy'
    for i in range(len(clf)):
        score = cross_val_score(clf[i], train_data, target, cv=k_fold, n_jobs=1, scoring=scoring)
        print("Score of Model",i,":",round(np.mean(score)*100,2))
#     round(np.mean(score)*100,2)
#     print("Score of :\n",score)
model_fit()

Score of Model 0 : 74.75
Score of Model 1 : 79.8
Score of Model 2 : 80.03
Score of Model 3 : 79.46
Score of Model 4 : 67.45
Score of Model 5 : 78.57
Score of Model 6 : 81.93
Score of Model 7 : 81.59
Score of Model 8 : 79.24
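
The scores above are labelled only by position in the clf list (for example, position 4 is SVC and position 6 is GradientBoostingClassifier). A sketch of the same loop that reports the class name instead. It runs on synthetic data from `make_classification`, an assumption made so the snippet works without the Titanic files, so its accuracies say nothing about the results above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data so the snippet runs without the Titanic files.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
k_fold = KFold(n_splits=10, shuffle=True, random_state=0)

models = [DecisionTreeClassifier(random_state=0), GaussianNB()]
scores = {}
for model in models:
    fold_scores = cross_val_score(model, X, y, cv=k_fold, scoring="accuracy")
    # Key each mean accuracy (in percent) by the estimator's class name.
    scores[type(model).__name__] = round(float(np.mean(fold_scores)) * 100, 2)

for name, score in scores.items():
    print(f"{name}: {score}")
```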

clf1 = SVC()
clf1.fit(train_data, target)

# Drop the columns that were not used for training
test_data = test.drop(['Survived','PassengerId'], axis=1)
prediction = clf1.predict(test_data)

# Build the submission with one row per test passenger
submission = pd.DataFrame({'PassengerId': test['PassengerId'], 'Survived': prediction})
submission.to_csv("Submission.csv", index=False)
