Titanic Survival Prediction
Challenge:
We want you to use the Titanic passenger data (name, age, ticket price, etc.) to forecast who will survive
and who will perish.
Overview:
In this article, we examine the training and test data, which contain passenger-specific information.
Using this information, we predict which passengers survive.
Environment and Libraries:
We implemented the project in a Python 3 environment. The libraries used are:
• NumPy: for performing mathematical operations on arrays.
• Pandas: for reading and analyzing the data.
• Matplotlib: for visualizing data in the form of graphs.
• scikit-learn (sklearn): for implementing machine learning techniques.
• Seaborn: for visualization of data.
Data:
Three datasets have been provided for us to use, namely train.csv, test.csv and
gender_submission.csv. The files can be found at the following paths:
• /kaggle/input/titanic/train.csv
• /kaggle/input/titanic/test.csv
• /kaggle/input/titanic/gender_submission.csv
The first step is to load the data from the above paths. For this we use the pandas library and its
read_csv function.
Loading the train.csv file
Loading the test.csv file
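The loading step can be sketched roughly as follows. This is a minimal sketch, not the original notebook code: load_csv is a hypothetical helper name, and the existence check is only there so the snippet also runs outside Kaggle, where these paths are not available.

```python
from pathlib import Path

import pandas as pd

# Paths as listed above; they only exist inside a Kaggle notebook session.
TRAIN_PATH = "/kaggle/input/titanic/train.csv"
TEST_PATH = "/kaggle/input/titanic/test.csv"


def load_csv(path):
    """Read one CSV file into a DataFrame (load_csv is a hypothetical helper)."""
    return pd.read_csv(path)


# Guarded so the sketch also runs where the Kaggle files are not mounted.
if Path(TRAIN_PATH).exists() and Path(TEST_PATH).exists():
    train_data = load_csv(TRAIN_PATH)
    test_data = load_csv(TEST_PATH)
    print(train_data.shape)  # (rows, columns) of the training set
    print(test_data.shape)   # (rows, columns) of the test set
```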
train.csv – This dataset provides the passenger information for a subset of passengers. It has a total of 891
rows, each corresponding to one passenger.
test.csv – This dataset contains the data for the other 418 passengers, whose survival we have to predict.
gender_submission.csv – This is a sample file which shows how to structure our predictions. It
assumes that all the male passengers died and all the female passengers survived.
Exploration of Patterns:
First, we check to what degree the assumption made in the gender_submission.csv file, that all female
passengers survive and all male passengers do not, holds in the training data.
We calculated the survival rate of the female passengers in the training data, which came out to 74.2%.
In the same way, we calculated the survival rate of the male passengers on the Titanic.
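A survival rate per sex can be computed along these lines. This is a sketch under assumptions: survival_rate is a hypothetical helper, the column names Sex and Survived follow the Kaggle train.csv layout, and the small table is made-up toy data for illustration, not the real passenger list.

```python
import pandas as pd


def survival_rate(df, sex):
    """Fraction of passengers of the given sex with Survived == 1."""
    survived = df.loc[df["Sex"] == sex, "Survived"]
    return survived.mean()


# Toy rows for illustration only; these are NOT rows from train.csv.
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Survived": [1, 0, 0, 0, 1, 1],
})

print(f"female: {survival_rate(toy, 'female'):.1%}")
print(f"male:   {survival_rate(toy, 'male'):.1%}")
```

With the real training data loaded into a DataFrame, the same function would be called on that DataFrame instead of the toy table.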
Contribution:
We first import the matplotlib and seaborn libraries, which we use later. Before predicting survival,
we need to preprocess the data. We remove the unnecessary columns such as PassengerId, Name, Ticket
and Cabin, and then fill the empty cells with the median value of their column.
Loading the data using pandas
Removing the unneeded columns from the data
Filling the null values with the column median.
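The two preprocessing steps can be sketched as below. The helper name drop_and_fill and the toy rows are assumptions made for illustration. A median only exists for numeric columns, so this sketch leaves non-numeric columns such as Embarked untouched; how their gaps were filled in the original notebook is not stated.

```python
import numpy as np
import pandas as pd

# Columns named in the text as unneeded for prediction.
DROP_COLS = ["PassengerId", "Name", "Ticket", "Cabin"]


def drop_and_fill(df):
    """Drop the unneeded columns, then fill numeric gaps with the column median."""
    out = df.drop(columns=DROP_COLS, errors="ignore")
    num_cols = out.select_dtypes(include="number").columns
    out[num_cols] = out[num_cols].fillna(out[num_cols].median())
    return out


# Toy rows for illustration only; these are NOT rows from train.csv.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Name": ["A", "B", "C"],
    "Ticket": ["T1", "T2", "T3"],
    "Cabin": [np.nan, "C85", np.nan],
    "Age": [22.0, np.nan, 30.0],
    "Fare": [7.25, 71.28, 8.05],
    "Survived": [0, 1, 1],
})

cleaned = drop_and_fill(toy)
print(cleaned)
```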
The column Embarked has the categorical values ‘S’, ‘C’ and ‘Q’. Using an encoder, we transform these
values into ‘1’s and ‘0’s spread over three new columns, namely Embarked_C, Embarked_Q and Embarked_S.
This is a one-hot encoding; a label encoder alone would produce a single integer column. The ‘Sex’
column has also been transformed from categorical values to ‘1’s and ‘0’s.
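One way to get this result is sketched below. It assumes pandas get_dummies for Embarked and sklearn's LabelEncoder for Sex; the exact split between the two tools in the original notebook is not shown, so treat this division as an assumption. The toy rows are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy rows for illustration only; these are NOT rows from train.csv.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

# One-hot encode Embarked into Embarked_C, Embarked_Q and Embarked_S (0/1 integers).
encoded = pd.get_dummies(toy, columns=["Embarked"], dtype=int)

# LabelEncoder assigns codes in sorted class order, so female -> 0 and male -> 1.
encoded["Sex"] = LabelEncoder().fit_transform(encoded["Sex"])

print(encoded)
```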
We plot a heatmap to see the relationships between the different columns.
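A heatmap of this kind could be produced as follows. This sketch assumes the heatmap shows pairwise correlations between numeric columns, which the text does not state explicitly. The toy table, the colour map and the output file name heatmap.png are illustrative choices.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; remove this line inside a notebook

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy rows for illustration only; these are NOT rows from train.csv.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass": [3, 1, 3, 3, 1, 2],
    "Sex": [1, 0, 0, 1, 0, 1],
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0, 27.0],
    "Fare": [7.25, 71.28, 7.92, 8.05, 51.86, 13.0],
})

# Pairwise Pearson correlation between the numeric columns.
corr = toy.select_dtypes(include="number").corr()

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation between columns (toy data)")
fig.tight_layout()
fig.savefig("heatmap.png")  # hypothetical output file name
```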
In the next step, we train a model to predict the survival of the passengers. We first remove the
Survived column from the data so that it serves as the target and the remaining columns as the features.
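Separating the target from the features can look like this. The helper name split_features_target and the toy rows are assumptions made for illustration.

```python
import pandas as pd


def split_features_target(df, target="Survived"):
    """Return (X, y): every column except the target, and the target column."""
    X = df.drop(columns=[target])
    y = df[target]
    return X, y


# Toy rows for illustration only; these are NOT rows from train.csv.
toy = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Sex": [1, 0, 1],
    "Survived": [0, 1, 0],
})

X_train, y_train = split_features_target(toy)
print(list(X_train.columns))
print(y_train.tolist())
```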
After this, we read and transform the data in the test.csv file, applying the same steps that we
performed on the train.csv data.
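Since the same steps run on both files, it can help to confirm that the test features end up with the same columns, in the same order, as the training features. The sketch below is a precaution we are assuming here, not a step described in the original notebook. align_to_train is a hypothetical helper and the toy table is illustrative.

```python
import pandas as pd


def align_to_train(X_test, train_columns):
    """Give X_test the same columns, in the same order, as the training features."""
    # Columns missing from X_test are created and filled with 0.
    return X_test.reindex(columns=train_columns, fill_value=0)


# Feature columns as they would look after preprocessing the training data.
train_cols = ["Pclass", "Sex", "Embarked_C", "Embarked_Q", "Embarked_S"]

# Toy test features with a different column order and no Embarked_Q column.
toy_test = pd.DataFrame({
    "Sex": [1, 0],
    "Pclass": [3, 1],
    "Embarked_S": [1, 0],
    "Embarked_C": [0, 1],
})

aligned = align_to_train(toy_test, train_cols)
print(aligned)
```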
After preprocessing the test data, we need to predict the survival of each passenger. This is done
using the XGBClassifier. XGBClassifier is the classifier provided by the XGBoost library, a
gradient-boosted decision tree model that works well on tabular, structured data.
Finally, we predict survival for the passengers in the test data and save the predictions in the
‘submission_xgb.csv’ file.
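Writing the predictions to a file can be sketched like this. The two-column layout (PassengerId, Survived) follows the structure of gender_submission.csv. make_submission is a hypothetical helper, and the ids and predictions below are illustrative values, not real results.

```python
import pandas as pd


def make_submission(passenger_ids, predictions, path="submission_xgb.csv"):
    """Write PassengerId/Survived pairs to a CSV file and return the DataFrame."""
    submission = pd.DataFrame({
        "PassengerId": list(passenger_ids),
        "Survived": [int(p) for p in predictions],
    })
    submission.to_csv(path, index=False)  # no index column in the output file
    return submission


# Illustrative ids and predictions only; NOT actual model output.
submission = make_submission([892, 893, 894], [0, 1, 0])
print(submission)
```

In the real workflow, the ids would come from the PassengerId column of test.csv (kept aside before that column is dropped) and the predictions from the trained model.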
Final Output:
The final output is the file submission_xgb.csv, which holds the predicted survival value for each passenger in test.csv.