\n",
+ "Great!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.feature_selection import RFE\n",
+ "from sklearn.linear_model import LogisticRegression\n",
+ "\n",
+ "# separate array into input and output components\n",
+ "array_1 = Train.values\n",
+ "X = array_1[:,0:11]\n",
+ "Y = array_1[:,11]\n",
+ "# feature extraction: keep the 8 highest-ranked features\n",
+ "model = LogisticRegression()\n",
+ "rfe = RFE(model, n_features_to_select=8)\n",
+ "fit = rfe.fit(rescaledX, Y)\n",
+ "print(\"Num Features: \", fit.n_features_)\n",
+ "print(\"Selected Features:\", fit.support_)\n",
+ "print(\"Feature Ranking: \", fit.ranking_)\n",
+ "np.random.seed(42)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Evaluating the performance of classification models\n",
+ "\n",
+ "## The purpose is to estimate how well an algorithm performs on unseen data, so that its predictions on new data can be trusted. Scoring a model on the same data it was trained on hides overfitting and can yield near-perfect scores that are unrealistic."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Split into train and test data\n",
+ "## This method splits the labelled training data into a training set and a test set. The model is built and trained on the training set, and its accuracy is then measured on the held-out test set, without touching the actual data you want to predict on. This shows how well the model is likely to perform on data it has not seen.\n",
+ "\n",
+ "\n",
+ "Great!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# KNeighborsClassifier model\n",
+ "## This classifier assigns each sample the class chosen by a vote among its k nearest neighbours in the training set (k = 5 by default). With the default settings used here every neighbour's vote counts equally; passing weights='distance' makes nearer neighbours contribute more than distant ones"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.model_selection import train_test_split\n",
+ "from sklearn.neighbors import KNeighborsClassifier\n",
+ "\n",
+ "test_size = 0.33\n",
+ "seed = 7\n",
+ "X_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\n",
+ "random_state=seed)\n",
+ "model = KNeighborsClassifier()\n",
+ "model.fit(X_train, Y_train)\n",
+ "result = model.score(X_test, Y_test)\n",
+ "print(\"Accuracy: \", (result*100.0))\n",
+ "np.random.seed(42)"
+ ]
+ },
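+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## As a rough illustration of the two voting options, the sketch below compares equal votes with distance-weighted votes. The data is synthetic (generated with `make_classification`) and the `*_demo` names are made up for this example, so the scores say nothing about the real dataset."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.datasets import make_classification\n",
+ "from sklearn.model_selection import train_test_split\n",
+ "from sklearn.neighbors import KNeighborsClassifier\n",
+ "\n",
+ "# synthetic stand-in data (an assumption), not the notebook's dataset\n",
+ "X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)\n",
+ "Xtr_demo, Xte_demo, ytr_demo, yte_demo = train_test_split(X_demo, y_demo, test_size=0.33, random_state=7)\n",
+ "\n",
+ "for weights in ('uniform', 'distance'):\n",
+ "    # 'uniform': every neighbour votes equally; 'distance': closer neighbours count more\n",
+ "    knn_demo = KNeighborsClassifier(n_neighbors=5, weights=weights)\n",
+ "    knn_demo.fit(Xtr_demo, ytr_demo)\n",
+ "    print(weights, knn_demo.score(Xte_demo, yte_demo) * 100.0)"
+ ]
+ },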
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Logistic Regression\n",
+ "## Models the probability that a sample belongs to a class as a logistic (sigmoid) function of a linear combination of the input features"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.model_selection import train_test_split\n",
+ "from sklearn.linear_model import LogisticRegression\n",
+ "\n",
+ "test_size = 0.33\n",
+ "seed = 7\n",
+ "X_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\n",
+ "random_state=seed)\n",
+ "model = LogisticRegression()\n",
+ "model.fit(X_train, Y_train)\n",
+ "result = model.score(X_test, Y_test)\n",
+ "print(\"Accuracy: \", (result*100.0))\n",
+ "np.random.seed(42)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# GaussianNB\n",
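+ "## Based on applying Bayes’ theorem with the assumption of conditional independence between every pair of features given the value of the class variable. GaussianNB additionally assumes that each feature is normally distributed within each class. Despite these simplifying assumptions, the model often works well in practice."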
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.model_selection import train_test_split\n",
+ "from sklearn.naive_bayes import GaussianNB\n",
+ "\n",
+ "test_size = 0.33\n",
+ "seed = 7\n",
+ "X_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\n",
+ "random_state=seed)\n",
+ "model = GaussianNB()\n",
+ "model.fit(X_train, Y_train)\n",
+ "result = model.score(X_test, Y_test)\n",
+ "print(\"Accuracy: \", (result*100.0))\n",
+ "np.random.seed(42)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Linear Discriminant Analysis\n",
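+ "## A classifier that models each class with a Gaussian density sharing one covariance matrix across all classes, which gives linear decision boundaries. It can also be used as a dimensionality reduction technique that projects the data onto the directions that best separate the classes"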
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.model_selection import train_test_split\n",
+ "from sklearn.discriminant_analysis import LinearDiscriminantAnalysis\n",
+ "\n",
+ "test_size = 0.33\n",
+ "seed = 7\n",
+ "X_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\n",
+ "random_state=seed)\n",
+ "model = LinearDiscriminantAnalysis()\n",
+ "model.fit(X_train, Y_train)\n",
+ "result = model.score(X_test, Y_test)\n",
+ "print(\"Accuracy: \", (result*100.0))\n",
+ "np.random.seed(42)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Quadratic Discriminant Analysis\n",
+ "\n",
+ "## A variation of LDA that is useful when individual classes are known to have distinct covariances. QDA drops the shared-covariance assumption and fits a separate covariance matrix for each class. This extra flexibility can improve prediction performance, at the cost of more parameters to estimate"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.model_selection import train_test_split\n",
+ "from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis\n",
+ "\n",
+ "test_size = 0.33\n",
+ "seed = 7\n",
+ "X_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\n",
+ "random_state=seed)\n",
+ "model = QuadraticDiscriminantAnalysis()\n",
+ "model.fit(X_train, Y_train)\n",
+ "result = model.score(X_test, Y_test)\n",
+ "print(\"Accuracy: \", (result*100.0))\n",
+ "np.random.seed(42)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Stochastic Gradient Descent\n",
+ "\n",
+ "## An optimization method that iteratively adjusts the parameters of a model to minimize a cost function. Stochastic means that each update is computed from a randomly selected sample (or a few samples) rather than from the whole data set. This makes each update computationally cheap, and on larger datasets it often converges faster because the parameters are updated more frequently. SGDClassifier uses this method to fit a linear classifier; with the default hinge loss this is a linear support vector machine"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.model_selection import train_test_split\n",
+ "from sklearn.linear_model import SGDClassifier\n",
+ "\n",
+ "test_size = 0.33\n",
+ "seed = 7\n",
+ "X_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\n",
+ "random_state=seed)\n",
+ "model = SGDClassifier()\n",
+ "model.fit(X_train, Y_train)\n",
+ "result = model.score(X_test, Y_test)\n",
+ "print(\"Accuracy: \", (result*100.0))\n",
+ "np.random.seed(42)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Decision Tree classifier\n",
+ "## It is a predictive modeling approach that uses a decision tree to go from observations about an item to conclusions about the item's target value"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.model_selection import train_test_split\n",
+ "from sklearn.tree import DecisionTreeClassifier\n",
+ "\n",
+ "test_size = 0.33\n",
+ "seed = 7\n",
+ "X_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\n",
+ "random_state=seed)\n",
+ "model = DecisionTreeClassifier()\n",
+ "model.fit(X_train, Y_train)\n",
+ "result = model.score(X_test, Y_test)\n",
+ "print(\"Accuracy: \", (result*100.0))\n",
+ "np.random.seed(42)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Random Forest Classifier\n",
+ "\n",
+ "## This is a model that grows multiple decision trees and classifies a sample based on the votes of all the trees. Averaging over many trees reduces the overfitting (high variance) that a single decision tree is prone to"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.model_selection import train_test_split\n",
+ "from sklearn.ensemble import RandomForestClassifier\n",
+ "\n",
+ "test_size = 0.33\n",
+ "seed = 7\n",
+ "X_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\n",
+ "random_state=seed)\n",
+ "model = RandomForestClassifier()\n",
+ "model.fit(X_train, Y_train)\n",
+ "result = model.score(X_test, Y_test)\n",
+ "print(\"Accuracy: \", (result*100.0))\n",
+ "np.random.seed(42)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Support Vector Machine\n",
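+ "## A supervised machine learning model that separates two classes with the decision boundary that has the largest margin to the nearest training samples. With the default RBF kernel used by SVC, the boundary can be non-linear"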
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.model_selection import train_test_split\n",
+ "from sklearn.svm import SVC\n",
+ "\n",
+ "test_size = 0.33\n",
+ "seed = 7\n",
+ "X_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\n",
+ "random_state=seed)\n",
+ "model = SVC()\n",
+ "model.fit(X_train, Y_train)\n",
+ "result = model.score(X_test, Y_test)\n",
+ "print(\"Accuracy: \", (result*100.0))\n",
+ "np.random.seed(42)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# MLP Classifier\n",
+ "\n",
+ "## This is a multi-layer perceptron, a neural network trained with a supervised learning method called backpropagation. It can distinguish data that is not linearly separable"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.model_selection import train_test_split\n",
+ "from sklearn.neural_network import MLPClassifier\n",
+ "\n",
+ "test_size = 0.33\n",
+ "seed = 7\n",
+ "X_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\n",
+ "random_state=seed)\n",
+ "model = MLPClassifier()\n",
+ "model.fit(X_train, Y_train)\n",
+ "result = model.score(X_test, Y_test)\n",
+ "print(\"Accuracy: \", (result*100.0))\n",
+ "np.random.seed(42)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Confusion matrix\n",
+ "## This is a table that is often used to describe the performance of a classification model on a set of test data for which the true values are known. It allows the visualization of the performance of an algorithm. The ideal matrix has 0 false positives and 0 false negatives, indicating perfect performance. \n",
+ "\n",
+ "## I added code to compute the Matthews correlation coefficient (MCC), a measure of classification quality that takes all four cells of the confusion matrix into account. It ranges from -1 to 1 and is more informative than accuracy when the classes are imbalanced."
+ ]
+ },
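+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## To make the link between the confusion matrix and the MCC concrete, the sketch below recomputes the MCC by hand from the four cells of a binary confusion matrix and prints it next to the value from `matthews_corrcoef`. The labels are invented for illustration and the variable names are not used elsewhere in this notebook."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "from sklearn.metrics import confusion_matrix, matthews_corrcoef\n",
+ "\n",
+ "# toy labels invented for illustration, not taken from the dataset\n",
+ "y_true_demo = np.array([1, 1, 1, 0, 0, 0, 0, 1])\n",
+ "y_pred_demo = np.array([1, 0, 1, 0, 0, 1, 0, 1])\n",
+ "\n",
+ "# for binary labels, ravel() returns the cells in the order TN, FP, FN, TP\n",
+ "tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()\n",
+ "\n",
+ "# MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))\n",
+ "mcc_manual = (tp * tn - fp * fn) / np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))\n",
+ "\n",
+ "print('TN, FP, FN, TP:', tn, fp, fn, tp)\n",
+ "print('MCC by hand:', mcc_manual)\n",
+ "print('MCC from sklearn:', matthews_corrcoef(y_true_demo, y_pred_demo))"
+ ]
+ },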
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## KNeighborsClassifier"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.metrics import confusion_matrix\n",
+ "from sklearn.metrics import matthews_corrcoef\n",
+ "\n",
+ "test_size = 0.33\n",
+ "seed = 7\n",
+ "X_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\n",
+ "random_state=seed)\n",
+ "\n",
+ "model = KNeighborsClassifier()\n",
+ "model.fit(X_train, Y_train)\n",
+ "\n",
+ "predicted = model.predict(X_test)\n",
+ "MCC = matthews_corrcoef(Y_test, predicted)\n",
+ "matrix = confusion_matrix(Y_test, predicted)\n",
+ "print(matrix)\n",
+ "print(MCC)\n",
+ "np.random.seed(42)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Logistic Regression"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.metrics import confusion_matrix\n",
+ "from sklearn.metrics import matthews_corrcoef\n",
+ "\n",
+ "test_size = 0.33\n",
+ "seed = 7\n",
+ "X_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\n",
+ "random_state=seed)\n",
+ "\n",
+ "model = LogisticRegression()\n",
+ "model.fit(X_train, Y_train)\n",
+ "\n",
+ "predicted = model.predict(X_test)\n",
+ "matrix = confusion_matrix(Y_test, predicted)\n",
+ "MCC = matthews_corrcoef(Y_test, predicted)\n",
+ "print(matrix)\n",
+ "print(MCC)\n",
+ "np.random.seed(42)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## GaussianNB"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.metrics import confusion_matrix\n",
+ "from sklearn.metrics import matthews_corrcoef\n",
+ "\n",
+ "test_size = 0.33\n",
+ "seed = 7\n",
+ "X_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\n",
+ "random_state=seed)\n",
+ "\n",
+ "model = GaussianNB()\n",
+ "model.fit(X_train, Y_train)\n",
+ "\n",
+ "predicted = model.predict(X_test)\n",
+ "matrix = confusion_matrix(Y_test, predicted)\n",
+ "MCC = matthews_corrcoef(Y_test, predicted)\n",
+ "print(matrix)\n",
+ "print(MCC)\n",
+ "np.random.seed(42)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Linear Discriminant Analysis"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.metrics import confusion_matrix\n",
+ "from sklearn.metrics import matthews_corrcoef\n",
+ "\n",
+ "test_size = 0.33\n",
+ "seed = 7\n",
+ "X_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\n",
+ "random_state=seed)\n",
+ "\n",
+ "model = LinearDiscriminantAnalysis()\n",
+ "model.fit(X_train, Y_train)\n",
+ "\n",
+ "predicted = model.predict(X_test)\n",
+ "matrix = confusion_matrix(Y_test, predicted)\n",
+ "MCC = matthews_corrcoef(Y_test, predicted)\n",
+ "print(matrix)\n",
+ "print(MCC)\n",
+ "np.random.seed(42)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Quadratic Discriminant Analysis"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.metrics import confusion_matrix\n",
+ "from sklearn.metrics import matthews_corrcoef\n",
+ "\n",
+ "test_size = 0.33\n",
+ "seed = 7\n",
+ "X_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\n",
+ "random_state=seed)\n",
+ "\n",
+ "model = QuadraticDiscriminantAnalysis()\n",
+ "model.fit(X_train, Y_train)\n",
+ "\n",
+ "predicted = model.predict(X_test)\n",
+ "matrix = confusion_matrix(Y_test, predicted)\n",
+ "MCC = matthews_corrcoef(Y_test, predicted)\n",
+ "print(matrix)\n",
+ "print(MCC)\n",
+ "np.random.seed(42)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Stochastic Gradient Descent"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.metrics import confusion_matrix\n",
+ "from sklearn.metrics import matthews_corrcoef\n",
+ "\n",
+ "test_size = 0.33\n",
+ "seed = 7\n",
+ "X_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\n",
+ "random_state=seed)\n",
+ "\n",
+ "model = SGDClassifier()\n",
+ "model.fit(X_train, Y_train)\n",
+ "\n",
+ "predicted = model.predict(X_test)\n",
+ "matrix = confusion_matrix(Y_test, predicted)\n",
+ "MCC = matthews_corrcoef(Y_test, predicted)\n",
+ "print(matrix)\n",
+ "print(MCC)\n",
+ "np.random.seed(42)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Decision Tree Classifier"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.metrics import confusion_matrix\n",
+ "from sklearn.metrics import matthews_corrcoef\n",
+ "\n",
+ "test_size = 0.33\n",
+ "seed = 7\n",
+ "X_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\n",
+ "random_state=seed)\n",
+ "\n",
+ "model = DecisionTreeClassifier()\n",
+ "model.fit(X_train, Y_train)\n",
+ "\n",
+ "predicted = model.predict(X_test)\n",
+ "matrix = confusion_matrix(Y_test, predicted)\n",
+ "MCC = matthews_corrcoef(Y_test, predicted)\n",
+ "print(matrix)\n",
+ "print(MCC)\n",
+ "np.random.seed(42)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Random Forest Classifier"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.metrics import confusion_matrix\n",
+ "from sklearn.metrics import matthews_corrcoef\n",
+ "\n",
+ "test_size = 0.33\n",
+ "seed = 7\n",
+ "X_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\n",
+ "random_state=seed)\n",
+ "\n",
+ "model = RandomForestClassifier()\n",
+ "model.fit(X_train, Y_train)\n",
+ "\n",
+ "predicted = model.predict(X_test)\n",
+ "matrix = confusion_matrix(Y_test, predicted)\n",
+ "MCC = matthews_corrcoef(Y_test, predicted)\n",
+ "print(matrix)\n",
+ "print(MCC)\n",
+ "np.random.seed(42)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Support Vector Machine"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.metrics import confusion_matrix\n",
+ "from sklearn.metrics import matthews_corrcoef\n",
+ "\n",
+ "\n",
+ "test_size = 0.33\n",
+ "seed = 7\n",
+ "X_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\n",
+ "random_state=seed)\n",
+ "\n",
+ "model = SVC()\n",
+ "model.fit(X_train, Y_train)\n",
+ "\n",
+ "predicted = model.predict(X_test)\n",
+ "matrix = confusion_matrix(Y_test, predicted)\n",
+ "MCC = matthews_corrcoef(Y_test, predicted)\n",
+ "print(matrix)\n",
+ "print(MCC)\n",
+ "np.random.seed(42)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.metrics import confusion_matrix\n",
+ "from sklearn.metrics import matthews_corrcoef\n",
+ "\n",
+ "\n",
+ "test_size = 0.33\n",
+ "seed = 7\n",
+ "X_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\n",
+ "random_state=seed)\n",
+ "\n",
+ "model = MLPClassifier()\n",
+ "model.fit(X_train, Y_train)\n",
+ "\n",
+ "predicted = model.predict(X_test)\n",
+ "matrix = confusion_matrix(Y_test, predicted)\n",
+ "MCC = matthews_corrcoef(Y_test, predicted)\n",
+ "print(matrix)\n",
+ "print(MCC)\n",
+ "np.random.seed(42)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Classification Report\n",
+ "\n",
+ "## This is a convenient report that provides precision, recall, F1-score and support for each class. This helps to provide a quick idea of the per-class performance of the model \n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## KNeighborsClassifier"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.metrics import classification_report\n",
+ "\n",
+ "test_size = 0.33\n",
+ "seed = 7\n",
+ "X_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\n",
+ "random_state=seed)\n",
+ "model = KNeighborsClassifier()\n",
+ "model.fit(X_train, Y_train)\n",
+ "predicted = model.predict(X_test)\n",
+ "report = classification_report(Y_test, predicted)\n",
+ "print(report)\n",
+ "np.random.seed(42)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Logistic Regression"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.metrics import classification_report\n",
+ "\n",
+ "test_size = 0.33\n",
+ "seed = 7\n",
+ "X_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\n",
+ "random_state=seed)\n",
+ "model = LogisticRegression()\n",
+ "model.fit(X_train, Y_train)\n",
+ "predicted = model.predict(X_test)\n",
+ "report = classification_report(Y_test, predicted)\n",
+ "print(report)\n",
+ "np.random.seed(42)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## GaussianNB"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.metrics import classification_report\n",
+ "\n",
+ "test_size = 0.33\n",
+ "seed = 7\n",
+ "X_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\n",
+ "random_state=seed)\n",
+ "model = GaussianNB()\n",
+ "model.fit(X_train, Y_train)\n",
+ "predicted = model.predict(X_test)\n",
+ "report = classification_report(Y_test, predicted)\n",
+ "print(report)\n",
+ "np.random.seed(42)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Linear Discriminant Analysis"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.metrics import classification_report\n",
+ "\n",
+ "test_size = 0.33\n",
+ "seed = 7\n",
+ "X_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\n",
+ "random_state=seed)\n",
+ "model = LinearDiscriminantAnalysis()\n",
+ "model.fit(X_train, Y_train)\n",
+ "predicted = model.predict(X_test)\n",
+ "report = classification_report(Y_test, predicted)\n",
+ "print(report)\n",
+ "np.random.seed(42)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Quadratic Discriminant Analysis"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.metrics import classification_report\n",
+ "\n",
+ "test_size = 0.33\n",
+ "seed = 7\n",
+ "X_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\n",
+ "random_state=seed)\n",
+ "model = QuadraticDiscriminantAnalysis()\n",
+ "model.fit(X_train, Y_train)\n",
+ "predicted = model.predict(X_test)\n",
+ "report = classification_report(Y_test, predicted)\n",
+ "print(report)\n",
+ "np.random.seed(42)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Stochastic Gradient Descent"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.metrics import classification_report\n",
+ "\n",
+ "test_size = 0.33\n",
+ "seed = 7\n",
+ "X_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\n",
+ "random_state=seed)\n",
+ "model = SGDClassifier()\n",
+ "model.fit(X_train, Y_train)\n",
+ "predicted = model.predict(X_test)\n",
+ "report = classification_report(Y_test, predicted)\n",
+ "print(report)\n",
+ "np.random.seed(42)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Decision Tree Classifier"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.metrics import classification_report\n",
+ "\n",
+ "test_size = 0.33\n",
+ "seed = 7\n",
+ "X_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\n",
+ "random_state=seed)\n",
+ "model = DecisionTreeClassifier()\n",
+ "model.fit(X_train, Y_train)\n",
+ "predicted = model.predict(X_test)\n",
+ "report = classification_report(Y_test, predicted)\n",
+ "print(report)\n",
+ "np.random.seed(42)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Random Forest Classifier"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.metrics import classification_report\n",
+ "\n",
+ "test_size = 0.33\n",
+ "seed = 7\n",
+ "X_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\n",
+ "random_state=seed)\n",
+ "model = RandomForestClassifier()\n",
+ "model.fit(X_train, Y_train)\n",
+ "predicted = model.predict(X_test)\n",
+ "report = classification_report(Y_test, predicted)\n",
+ "print(report)\n",
+ "np.random.seed(42)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Support Vector Machine"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.metrics import classification_report\n",
+ "\n",
+ "test_size = 0.33\n",
+ "seed = 7\n",
+ "X_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\n",
+ "random_state=seed)\n",
+ "model = SVC()\n",
+ "model.fit(X_train, Y_train)\n",
+ "predicted = model.predict(X_test)\n",
+ "report = classification_report(Y_test, predicted)\n",
+ "print(report)\n",
+ "np.random.seed(42)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.metrics import classification_report\n",
+ "\n",
+ "test_size = 0.33\n",
+ "seed = 7\n",
+ "X_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\n",
+ "random_state=seed)\n",
+ "model = MLPClassifier()\n",
+ "model.fit(X_train, Y_train)\n",
+ "predicted = model.predict(X_test)\n",
+ "report = classification_report(Y_test, predicted)\n",
+ "print(report)\n",
+ "np.random.seed(42)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# For downstream work, we have to rescale the test dataset in the same way as the training dataset, so that the models, which were trained on rescaled features, receive test inputs on the same scale"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "array2 = Test.values\n",
+ "# separate array into input and output components\n",
+ "X1 = array2[:,0:10]\n",
+ "Y1 = array2[:,10]\n",
+ "scaler = MinMaxScaler(feature_range=(0, 1))\n",
+ "rescaledX1 = scaler.fit_transform(X1)\n",
+ "# summarize transformed data\n",
+ "set_printoptions(precision=3)\n",
+ "print(rescaledX1[0:5,:])\n",
+ "np.random.seed(42)"
+ ]
+ },
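+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## One common way to keep both datasets on the same scale is to fit the scaler on the training features only and then reuse it on the test features. The sketch below shows that pattern on made-up arrays; `train_demo`, `test_demo` and `demo_scaler` are hypothetical names, and both arrays are assumed to hold the same columns in the same order."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "from sklearn.preprocessing import MinMaxScaler\n",
+ "\n",
+ "# made-up arrays standing in for the training and test feature matrices\n",
+ "rng = np.random.default_rng(0)\n",
+ "train_demo = rng.uniform(0, 100, size=(50, 3))\n",
+ "test_demo = rng.uniform(0, 120, size=(10, 3))\n",
+ "\n",
+ "demo_scaler = MinMaxScaler(feature_range=(0, 1))\n",
+ "# learn each column's min and max from the training rows only\n",
+ "train_scaled = demo_scaler.fit_transform(train_demo)\n",
+ "# reuse the training min and max, so equal raw values map to equal scaled values\n",
+ "test_scaled = demo_scaler.transform(test_demo)\n",
+ "\n",
+ "print(train_scaled.min(), train_scaled.max())\n",
+ "# test values can fall outside [0, 1] when they exceed the training range\n",
+ "print(test_scaled.min(), test_scaled.max())"
+ ]
+ },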
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# The following code compares the algorithms consistently, outputting accuracy scores and MCC.\n",
+ "## This code uses the k-fold cross-validation method to evaluate each model. The method partitions the data into k equal-sized subsamples; a single subsample is retained as validation data for testing the model, and the remaining k - 1 are used as training data. The process is repeated k times, with each of the k subsamples used exactly once as the validation data. The k results can then be averaged to produce a single estimate. This tests the model's ability to predict data that was not used to fit it, in order to flag problems like overfitting. \n",
+ "### I also added MCC to the code for better interpretation"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from pandas import read_csv\n",
+ "from matplotlib import pyplot\n",
+ "from sklearn.model_selection import KFold\n",
+ "from sklearn.model_selection import cross_val_score\n",
+ "from sklearn.linear_model import LogisticRegression\n",
+ "from sklearn.tree import DecisionTreeClassifier\n",
+ "from sklearn.ensemble import RandomForestClassifier\n",
+ "from sklearn.neighbors import KNeighborsClassifier\n",
+ "from sklearn.discriminant_analysis import LinearDiscriminantAnalysis\n",
+ "from sklearn.naive_bayes import GaussianNB\n",
+ "from sklearn.svm import SVC\n",
+ "from sklearn.linear_model import SGDClassifier\n",
+ "from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis\n",
+ "from sklearn.neural_network import MLPClassifier\n",
+ "from sklearn.metrics import matthews_corrcoef\n",
+ "\n",
+ "array = Train.values \n",
+ "X = array[:,0:11]\n",
+ "Y = array[:,11]\n",
+ "\n",
+ "# prepare models and add them to a list\n",
+ "models = []\n",
+ "models.append(('LR', LogisticRegression()))\n",
+ "models.append(('LDA', LinearDiscriminantAnalysis()))\n",
+ "models.append(('KNN', KNeighborsClassifier()))\n",
+ "models.append(('CART', DecisionTreeClassifier()))\n",
+ "models.append(('NB', GaussianNB()))\n",
+ "models.append(('SVM', SVC()))\n",
+ "models.append(('RFC', RandomForestClassifier()))\n",
+ "models.append(('SGD', SGDClassifier()))\n",
+ "models.append(('QDA', QuadraticDiscriminantAnalysis()))\n",
+ "models.append(('MLPC', MLPClassifier()))\n",
+ "\n",
+ "# evaluate each model in turn\n",
+ "results = []\n",
+ "names = []\n",
+ "scoring = 'accuracy'\n",
+ "\n",
+ "for name, model in models:\n",
+ "    # shuffle=True is required for random_state to take effect\n",
+ "    kfold = KFold(n_splits=30, shuffle=True, random_state=14)\n",
+ "    cv_results = cross_val_score(model, rescaledX[:,(0,1,2,3,4,5,6,7)], Y, cv=kfold, scoring=scoring)\n",
+ "    results.append(cv_results)\n",
+ "    names.append(name)\n",
+ "    msg = (name, cv_results.mean(), cv_results.std())\n",
+ "    print(msg)\n",
+ "    # note: this MCC is computed on the same rows the model was fit on (a training-set score)\n",
+ "    model.fit(rescaledX[:,(0,1,2,3,4,5,6,7)], Y)\n",
+ "    predicted = model.predict(rescaledX[:,(0,1,2,3,4,5,6,7)])\n",
+ "    MCC = matthews_corrcoef(Y, predicted)\n",
+ "    print(MCC)\n",
+ " \n",
+ "\n",
+ "# boxplot algorithm comparison\n",
+ "fig = pyplot.figure()\n",
+ "fig.suptitle('Algorithm Comparison')\n",
+ "ax = fig.add_subplot(111)\n",
+ "pyplot.boxplot(results)\n",
+ "ax.set_xticklabels(names)\n",
+ "pyplot.show()\n",
+ "np.random.seed(42)"
+ ]
+ },
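+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## The MCC printed in the loop above is measured on rows the model has already seen. A possible alternative, sketched below, is to collect out-of-fold predictions with `cross_val_predict` so that every prediction comes from a model that was not trained on that row. The data here is synthetic and the `*_demo` names are made up for this example; to try it on the real data, `rescaledX[:,(0,1,2,3,4,5,6,7)]` and `Y` could be passed instead."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.datasets import make_classification\n",
+ "from sklearn.metrics import matthews_corrcoef\n",
+ "from sklearn.model_selection import KFold, cross_val_predict\n",
+ "from sklearn.tree import DecisionTreeClassifier\n",
+ "\n",
+ "# synthetic stand-in data (an assumption), not the notebook's dataset\n",
+ "X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)\n",
+ "\n",
+ "demo_model = DecisionTreeClassifier(random_state=0)\n",
+ "demo_cv = KFold(n_splits=10, shuffle=True, random_state=14)\n",
+ "\n",
+ "# each row is predicted by a model trained on the other folds only\n",
+ "oof_pred = cross_val_predict(demo_model, X_demo, y_demo, cv=demo_cv)\n",
+ "print('Out-of-fold MCC:', matthews_corrcoef(y_demo, oof_pred))\n",
+ "\n",
+ "# for contrast: fit and predict on the same rows\n",
+ "demo_model.fit(X_demo, y_demo)\n",
+ "print('Training-set MCC:', matthews_corrcoef(y_demo, demo_model.predict(X_demo)))"
+ ]
+ },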
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# From the above evaluations:\n",
+ "## From the train/test split, KNeighborsClassifier and RandomForestClassifier had the highest accuracy and MCC. From the k-fold cross-validation, GaussianNB and Support Vector Machine had the highest accuracy, while Decision Tree Classifier and RandomForestClassifier had the highest MCC, both equal to 1.0. Note that the MCC in the k-fold cell is computed on the same data the models were fitted on, so a value of 1.0 mainly shows that these tree-based models can reproduce their training labels.\n",
+ "\n",
+ "## In order to determine the most suitable model for prediction, I ran the prediction code using each of the 4 mentioned models and submitted the CSV files in order to compare the scores"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# KNeighborsClassifier"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "model = KNeighborsClassifier()\n",
+ "model.fit(rescaledX[:,(0,1,2,3,4,5,6,7)], Y)\n",
+ "prediction = model.predict(rescaledX1[:,(0,1,2,3,4,5,6,7)])\n",
+ "test_pred = pd.DataFrame(prediction)\n",
+ "test_pred.columns = [\"CLASS\"]\n",
+ "test_pred.index.name = \"Index\"\n",
+ "test_pred['CLASS']= test_pred['CLASS'].map({0.0:False,1.0:True})\n",
+ "\n",
+ "test_pred.to_csv(\"test_pred.csv\")\n",
+ "print(test_pred['CLASS'].unique())\n",
+ "print(test_pred.groupby('CLASS').size()[0].sum())\n",
+ "print (test_pred.groupby('CLASS').size()[1].sum())\n",
+ "test_pred"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# RandomForestClassifier"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Fit on the first 8 rescaled feature columns, then predict the rescaled test set\n",
+ "model = RandomForestClassifier()\n",
+ "model.fit(rescaledX[:, 0:8], Y)\n",
+ "prediction = model.predict(rescaledX1[:, 0:8])\n",
+ "\n",
+ "# Build the submission frame with boolean CLASS labels\n",
+ "test_pred = pd.DataFrame(prediction)\n",
+ "test_pred.columns = [\"CLASS\"]\n",
+ "test_pred.index.name = \"Index\"\n",
+ "test_pred['CLASS'] = test_pred['CLASS'].map({0.0: False, 1.0: True})\n",
+ "\n",
+ "test_pred.to_csv(\"test_pred.csv\")\n",
+ "print(test_pred['CLASS'].unique())\n",
+ "# Count predictions per class (also works if only one class is predicted)\n",
+ "print(test_pred['CLASS'].value_counts())\n",
+ "test_pred"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# GaussianNB"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Fit on the first 8 rescaled feature columns, then predict the rescaled test set\n",
+ "model = GaussianNB()\n",
+ "model.fit(rescaledX[:, 0:8], Y)\n",
+ "prediction = model.predict(rescaledX1[:, 0:8])\n",
+ "\n",
+ "# Build the submission frame with boolean CLASS labels\n",
+ "test_pred = pd.DataFrame(prediction)\n",
+ "test_pred.columns = [\"CLASS\"]\n",
+ "test_pred.index.name = \"Index\"\n",
+ "test_pred['CLASS'] = test_pred['CLASS'].map({0.0: False, 1.0: True})\n",
+ "\n",
+ "test_pred.to_csv(\"test_pred.csv\")\n",
+ "print(test_pred['CLASS'].unique())\n",
+ "# Count predictions per class (also works if only one class is predicted)\n",
+ "print(test_pred['CLASS'].value_counts())\n",
+ "test_pred"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Support Vector Machine"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Fit on the first 8 rescaled feature columns, then predict the rescaled test set\n",
+ "model = SVC()\n",
+ "model.fit(rescaledX[:, 0:8], Y)\n",
+ "prediction = model.predict(rescaledX1[:, 0:8])\n",
+ "\n",
+ "# Build the submission frame with boolean CLASS labels\n",
+ "test_pred = pd.DataFrame(prediction)\n",
+ "test_pred.columns = [\"CLASS\"]\n",
+ "test_pred.index.name = \"Index\"\n",
+ "test_pred['CLASS'] = test_pred['CLASS'].map({0.0: False, 1.0: True})\n",
+ "\n",
+ "test_pred.to_csv(\"test_pred.csv\")\n",
+ "print(test_pred['CLASS'].unique())\n",
+ "# Count predictions per class (also works if only one class is predicted)\n",
+ "print(test_pred['CLASS'].value_counts())\n",
+ "test_pred"
+ ]
+ },
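+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## The four prediction cells above repeat the same steps with a different model each time. As a sketch only, those steps could be wrapped in one helper. The function name `write_submission`, the output file `demo_pred.csv` and the synthetic data from `make_classification` are assumptions made for illustration and are not part of the original notebook."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "from sklearn.datasets import make_classification\n",
+ "from sklearn.naive_bayes import GaussianNB\n",
+ "\n",
+ "def write_submission(model, X_train, y_train, X_test, path):\n",
+ "    # Hypothetical helper: fit, predict and save boolean CLASS labels with an 'Index' column\n",
+ "    model.fit(X_train, y_train)\n",
+ "    pred = pd.DataFrame({'CLASS': model.predict(X_test).astype(bool)})\n",
+ "    pred.index.name = 'Index'\n",
+ "    pred.to_csv(path)\n",
+ "    return pred\n",
+ "\n",
+ "# Synthetic stand-in data (assumption: 8 numeric features, binary class)\n",
+ "X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)\n",
+ "demo_pred = write_submission(GaussianNB(), X_demo[:200], y_demo[:200], X_demo[200:], 'demo_pred.csv')\n",
+ "print(demo_pred['CLASS'].value_counts())"
+ ]
+ },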
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# The best submission score came from GaussianNB, which had the highest accuracy in k-fold cross-validation and a moderate MCC compared with the other models. KNeighborsClassifier and RandomForestClassifier scored lower on submission than GaussianNB even though they had higher accuracy and MCC in the train/test split evaluation.\n",
+ "\n",
+ "# From this I concluded that, for this data set, k-fold cross-validation was a more reliable way of evaluating model performance than a single train/test split, and I therefore chose GaussianNB as my prediction model.\n",
+ "\n",
+ "# I also decided to test whether rescaling the data set actually affected the models, by repeating the k-fold cross-validation on the non-rescaled data to see whether the accuracy or MCC of the chosen model would change."
+ ]
+ },
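+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## A minimal sketch of one way to compare rescaled and non-rescaled data inside cross-validation. Putting `MinMaxScaler` in a pipeline means the scaler is refitted on the training part of every fold, so the held-out fold does not influence the scaling. The synthetic data and the names `X_demo` and `y_demo` are assumptions standing in for the training set. This is not the method used in the cells of this notebook."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.datasets import make_classification\n",
+ "from sklearn.model_selection import KFold, cross_val_score\n",
+ "from sklearn.naive_bayes import GaussianNB\n",
+ "from sklearn.neighbors import KNeighborsClassifier\n",
+ "from sklearn.pipeline import make_pipeline\n",
+ "from sklearn.preprocessing import MinMaxScaler\n",
+ "\n",
+ "# Synthetic stand-in for the training data (assumption)\n",
+ "X_demo, y_demo = make_classification(n_samples=300, n_features=11, random_state=0)\n",
+ "kfold = KFold(n_splits=10, shuffle=True, random_state=14)\n",
+ "\n",
+ "for name, clf in [('NB', GaussianNB()), ('KNN', KNeighborsClassifier())]:\n",
+ "    raw = cross_val_score(clf, X_demo, y_demo, cv=kfold, scoring='accuracy')\n",
+ "    # The pipeline refits the scaler on the training part of every fold\n",
+ "    scaled_clf = make_pipeline(MinMaxScaler(), clf)\n",
+ "    scaled = cross_val_score(scaled_clf, X_demo, y_demo, cv=kfold, scoring='accuracy')\n",
+ "    print(name, 'raw:', raw.mean(), 'rescaled:', scaled.mean())"
+ ]
+ },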
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from pandas import read_csv\n",
+ "from matplotlib import pyplot\n",
+ "from sklearn.model_selection import KFold\n",
+ "from sklearn.model_selection import cross_val_score\n",
+ "from sklearn.linear_model import LogisticRegression\n",
+ "from sklearn.tree import DecisionTreeClassifier\n",
+ "from sklearn.ensemble import RandomForestClassifier\n",
+ "from sklearn.neighbors import KNeighborsClassifier\n",
+ "from sklearn.discriminant_analysis import LinearDiscriminantAnalysis\n",
+ "from sklearn.naive_bayes import GaussianNB\n",
+ "from sklearn.svm import SVC\n",
+ "from sklearn.linear_model import SGDClassifier\n",
+ "from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis\n",
+ "from sklearn.neural_network import MLPClassifier\n",
+ "from sklearn.metrics import matthews_corrcoef\n",
+ "\n",
+ "array = Train.values \n",
+ "X = array[:,0:11]\n",
+ "Y = array[:,11]\n",
+ "\n",
+ "# prepare models and add them to a list\n",
+ "models = []\n",
+ "models.append(('LR', LogisticRegression()))\n",
+ "models.append(('LDA', LinearDiscriminantAnalysis()))\n",
+ "models.append(('KNN', KNeighborsClassifier()))\n",
+ "models.append(('CART', DecisionTreeClassifier()))\n",
+ "models.append(('NB', GaussianNB()))\n",
+ "models.append(('SVM', SVC()))\n",
+ "models.append(('RFC', RandomForestClassifier()))\n",
+ "models.append(('SGD', SGDClassifier()))\n",
+ "models.append(('QDA', QuadraticDiscriminantAnalysis()))\n",
+ "models.append(('MLPC', MLPClassifier()))\n",
+ "\n",
+ "# evaluate each model in turn\n",
+ "results = []\n",
+ "names = []\n",
+ "scoring = 'accuracy'\n",
+ "\n",
+ "for name, model in models:\n",
+ "    # shuffle=True is needed for random_state to have an effect; recent scikit-learn versions raise an error without it\n",
+ "    kfold = KFold(n_splits=30, shuffle=True, random_state=14)\n",
+ "    cv_results = cross_val_score(model, X[:, 0:8], Y, cv=kfold, scoring=scoring)\n",
+ "    results.append(cv_results)\n",
+ "    names.append(name)\n",
+ "    msg = (name, cv_results.mean(), cv_results.std())\n",
+ "    print(msg)\n",
+ "    # MCC on the rows the model was fitted on (not a held-out estimate)\n",
+ "    model.fit(X[:, 0:8], Y)\n",
+ "    predicted = model.predict(X[:, 0:8])\n",
+ "    MCC = matthews_corrcoef(Y, predicted)\n",
+ "    print(MCC)\n",
+ " \n",
+ "\n",
+ "# boxplot algorithm comparison\n",
+ "fig = pyplot.figure()\n",
+ "fig.suptitle('Algorithm Comparison')\n",
+ "ax = fig.add_subplot(111)\n",
+ "pyplot.boxplot(results)\n",
+ "ax.set_xticklabels(names)\n",
+ "pyplot.show()\n",
+ "np.random.seed(42)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Evaluating on the non-rescaled data did not change the accuracy or MCC of GaussianNB or of most other models. It did change the following:\n",
+ "## - KNeighborsClassifier: accuracy and MCC both decreased\n",
+ "## - DecisionTreeClassifier: accuracy decreased\n",
+ "## - RandomForestClassifier: accuracy decreased\n",
+ "## - Support Vector Machine: accuracy and MCC both decreased\n",
+ "## - Stochastic Gradient Descent: accuracy and MCC both decreased\n",
+ "## - MLPClassifier: MCC increased\n",
+ "\n",
+ "# This prompted me to also make a prediction with the non-rescaled data to see whether the submission score would change."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Fit on the first 8 non-rescaled feature columns, then predict the non-rescaled test set\n",
+ "model = GaussianNB()\n",
+ "model.fit(X[:, 0:8], Y)\n",
+ "prediction = model.predict(X1[:, 0:8])\n",
+ "\n",
+ "# Build the submission frame with boolean CLASS labels\n",
+ "test_pred = pd.DataFrame(prediction)\n",
+ "test_pred.columns = [\"CLASS\"]\n",
+ "test_pred.index.name = \"Index\"\n",
+ "test_pred['CLASS'] = test_pred['CLASS'].map({0.0: False, 1.0: True})\n",
+ "\n",
+ "test_pred.to_csv(\"test_pred.csv\")\n",
+ "print(test_pred['CLASS'].unique())\n",
+ "# Count predictions per class (also works if only one class is predicted)\n",
+ "print(test_pred['CLASS'].value_counts())\n",
+ "test_pred"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# With this prediction, the submission score was slightly higher than with the rescaled data.\n",
+ "## I also found that the more features were selected, the higher the accuracy of the models and the higher the submission score.\n",
+ "## The final test was to see whether the score would improve if all features were used for both training and prediction."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Fit without feature selection; X[:, 0:10] must have the same number of columns as X1\n",
+ "model = GaussianNB()\n",
+ "model.fit(X[:, 0:10], Y)\n",
+ "prediction = model.predict(X1)\n",
+ "\n",
+ "# Build the submission frame with boolean CLASS labels\n",
+ "test_pred = pd.DataFrame(prediction)\n",
+ "test_pred.columns = [\"CLASS\"]\n",
+ "test_pred.index.name = \"Index\"\n",
+ "test_pred['CLASS'] = test_pred['CLASS'].map({0.0: False, 1.0: True})\n",
+ "\n",
+ "test_pred.to_csv(\"test_pred.csv\")\n",
+ "print(test_pred['CLASS'].unique())\n",
+ "# Count predictions per class (also works if only one class is predicted)\n",
+ "print(test_pred['CLASS'].value_counts())\n",
+ "test_pred"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# The highest submission score was obtained from the prediction made without feature selection."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# In conclusion, for this data set k-fold cross-validation was a better guide to submission performance than a single train/test split. Accuracy or MCC alone was also not a sufficient measure of model quality: the models with the highest accuracy on the train/test split did not perform best on submission. The best model for this data set was GaussianNB, which is consistent with the earlier remark that it works well in most real-world scenarios. An MCC of 1.0 indicates perfect agreement between actual and predicted classes, so we would expect KNeighborsClassifier and RandomForestClassifier to predict best. However, they did not perform as expected on submission, which can be put down to a high rate of false positives and false negatives on the unseen data.\n",
+ "# Therefore several metrics, such as precision and F1-score, should be considered before concluding which model is best."
+ ]
+ },
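+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## As a sketch of the multi-metric comparison suggested above, `cross_validate` can report accuracy, precision, recall, F1 and MCC on held-out folds in one pass. The synthetic data from `make_classification` and the names `X_demo` and `y_demo` are assumptions standing in for the training set. No results from the real data are implied."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.datasets import make_classification\n",
+ "from sklearn.metrics import make_scorer, matthews_corrcoef\n",
+ "from sklearn.model_selection import KFold, cross_validate\n",
+ "from sklearn.naive_bayes import GaussianNB\n",
+ "\n",
+ "# Synthetic stand-in for the training data (assumption)\n",
+ "X_demo, y_demo = make_classification(n_samples=300, n_features=11, random_state=0)\n",
+ "\n",
+ "# Built-in scorer names plus MCC wrapped with make_scorer\n",
+ "scoring = {\n",
+ "    'accuracy': 'accuracy',\n",
+ "    'precision': 'precision',\n",
+ "    'recall': 'recall',\n",
+ "    'f1': 'f1',\n",
+ "    'mcc': make_scorer(matthews_corrcoef),\n",
+ "}\n",
+ "kfold = KFold(n_splits=10, shuffle=True, random_state=14)\n",
+ "scores = cross_validate(GaussianNB(), X_demo, y_demo, cv=kfold, scoring=scoring)\n",
+ "\n",
+ "# Mean of each metric over the held-out folds\n",
+ "for metric in scoring:\n",
+ "    print(metric, scores['test_' + metric].mean())"
+ ]
+ },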
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Literature:\n",
+ "## -wikipedia\n",
+ "## -class notes\n",
+ "## -https://scikit-learn.org/stable/supervised_learning.html#supervised-learning\n",
+ "## -https://medium.com/datadriveninvestor/classification-algorithms-in-machine-learning-85c0ab65ff4\n",
+ "## -https://www.geeksforgeeks.org/ml-stochastic-gradient-descent-sgd/\n",
+ "## -https://data-flair.training/blogs/machine-learning-classification-algorithms/\n",
+ "## -https://www.datascienceblog.net/post/machine-learning/linear-discriminant-analysis/\n",
+ "## -https://towardsdatascience.com/data-visualization-for-machine-learning-and-data-science-a45178970be7\n",
+ "## -https://seaborn.pydata.org/generated/seaborn.distplot.html\n",
+ "## -https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-019-6413-7"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.4"
+ },
+ "varInspector": {
+ "cols": {
+ "lenName": 16,
+ "lenType": 16,
+ "lenVar": 40
+ },
+ "kernels_config": {
+ "python": {
+ "delete_cmd_postfix": "",
+ "delete_cmd_prefix": "del ",
+ "library": "var_list.py",
+ "varRefreshCmd": "print(var_dic_list())"
+ },
+ "r": {
+ "delete_cmd_postfix": ") ",
+ "delete_cmd_prefix": "rm(",
+ "library": "var_list.r",
+ "varRefreshCmd": "cat(var_dic_list()) "
+ }
+ },
+ "types_to_exclude": [
+ "module",
+ "function",
+ "builtin_function_or_method",
+ "instance",
+ "_Feature"
+ ],
+ "window_display": false
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
From 6a2bd9f3f08aafbe885a7eba270cdf25c0b58980 Mon Sep 17 00:00:00 2001
From: Tindi-Kester-Bevin-Bataringaya
<61576802+Tindi-Kester-Bevin-Bataringaya@users.noreply.github.com>
Date: Mon, 16 Mar 2020 15:07:21 +0300
Subject: [PATCH 5/5] Add files via upload
---
Assignment Colab/Tindi Kester Bevin 3.ipynb | 1 +
1 file changed, 1 insertion(+)
create mode 100644 Assignment Colab/Tindi Kester Bevin 3.ipynb
diff --git a/Assignment Colab/Tindi Kester Bevin 3.ipynb b/Assignment Colab/Tindi Kester Bevin 3.ipynb
new file mode 100644
index 0000000..be071b3
--- /dev/null
+++ b/Assignment Colab/Tindi Kester Bevin 3.ipynb
@@ -0,0 +1 @@
+{"cells":[{"metadata":{},"cell_type":"markdown","source":"# STUDENT; Tindi Kester Bevin Bataringaya\n# Registration number; (2019/HD07/24822U)"},{"metadata":{"_uuid":"8f2839f25d086af736a60e9eeb907d3b93b6e0e5","_cell_guid":"b1076dfc-b9ad-4769-8c92-a6c4dae69d19","trusted":true},"cell_type":"code","source":"# This Python 3 environment comes with many helpful analytics libraries installed\n# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python\n# For example, here's several helpful packages to load in \n\nimport numpy as np # linear algebra\nimport pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\n\n# Input data files are available in the \"../input/\" directory.\n# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory\n\nimport os\nfor dirname, _, filenames in os.walk('/kaggle/input'):\n for filename in filenames:\n print(os.path.join(dirname, filename))\n\n# Any results you write to the current directory are saved as output.","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"#import the necessary libraries you are going to use\nimport warnings\nwarnings.filterwarnings('ignore')\n\n# -----> Put your code here below:\n\n\nimport pandas as pd\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport seaborn as sns","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Now loading the datasets"},{"metadata":{"trusted":true},"cell_type":"code","source":"Train = pd.read_csv(\"/kaggle/input/amp-data-set/AMP_TrainSet.csv\")\nTest = pd.read_csv(\"/kaggle/input/amp-data-set/Test.csv\")\n\n#the code loads the datasets into the environment haveing specified the path from which the datasets are to be pulled","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## At this point we are carrying out exploratory data analysis(EDA) which summarizes the main 
characteristics in the data set. It is an approach to analyzing data sets to summarize their main characteristics often with visuals. It's purpose is to suggest hypotheses about the causes of observed phenomena by assessing assumption son which statistical inferences will be based. It provides the basis for further data collection. In order to draw reliable conclusions from massive amounts of data, we must carefully and methodically look through the data which is the reason for EDA"},{"metadata":{},"cell_type":"markdown","source":"# Checking dimensions of the datasets\n\n## We do this to identify how many features we are dealing with at this stage of EDA in order to prepare better for the next steps"},{"metadata":{"trusted":true},"cell_type":"code","source":"# check the dimensions of your data\n\nTrain.shape, Test.shape\n\n#This command enables us to know the dimensions of our datasets which are basically the rows and columns that are contained in them.","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## The Train dataset has 3038 rows and 12 columns while the test dataset has 758 rows and 11 columns "},{"metadata":{},"cell_type":"markdown","source":"# **Checking out the datasets**\n## The purpose of this command is to check whether we loaded the dataset properly or to be sure that it is the correct dataset"},{"metadata":{"trusted":true},"cell_type":"code","source":"Train.head(10)\n#checks the first ten rows of the training dataset to view what kind of data to expect","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"Test.head(10)\n#checks the first ten rows of the test dataset to view what kind of data to expect","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## The command showed us what data we are dealing with in the datasets"},{"metadata":{},"cell_type":"markdown","source":"## We need to determine the data type we are working with in the 
dataset. This is important because most machine learning algorithms work with numerical data. The purpose of this is to identify whether we have numerical data(integers or floats) or categorical data that needs further coding to be manipulated in the algorithm"},{"metadata":{"trusted":true},"cell_type":"code","source":"Train.dtypes, Test.dtypes\n#the code checks to see the type of data that we have in each dataset","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## From the above code we found that in both datasets, the only data types are floats and integers"},{"metadata":{"trusted":true},"cell_type":"code","source":"Train.isnull().sum(), Test.isnull().sum()\n#the code above checks whethere there is any missing value in the dataset","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"\n\n## From the code above we found that there is no missing data in any of the datasets"},{"metadata":{},"cell_type":"markdown","source":"# **Descriptive statistics**\n\n## These quantitatively describe features of a dataset taht we aim to summarize, organize and clean. They are used to describe the data before feeding it into the machine learning model using features and sample sets."},{"metadata":{"trusted":true},"cell_type":"code","source":"Train.describe()\n\n#code to provide the descriptive statistcis of the dataset","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"Test.describe()\n\n#code to provide the descriptive statistcis of the dataset","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### The above code showed us the descriptive statistics of the datasets which will be useful going forward. From these statistics we learned that some values in the different variables are much higher than others such as FULL_DAYM780201 has a maximum value of 102.929000 while FULL_GEOR030101 has a maximum value of 1.182000. 
FULL_OOBM850104 has negative values and even has a negative mean while all other variables have positive values. These descriptive statistics indicate that there may be a need for the data to be transformed possibly by rescaling."},{"metadata":{},"cell_type":"markdown","source":"## It is important that the CLASS variable is balanced beacuse algorithms tend to favor the class with the largest proportion of observations which may lead to misleading aaccuracies especially if the classes are rare."},{"metadata":{"trusted":true},"cell_type":"code","source":"Train.groupby('CLASS').size().plot(kind='bar')\n#code indicates whether the categorical values are imbalanced or not ","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## From the plot we can see that distribution in the variable CLASS is balanced"},{"metadata":{},"cell_type":"markdown","source":"# **Determining correlation between variables**\n\n## Correlation explains the extent of the relationship between the features of the data. 
This is important in case one feature has a relationship with another and could provide valuable information for making sense of another during the prediction stage"},{"metadata":{"trusted":true},"cell_type":"code","source":"Train.corr(method='pearson')","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"plt.figure(figsize=(6,6))\nsns.heatmap(Train.corr(method='pearson'))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"Train.corr(method= 'pearson')['CLASS']","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## The plot below is of histograms of the dataset and this shows us the distribution of each variable in the dataset\n\n## Histograms summarize and display the distribution of the variables in the dataset, identify skewness of data and if need be identifies which features should be modified before inputting in the model"},{"metadata":{"trusted":true},"cell_type":"code","source":"plt.figure(figsize=(24,24))\nTrain.hist()\nplt.show()","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## The plots below show presence or absence of outliers in each variable\n\n## Box plots provide a standardized way of displaying distribution of data in terms of min,max,upper quartile,lower quartile and median. Outliers are numerically distant from the rest of the data. 
They may contain valuable information or not but they tend to skew data away from a normal distribution"},{"metadata":{"trusted":true},"cell_type":"code","source":"Train.plot(kind='box', subplots=True, layout=(6,2), sharex=False, sharey=False)\nplt.show()","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## The plot below shows us the distribution of each variable compared with one another\n\n## The scatter plots illustrate a relationship between each variable with each other which can be positive or negative"},{"metadata":{"trusted":true},"cell_type":"code","source":"sns.pairplot(Train)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# **Distribution plot**\n\n## A univariate plot to know about the distribution of data when analyzing effect on dependent variable with respective to a single feature"},{"metadata":{"trusted":true},"cell_type":"code","source":"sns.FacetGrid(Train,size=11).map(sns.distplot,'CLASS').add_legend()","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# **Violin plot**\n\n## This is a combination of a Box plot at the middle and distribution plots on both side of the data which gives us the details of distribution. This plot is used to visualizeddistribution of the data and it's probability density. A violin plot contains all data points and is an excellent tool to visualize samples."},{"metadata":{"trusted":true},"cell_type":"code","source":"sns.violinplot(x='CLASS',y='NT_EFC195',data=Train)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## From the above plot we see that distribution of NT_EFC195 around CLASS 0 has data skewed to the left with a small cluster of large data to the right. Distribution around CLASS 1 appears more normal though there is also a cluster to the right. These clusters are pulling the means in each class higher. The median represented by the white circle at the centre. 
The median is higher in CLASS 0 than CLASS 1 for NT_EFC195"},{"metadata":{},"cell_type":"markdown","source":"# **Classification**\n\n## From the statistical descriptions we saw that some variables had negative values and thus i decided to work the data for both rescaled data and the same data to see if these values will affect our final outcome"},{"metadata":{},"cell_type":"markdown","source":"# **Rescaling the data**\n\n## As the data scales vary in the dataset, rescaling is useful for optimization algorithms"},{"metadata":{"trusted":true},"cell_type":"code","source":"from numpy import set_printoptions\nfrom sklearn.preprocessing import MinMaxScaler\n\narray = Train.values\n#separate array into input and output components\nX = array[:,0:11]\nY = array[:,11]\nscaler = MinMaxScaler(feature_range=(0, 1))\nrescaledX = scaler.fit_transform(X)\n# summarize transformed data\n#set_printoptions(precision=3)\nprint(rescaledX[0:5,:])\nnp.random.seed(42)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# **Feature selection**\n\n## This is the process of reducing the number of input variables when developing a predictive model. It is important as it reduces computational cost of modelling and often improves the purpose of the model by only selecting the features that are useful for the model that contribute most to the prediction variable"},{"metadata":{},"cell_type":"markdown","source":"## For this i chose the Recursive Feature Elimination method(RFE) which uses model accuracy to identify which attributes contribute the most to predicting the target attribute. This method fits a model and removes the weakest feature or features until the specified number of features is reached. 
This helped eliminate the weakest features for prediction."},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.feature_selection import RFE\nfrom sklearn.linear_model import LogisticRegression\n\narray_1 = Train.values\nX = array_1[:,0:11]\nY = array_1[:,11]\n# feature extraction\n#array = Train.values\n# separate array into input and output components\nmodel = LogisticRegression()\nrfe = RFE(model, 8)\nfit = rfe.fit(rescaledX, Y)\nprint(\"Num Features: \", fit.n_features_)\nprint(\"Selected Features:\", fit.support_)\nprint(\"Feature Ranking: \", fit.ranking_)\nnp.random.seed(42)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# **Evaluating the performance of classification models**\n\n## The purpose is to know how well an algorithm performs on unseen data and thus be able to predict more accurately on your required data. This prevents overfitting which would occur if the algorithm is trained on the same data that it will test leaving you with perfect scores that are unrealistic."},{"metadata":{},"cell_type":"markdown","source":"# **Split into train and test data**\n## This method separates your training data into a training and testing data set from which we can use a model and determine the accuracy of prediction. This method enables the use of the training set to build and train your model and test the accuracy of the model on the same dataset without using the actual data you want to predict on. So as to see how well it performs on any data with the highest accuracy. 
"},{"metadata":{},"cell_type":"markdown","source":"# **KNeighborsClassifier model**\n## This classifier implements the k-nearest neighbors vote by assigning weights to the contributions of neighbours such that the nearer neighbours contribute more to the average than the distant ones"},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.model_selection import train_test_split\nfrom sklearn.neighbors import KNeighborsClassifier\n\ntest_size = 0.33\nseed = 7\nX_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\nrandom_state=seed)\nmodel = KNeighborsClassifier()\nmodel.fit(X_train, Y_train)\nresult = model.score(X_test, Y_test)\nprint(\"Accuracy: \", (result*100.0))\nnp.random.seed(42)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# **Logistic Regression**\n## Describes data and explains the relationship between one variable and the other"},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.model_selection import train_test_split\nfrom sklearn.linear_model import LogisticRegression\n\ntest_size = 0.33\nseed = 7\nX_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\nrandom_state=seed)\nmodel = LogisticRegression()\nmodel.fit(X_train, Y_train)\nresult = model.score(X_test, Y_test)\nprint(\"Accuracy: \", (result*100.0))\nnp.random.seed(42)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# **GaussianNB**\n## Bases on applying Bayes’ theorem with the assumption of conditional independence between every pair of features given the value of the class variable. 
This model works well in most real world scenarios."},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.model_selection import train_test_split\nfrom sklearn.naive_bayes import GaussianNB\n\ntest_size = 0.33\nseed = 7\nX_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\nrandom_state=seed)\nmodel = GaussianNB()\nmodel.fit(X_train, Y_train)\nresult = model.score(X_test, Y_test)\nprint(\"Accuracy: \", (result*100.0))\nnp.random.seed(42)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# **Linear Discriminant Analysis**\n## This is a dimensionality reduction technique that reduces the dimensions while retaining as much information as possible"},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.model_selection import train_test_split\nfrom sklearn.discriminant_analysis import LinearDiscriminantAnalysis\n\ntest_size = 0.33\nseed = 7\nX_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\nrandom_state=seed)\nmodel = LinearDiscriminantAnalysis()\nmodel.fit(X_train, Y_train)\nresult = model.score(X_test, Y_test)\nprint(\"Accuracy: \", (result*100.0))\nnp.random.seed(42)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# **Quadratic Discrimination Analysis**\n\n## A variation of LDA that is useful if there is prior knowledge that individual classes exhibit distinct covariance. QDA is less strict and allows for differing covariance for different classes. 
QDA is flexible and hence can lead to an improved prediction performance"},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.model_selection import train_test_split\nfrom sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis\n\ntest_size = 0.33\nseed = 7\nX_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\nrandom_state=seed)\nmodel = QuadraticDiscriminantAnalysis()\nmodel.fit(X_train, Y_train)\nresult = model.score(X_test, Y_test)\nprint(\"Accuracy: \", (result*100.0))\nnp.random.seed(42)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# **Stochastic Gradient Descent**\n\n## A classification method used to find values of the parameters of a function minimizing the cost as much as possible. Stochastic implies that the process is linked with random probability where a few samples are selected at random rather than the whole data set. This method is considered because it is computationally fast as it only works on one sample at a time. 
It is also converges faster for larger datasets as it causes updates to the parameters more frequently"},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.model_selection import train_test_split\nfrom sklearn.linear_model import SGDClassifier\n\ntest_size = 0.33\nseed = 7\nX_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\nrandom_state=seed)\nmodel = SGDClassifier()\nmodel.fit(X_train, Y_train)\nresult = model.score(X_test, Y_test)\nprint(\"Accuracy: \", (result*100.0))\nnp.random.seed(42)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# **Decision Tree classifier**\n## It is a predictive modeling approach that uses a decision tree to go from observations about an item to conclusions about the item's target value"},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.model_selection import train_test_split\nfrom sklearn.tree import DecisionTreeClassifier\n\ntest_size = 0.33\nseed = 7\nX_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\nrandom_state=seed)\nmodel = DecisionTreeClassifier()\nmodel.fit(X_train, Y_train)\nresult = model.score(X_test, Y_test)\nprint(\"Accuracy: \", (result*100.0))\nnp.random.seed(42)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# **Random Forest Classifier**\n\n## This is a model that grows multiple trees and classifies objects based on votes votes of all the trees. 
It reduces the problem of overfitting or high bias"},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.model_selection import train_test_split\nfrom sklearn.ensemble import RandomForestClassifier\n\ntest_size = 0.33\nseed = 7\nX_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\nrandom_state=seed)\nmodel = RandomForestClassifier()\nmodel.fit(X_train, Y_train)\nresult = model.score(X_test, Y_test)\nprint(\"Accuracy: \", (result*100.0))\nnp.random.seed(42)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# **Support Vector Machine**\n## supervised machine learning model that uses classification algorithms for two-group classification problems"},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.model_selection import train_test_split\nfrom sklearn.svm import SVC\n\ntest_size = 0.33\nseed = 7\nX_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\nrandom_state=seed)\nmodel = SVC()\nmodel.fit(X_train, Y_train)\nresult = model.score(X_test, Y_test)\nprint(\"Accuracy: \", (result*100.0))\nnp.random.seed(42)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# **MLPC Classifier**\n\n## This is a multi-layer perceptron which utilizes supervized learning method called back propagation for training and can distinguish data that is not linearly seperable"},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.model_selection import train_test_split\nfrom sklearn.neural_network import MLPClassifier\n\ntest_size = 0.33\nseed = 7\nX_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\nrandom_state=seed)\nmodel = MLPClassifier()\nmodel.fit(X_train, Y_train)\nresult = model.score(X_test, Y_test)\nprint(\"Accuracy: \", 
(result*100.0))\nnp.random.seed(42)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# **Confusion matrix**\n## This is a table that is often used to describe the performance of a classification model on a set of test data for which the true values are known. It allows the visualization of the performance of an algorithm. The ideal matrix has the false positives and false negatives equal to 0, indicating perfect performance. \n\n## I added code to compute the Matthews correlation coefficient (MCC), which is a measure of the quality of the classification and often a more reliable indicator than accuracy of which model actually performs better."},{"metadata":{},"cell_type":"markdown","source":"## **KNeighborsClassifier**"},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.metrics import confusion_matrix\nfrom sklearn.metrics import matthews_corrcoef\n\ntest_size = 0.33\nseed = 7\nX_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\nrandom_state=seed)\n\nmodel = KNeighborsClassifier()\nmodel.fit(X_train, Y_train)\n\npredicted = model.predict(X_test)\nMCC = matthews_corrcoef(Y_test, predicted)\nmatrix = confusion_matrix(Y_test, predicted)\nprint(matrix)\nprint(MCC)\nnp.random.seed(42)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## **Logistic Regression**"},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.metrics import confusion_matrix\nfrom sklearn.metrics import matthews_corrcoef\n\ntest_size = 0.33\nseed = 7\nX_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\nrandom_state=seed)\n\nmodel = LogisticRegression()\nmodel.fit(X_train, Y_train)\n\npredicted = model.predict(X_test)\nmatrix = confusion_matrix(Y_test, predicted)\nMCC = matthews_corrcoef(Y_test, 
predicted)\nprint(matrix)\nprint(MCC)\nnp.random.seed(42)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## **GaussianNB**"},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.metrics import confusion_matrix\nfrom sklearn.metrics import matthews_corrcoef\n\ntest_size = 0.33\nseed = 7\nX_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\nrandom_state=seed)\n\nmodel = GaussianNB()\nmodel.fit(X_train, Y_train)\n\npredicted = model.predict(X_test)\nmatrix = confusion_matrix(Y_test, predicted)\nMCC = matthews_corrcoef(Y_test, predicted)\nprint(matrix)\nprint(MCC)\nnp.random.seed(42)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# **Linear Discriminant Analysis**"},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.metrics import confusion_matrix\nfrom sklearn.metrics import matthews_corrcoef\n\ntest_size = 0.33\nseed = 7\nX_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\nrandom_state=seed)\n\nmodel = LinearDiscriminantAnalysis()\nmodel.fit(X_train, Y_train)\n\npredicted = model.predict(X_test)\nmatrix = confusion_matrix(Y_test, predicted)\nMCC = matthews_corrcoef(Y_test, predicted)\nprint(matrix)\nprint(MCC)\nnp.random.seed(42)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# **Quadratic Discriminant Analysis**"},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.metrics import confusion_matrix\nfrom sklearn.metrics import matthews_corrcoef\n\ntest_size = 0.33\nseed = 7\nX_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\nrandom_state=seed)\n\nmodel = QuadraticDiscriminantAnalysis()\nmodel.fit(X_train, Y_train)\n\npredicted = model.predict(X_test)\nmatrix = confusion_matrix(Y_test, predicted)\nMCC = matthews_corrcoef(Y_test, 
predicted)\nprint(matrix)\nprint(MCC)\nnp.random.seed(42)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# **Stochastic Gradient Descent**"},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.metrics import confusion_matrix\nfrom sklearn.metrics import matthews_corrcoef\n\ntest_size = 0.33\nseed = 7\nX_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\nrandom_state=seed)\n\nmodel = SGDClassifier()\nmodel.fit(X_train, Y_train)\n\npredicted = model.predict(X_test)\nmatrix = confusion_matrix(Y_test, predicted)\nMCC = matthews_corrcoef(Y_test, predicted)\nprint(matrix)\nprint(MCC)\nnp.random.seed(42)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# **Decision Tree Classifier**"},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.metrics import confusion_matrix\nfrom sklearn.metrics import matthews_corrcoef\n\ntest_size = 0.33\nseed = 7\nX_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\nrandom_state=seed)\n\nmodel = DecisionTreeClassifier()\nmodel.fit(X_train, Y_train)\n\npredicted = model.predict(X_test)\nmatrix = confusion_matrix(Y_test, predicted)\nMCC = matthews_corrcoef(Y_test, predicted)\nprint(matrix)\nprint(MCC)\nnp.random.seed(42)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# **Random Forest Classifier**"},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.metrics import confusion_matrix\nfrom sklearn.metrics import matthews_corrcoef\n\ntest_size = 0.33\nseed = 7\nX_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\nrandom_state=seed)\n\nmodel = RandomForestClassifier()\nmodel.fit(X_train, Y_train)\n\npredicted = model.predict(X_test)\nmatrix = confusion_matrix(Y_test, predicted)\nMCC = matthews_corrcoef(Y_test, 
predicted)\nprint(matrix)\nprint(MCC)\nnp.random.seed(42)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# **Support Vector Machine**"},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.metrics import confusion_matrix\nfrom sklearn.metrics import matthews_corrcoef\n\n\ntest_size = 0.33\nseed = 7\nX_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\nrandom_state=seed)\n\nmodel = SVC()\nmodel.fit(X_train, Y_train)\n\npredicted = model.predict(X_test)\nmatrix = confusion_matrix(Y_test, predicted)\nMCC = matthews_corrcoef(Y_test, predicted)\nprint(matrix)\nprint(MCC)\nnp.random.seed(42)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# **MLPC Classifier**"},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.metrics import confusion_matrix\nfrom sklearn.metrics import matthews_corrcoef\n\n\ntest_size = 0.33\nseed = 7\nX_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\nrandom_state=seed)\n\nmodel = MLPClassifier()\nmodel.fit(X_train, Y_train)\n\npredicted = model.predict(X_test)\nmatrix = confusion_matrix(Y_test, predicted)\nMCC = matthews_corrcoef(Y_test, predicted)\nprint(matrix)\nprint(MCC)\nnp.random.seed(42)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# **Classification Report**\n\n## This is a convenient report that provides precision, recall, F1-score and support for each class. 
This helps to provide a quick idea of the accuracy of the model \n"},{"metadata":{},"cell_type":"markdown","source":"# **KNeighborsClassifier**"},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.metrics import classification_report\n\ntest_size = 0.33\nseed = 7\nX_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\nrandom_state=seed)\nmodel = KNeighborsClassifier()\nmodel.fit(X_train, Y_train)\npredicted = model.predict(X_test)\nreport = classification_report(Y_test, predicted)\nprint(report)\nnp.random.seed(42)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# **Logistic Regression**"},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.metrics import classification_report\n\ntest_size = 0.33\nseed = 7\nX_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\nrandom_state=seed)\nmodel = LogisticRegression()\nmodel.fit(X_train, Y_train)\npredicted = model.predict(X_test)\nreport = classification_report(Y_test, predicted)\nprint(report)\nnp.random.seed(42)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# **GaussianNB**"},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.metrics import classification_report\n\ntest_size = 0.33\nseed = 7\nX_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\nrandom_state=seed)\nmodel = GaussianNB()\nmodel.fit(X_train, Y_train)\npredicted = model.predict(X_test)\nreport = classification_report(Y_test, predicted)\nprint(report)\nnp.random.seed(42)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# **Linear Discriminant Analysis**"},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.metrics import classification_report\n\ntest_size = 0.33\nseed = 7\nX_train, X_test, Y_train, Y_test = 
train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\nrandom_state=seed)\nmodel = LinearDiscriminantAnalysis()\nmodel.fit(X_train, Y_train)\npredicted = model.predict(X_test)\nreport = classification_report(Y_test, predicted)\nprint(report)\nnp.random.seed(42)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# **Quadratic Discriminant Analysis**"},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.metrics import classification_report\n\ntest_size = 0.33\nseed = 7\nX_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\nrandom_state=seed)\nmodel = QuadraticDiscriminantAnalysis()\nmodel.fit(X_train, Y_train)\npredicted = model.predict(X_test)\nreport = classification_report(Y_test, predicted)\nprint(report)\nnp.random.seed(42)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# **Stochastic Gradient Descent**"},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.metrics import classification_report\n\ntest_size = 0.33\nseed = 7\nX_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\nrandom_state=seed)\nmodel = SGDClassifier()\nmodel.fit(X_train, Y_train)\npredicted = model.predict(X_test)\nreport = classification_report(Y_test, predicted)\nprint(report)\nnp.random.seed(42)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# **Decision Tree Classifier**"},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.metrics import classification_report\n\ntest_size = 0.33\nseed = 7\nX_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\nrandom_state=seed)\nmodel = DecisionTreeClassifier()\nmodel.fit(X_train, Y_train)\npredicted = model.predict(X_test)\nreport = classification_report(Y_test, 
predicted)\nprint(report)\nnp.random.seed(42)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# **Random Forest Classifier**"},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.metrics import classification_report\n\ntest_size = 0.33\nseed = 7\nX_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\nrandom_state=seed)\nmodel = RandomForestClassifier()\nmodel.fit(X_train, Y_train)\npredicted = model.predict(X_test)\nreport = classification_report(Y_test, predicted)\nprint(report)\nnp.random.seed(42)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# **Support Vector Machine**"},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.metrics import classification_report\n\ntest_size = 0.33\nseed = 7\nX_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\nrandom_state=seed)\nmodel = SVC()\nmodel.fit(X_train, Y_train)\npredicted = model.predict(X_test)\nreport = classification_report(Y_test, predicted)\nprint(report)\nnp.random.seed(42)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# **MLPC Classifier**"},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.metrics import classification_report\n\ntest_size = 0.33\nseed = 7\nX_train, X_test, Y_train, Y_test = train_test_split(rescaledX[:,(0,1,2,3,4,5,6,7)], Y, test_size=test_size,\nrandom_state=seed)\nmodel = MLPClassifier()\nmodel.fit(X_train, Y_train)\npredicted = model.predict(X_test)\nreport = classification_report(Y_test, predicted)\nprint(report)\nnp.random.seed(42)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## For downstream work, we have to rescale the test dataset as well, so that the model's predictions on it are accurate, since the training data set was 
rescaled"},{"metadata":{"trusted":true},"cell_type":"code","source":"array2 = Test.values\n# separate array into input and output components\nX1 = array2[:,0:10]\nY1 = array2[:,10]\nscaler = MinMaxScaler(feature_range=(0, 1))\nrescaledX1 = scaler.fit_transform(X1)\n# summarize transformed data\nset_printoptions(precision=3)\nprint(rescaledX1[0:5,:])\nnp.random.seed(42)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## The following code compares the algorithms consistently, outputting accuracy scores and MCC for each.\n### This code uses the K-fold cross validation method to evaluate the models. This method is based on randomly partitioning the data into k equal-sized subsamples, with a single subsample retained as the validation data for testing the model and the remaining subsamples used as training data. The process is then repeated k times, with each of the k subsamples used exactly once as the validation data. The k results can then be averaged to produce a single estimate. This method tests the model's ability to predict new data that was not used in estimating it, in order to flag problems like overfitting. 
\n### I also added MCC to the code for better interpretation"},{"metadata":{"trusted":true},"cell_type":"code","source":"from pandas import read_csv\nfrom matplotlib import pyplot\nfrom sklearn.model_selection import KFold\nfrom sklearn.model_selection import cross_val_score\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.tree import DecisionTreeClassifier\nfrom sklearn.ensemble import RandomForestClassifier\nfrom sklearn.neighbors import KNeighborsClassifier\nfrom sklearn.discriminant_analysis import LinearDiscriminantAnalysis\nfrom sklearn.naive_bayes import GaussianNB\nfrom sklearn.svm import SVC\nfrom sklearn.linear_model import SGDClassifier\nfrom sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis\nfrom sklearn.neural_network import MLPClassifier\nfrom sklearn.metrics import matthews_corrcoef\n\narray = Train.values \nX = array[:,0:11]\nY = array[:,11]\n\n# prepare models and add them to a list\nmodels = []\nmodels.append(('LR', LogisticRegression()))\nmodels.append(('LDA', LinearDiscriminantAnalysis()))\nmodels.append(('KNN', KNeighborsClassifier()))\nmodels.append(('CART', DecisionTreeClassifier()))\nmodels.append(('NB', GaussianNB()))\nmodels.append(('SVM', SVC()))\nmodels.append(('RFC', RandomForestClassifier()))\nmodels.append(('SGD', SGDClassifier()))\nmodels.append(('QDA', QuadraticDiscriminantAnalysis()))\nmodels.append(('MLPC', MLPClassifier()))\n\n# evaluate each model in turn\nresults = []\nnames = []\nscoring = 'accuracy'\n\nfor name, model in models:\n    # shuffle=True is needed for random_state to take effect (recent scikit-learn versions raise an error otherwise)\n    kfold = KFold(n_splits=30, shuffle=True, random_state=14)\n    cv_results = cross_val_score(model, rescaledX[:,(0,1,2,3,4,5,6,7)], Y, cv=kfold, scoring=scoring)\n    results.append(cv_results)\n    names.append(name)\n    msg = (name, cv_results.mean(), cv_results.std())\n    print(msg)\n    # note: the MCC below is computed on the same data the model was fitted on (in-sample)\n    model.fit(rescaledX[:,(0,1,2,3,4,5,6,7)], Y)\n    predicted = model.predict(rescaledX[:,(0,1,2,3,4,5,6,7)])\n    MCC = matthews_corrcoef(Y, predicted)\n    print(MCC)\n\n\n# boxplot algorithm comparison\nfig = 
pyplot.figure()\nfig.suptitle('Algorithm Comparison')\nax = fig.add_subplot(111)\npyplot.boxplot(results)\nax.set_xticklabels(names)\npyplot.show()\nnp.random.seed(42)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## From the above evaluations:\n### From the train/test split, KNeighborsClassifier and RandomForestClassifier had the highest accuracy and MCC. From the K-fold cross validation, GaussianNB and Support Vector Machine had the highest accuracy, while Decision Tree Classifier and RandomForestClassifier had the highest MCC, both equal to 1.0. Note that this MCC was computed on the same data the models were fitted on, so a value of 1.0 shows that these models reproduce the training labels, not how they perform on unseen data.\n\n### In order to determine the best model for prediction, I ran the prediction algorithms using each of the 4 mentioned models and submitted the csv files in order to compare the scores"},{"metadata":{},"cell_type":"markdown","source":"# KNeighborsClassifier"},{"metadata":{"trusted":true},"cell_type":"code","source":"model = KNeighborsClassifier()\nmodel.fit(rescaledX[:,(0,1,2,3,4,5,6,7)], Y)\nmodel.predict(rescaledX1[:,(0,1,2,3,4,5,6,7)])\nprediction = model.predict(rescaledX1[:,(0,1,2,3,4,5,6,7)])\ntest_pred = pd.DataFrame(prediction)\ntest_pred.columns = [\"CLASS\"]\ntest_pred.index.name = \"Index\"\ntest_pred['CLASS']= test_pred['CLASS'].map({0.0:False,1.0:True})\n\ntest_pred.to_csv(\"test_pred.csv\")\nprint(test_pred['CLASS'].unique())\nprint(test_pred.groupby('CLASS').size()[0].sum())\nprint (test_pred.groupby('CLASS').size()[1].sum())\ntest_pred","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# RandomForestClassifier"},{"metadata":{"trusted":true},"cell_type":"code","source":"model = RandomForestClassifier()\nmodel.fit(rescaledX[:,(0,1,2,3,4,5,6,7)],Y)\nmodel.predict(rescaledX1[:,(0,1,2,3,4,5,6,7)])\nprediction = model.predict(rescaledX1[:,(0,1,2,3,4,5,6,7)])\ntest_pred = pd.DataFrame(prediction)\ntest_pred.columns = [\"CLASS\"]\ntest_pred.index.name = \"Index\"\ntest_pred['CLASS']= 
test_pred['CLASS'].map({0.0:False,1.0:True})\n\ntest_pred.to_csv(\"test_pred.csv\")\nprint(test_pred['CLASS'].unique())\nprint(test_pred.groupby('CLASS').size()[0].sum())\nprint (test_pred.groupby('CLASS').size()[1].sum())\ntest_pred","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# GaussianNB"},{"metadata":{"trusted":true},"cell_type":"code","source":"model = GaussianNB()\nmodel.fit(rescaledX[:,(0,1,2,3,4,5,6,7)],Y)\nmodel.predict(rescaledX1[:,(0,1,2,3,4,5,6,7)])\nprediction = model.predict(rescaledX1[:,(0,1,2,3,4,5,6,7)])\ntest_pred = pd.DataFrame(prediction)\ntest_pred.columns = [\"CLASS\"]\ntest_pred.index.name = \"Index\"\ntest_pred['CLASS']= test_pred['CLASS'].map({0.0:False,1.0:True})\n\ntest_pred.to_csv(\"test_pred.csv\")\nprint(test_pred['CLASS'].unique())\nprint(test_pred.groupby('CLASS').size()[0].sum())\nprint (test_pred.groupby('CLASS').size()[1].sum())\ntest_pred","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# Support Vector Machine"},{"metadata":{"trusted":true},"cell_type":"code","source":"model = SVC()\nmodel.fit(rescaledX[:,(0,1,2,3,4,5,6,7)], Y)\nmodel.predict(rescaledX1[:,(0,1,2,3,4,5,6,7)])\nprediction = model.predict(rescaledX1[:,(0,1,2,3,4,5,6,7)])\ntest_pred = pd.DataFrame(prediction)\ntest_pred.columns = [\"CLASS\"]\ntest_pred.index.name = \"Index\"\ntest_pred['CLASS']= test_pred['CLASS'].map({0.0:False,1.0:True})\n\ntest_pred.to_csv(\"test_pred.csv\")\nprint(test_pred['CLASS'].unique())\nprint(test_pred.groupby('CLASS').size()[0].sum())\nprint (test_pred.groupby('CLASS').size()[1].sum())\ntest_pred","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## I discovered that the best submission score came from the GaussianNB that had the highest accuracy score in the Kfold cross validation and a moderate MCC compared to the other models. 
The KNeighborsClassifier and Random Forest Classifier models performed more poorly than the GaussianNB even though they had higher accuracy and MCC from the train/test split evaluation\n\n## From this I concluded that K-fold cross validation is a more reliable method of evaluating machine learning algorithm performance than a single split into test and train data sets, and I therefore chose GaussianNB as my prediction model\n\n## I also decided to test whether transforming the dataset by rescaling actually affected the models' performance, by re-evaluating with K-fold cross validation to see whether there would be any change in the accuracy or MCC of the model chosen as the best"},{"metadata":{"trusted":true},"cell_type":"code","source":"from pandas import read_csv\nfrom matplotlib import pyplot\nfrom sklearn.model_selection import KFold\nfrom sklearn.model_selection import cross_val_score\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.tree import DecisionTreeClassifier\nfrom sklearn.ensemble import RandomForestClassifier\nfrom sklearn.neighbors import KNeighborsClassifier\nfrom sklearn.discriminant_analysis import LinearDiscriminantAnalysis\nfrom sklearn.naive_bayes import GaussianNB\nfrom sklearn.svm import SVC\nfrom sklearn.linear_model import SGDClassifier\nfrom sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis\nfrom sklearn.neural_network import MLPClassifier\nfrom sklearn.metrics import matthews_corrcoef\n\narray = Train.values \nX = array[:,0:11]\nY = array[:,11]\n\n# prepare models and add them to a list\nmodels = []\nmodels.append(('LR', LogisticRegression()))\nmodels.append(('LDA', LinearDiscriminantAnalysis()))\nmodels.append(('KNN', KNeighborsClassifier()))\nmodels.append(('CART', DecisionTreeClassifier()))\nmodels.append(('NB', GaussianNB()))\nmodels.append(('SVM', SVC()))\nmodels.append(('RFC', RandomForestClassifier()))\nmodels.append(('SGD', SGDClassifier()))\nmodels.append(('QDA', 
QuadraticDiscriminantAnalysis()))\nmodels.append(('MLPC', MLPClassifier()))\n\n# evaluate each model in turn\nresults = []\nnames = []\nscoring = 'accuracy'\n\nfor name, model in models:\n    # shuffle=True is needed for random_state to take effect (recent scikit-learn versions raise an error otherwise)\n    kfold = KFold(n_splits=30, shuffle=True, random_state=14)\n    cv_results = cross_val_score(model, X[:,(0,1,2,3,4,5,6,7)], Y, cv=kfold, scoring=scoring)\n    results.append(cv_results)\n    names.append(name)\n    msg = (name, cv_results.mean(), cv_results.std())\n    print(msg)\n    # note: the MCC below is computed on the same data the model was fitted on (in-sample)\n    model.fit(X[:,(0,1,2,3,4,5,6,7)], Y)\n    predicted = model.predict(X[:,(0,1,2,3,4,5,6,7)])\n    MCC = matthews_corrcoef(Y, predicted)\n    print(MCC)\n\n\n# boxplot algorithm comparison\nfig = pyplot.figure()\nfig.suptitle('Algorithm Comparison')\nax = fig.add_subplot(111)\npyplot.boxplot(results)\nax.set_xticklabels(names)\npyplot.show()\nnp.random.seed(42)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## It was discovered that evaluating non-rescaled data didn't change the accuracy or MCC of GaussianNB or of most of the other models. 
But it did change for the following;\n### - KNeighborsClassifier; where the accuracy and MCC both reduced\n### - DecisionTreeClassifier; where the accuracy reduced\n### - RandomForestClassifier; where accuracy reduced\n### - Support Vector Machine; where the accuracy and MCC both reduced\n### - Stochastic Gradient Descent; where the accuracy and MCC both reduced\n### - MLPClassifier; where the MCC increased\n\n## This prompted me to make a prediction using also the non-rescaled data to see if there would be a change in the score submission."},{"metadata":{"trusted":true},"cell_type":"code","source":"model = GaussianNB()\nmodel.fit(X[:,(0,1,2,3,4,5,6,7)],Y)\nmodel.predict(X1[:,(0,1,2,3,4,5,6,7)])\nprediction = model.predict(X1[:,(0,1,2,3,4,5,6,7)])\ntest_pred = pd.DataFrame(prediction)\ntest_pred.columns = [\"CLASS\"]\ntest_pred.index.name = \"Index\"\ntest_pred['CLASS']= test_pred['CLASS'].map({0.0:False,1.0:True})\n\ntest_pred.to_csv(\"test_pred.csv\")\nprint(test_pred['CLASS'].unique())\nprint(test_pred.groupby('CLASS').size()[0].sum())\nprint (test_pred.groupby('CLASS').size()[1].sum())\ntest_pred","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## From this prediction, the submission score was found to be slightly higher than the rescaled data.\n### It was also discovered that the higher the number of features selected the higher the accuracy of the models and the higher the score after submission.\n### The final testing was to see whether the score would improve if all features were considered and the prediction was done on all the data"},{"metadata":{"trusted":true},"cell_type":"code","source":"model = GaussianNB()\nmodel.fit(X[:,0:10],Y)\nmodel.predict(X1)\nprediction = model.predict(X1)\ntest_pred = pd.DataFrame(prediction)\ntest_pred.columns = [\"CLASS\"]\ntest_pred.index.name = \"Index\"\ntest_pred['CLASS']= 
test_pred['CLASS'].map({0.0:False,1.0:True})\n\ntest_pred.to_csv(\"test_pred.csv\")\nprint(test_pred['CLASS'].unique())\nprint(test_pred.groupby('CLASS').size()[0].sum())\nprint (test_pred.groupby('CLASS').size()[1].sum())\ntest_pred","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### It was discovered that the highest score was obtained from the prediction made without feature selection."},{"metadata":{},"cell_type":"markdown","source":"### In conclusion, K-fold cross validation is a better tool for evaluating the performance of a machine learning algorithm than a single split of the data into test and training sets. It was also noted that accuracy or MCC alone are not sufficient measures of model performance, as the models with the highest accuracy on the train/test split did not perform best upon submission. The best model for this data set is the GaussianNB model, which coincides with previous claims that it works best for real-world data. An MCC of 1.0 indicates perfect agreement between actuals and predictions. Thus we would expect that KNeighborsClassifier and RandomForestClassifier would perform the best at prediction. 
However we see that they did not perform as expected which can be put down to a high rate of false positives and false negatives being generated.\n### Therefore we need to consider several factors such as precision and F1-score before we conclude which model is best."},{"metadata":{},"cell_type":"markdown","source":"## Literature:\n### -wikipedia\n### -class notes\n### -https://scikit-learn.org/stable/supervised_learning.html#supervised-learning\n### -https://medium.com/datadriveninvestor/classification-algorithms-in-machine-learning-85c0ab65ff4\n### -https://www.geeksforgeeks.org/ml-stochastic-gradient-descent-sgd/\n### -https://data-flair.training/blogs/machine-learning-classification-algorithms/\n### -https://www.datascienceblog.net/post/machine-learning/linear-discriminant-analysis/\n### -https://towardsdatascience.com/data-visualization-for-machine-learning-and-data-science-a45178970be7\n### -https://seaborn.pydata.org/generated/seaborn.distplot.html\n### -https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-019-6413-7"}],"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"pygments_lexer":"ipython3","nbconvert_exporter":"python","version":"3.6.4","file_extension":".py","codemirror_mode":{"name":"ipython","version":3},"name":"python","mimetype":"text/x-python"}},"nbformat":4,"nbformat_minor":4}