What will you learn in this blog?
You will learn some basic pandas usage that helps you process your data, such as iloc, isnull() and head(). You will learn about encoding: why we need it and two of its types, LabelEncoder and One Hot Encoder. We will also see how a Random Forest Classifier can be trained and how a confusion matrix helps us determine the accuracy of our model. We will be using sklearn throughout the blog.
Of all the types of cancer, breast cancer is a rather common one. In fact, in the United States, breast cancer is the most common cause of cancer death in women after lung cancer. Breast cancer can occur in both men and women, but it’s far more common in women.
Advances in screening and treatment have improved survival rates dramatically since 1989. There are around 3.1 million breast cancer survivors in the United States (U.S.). The chance of any woman dying from breast cancer is around 1 in 37, or 2.7 percent. Even so, in 2017 around 252,710 new diagnoses of breast cancer were expected in women, and around 40,610 women were likely to die from the disease. The numbers have improved since then, mostly due to breast cancer awareness and good treatment.
Detecting breast cancer early
According to Breastcancer.org,
breast self-exam, or regularly examining your breasts on your own, can be an important way to find breast cancer early when it’s more likely to be treated successfully. While no single test can detect all breast cancers early, they believe that performing breast self-exam in combination with other screening methods can increase the odds of early detection.
By breast self-exam, we mean checking for lumps. But the fact is that not all lumps are cancerous.
So in this blog, we are going to attempt to predict whether a lump is malignant or benign. This is a pretty common and simple problem, and this is my take on it.
The Data Set
I used this dataset from Kaggle.
import pandas as pd
data = pd.read_csv('data.csv')
data.shape   # To find the dimensions of the dataset
data.head()  # To preview the first five rows
data.shape will give you (569, 33), which means you have 569 rows/data points with 33 attributes.
This is what your dataset looks like.
Let’s check if our dataset has any null or empty values.
# Missing values
data.isnull().sum()
data.isna().sum()
It would return zero for all attributes if none is missing.
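To see what the check reports when something is missing, here is a minimal sketch on a made-up frame. The column names and values are my own illustration, not the Kaggle data. Note that isnull() is simply an alias of isna(), so the two calls return the same counts.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the Kaggle data) with two missing entries.
toy = pd.DataFrame({
    "radius": [14.2, np.nan, 20.1],
    "texture": [19.0, 21.5, np.nan],
    "label": ["B", "M", "M"],
})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# True if at least one cell anywhere in the frame is missing.
has_missing = toy.isnull().values.any()
print(has_missing)
```

A column with a non-zero count is one you would need to fill in or drop before training.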
Input and Output
Consider our dataset. Given a list of attributes, we are trying to predict if the tumor/lump is malignant or benign. So the second column (position 1, since positions start at 0) is our output, the result to be predicted, and the columns after it are our inputs. The column id has no relevance to our problem.
Let’s split it.
X = data.iloc[:, 2:32].values
Y = data.iloc[:, 1].values
iloc in a pandas DataFrame is used for integer-location based indexing/selection by position. The iloc indexer syntax is data.iloc[<row selection>, <column selection>]
In simple words, “iloc” in pandas is used to select rows and columns by number, in the order that they appear in the data frame.
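Here is a small sketch of iloc on a toy frame so you can see the positions at work. The frame and its column names are hypothetical and only mimic the layout described above (an id, then the label, then the features).

```python
import pandas as pd

# Hypothetical toy frame: id, label, then two feature columns.
toy = pd.DataFrame({
    "id": [101, 102, 103],
    "diagnosis": ["M", "B", "B"],
    "radius": [17.9, 11.4, 12.0],
    "texture": [10.3, 17.7, 14.3],
})

labels = toy.iloc[:, 1].values      # every row, the column at position 1
features = toy.iloc[:, 2:4].values  # every row, positions 2 and 3 (the end is exclusive)
first_row = toy.iloc[0]             # the first row, returned as a Series

print(labels)
print(features.shape)
```

Positions start at 0 and a slice stops before its end value, which is why 2:32 in our case picks positions 2 through 31.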
Now, notice what your Y looks like: it holds text labels rather than numbers.
In a very basic sense, Machine Learning models don’t understand text as it is. So we convert it into the language that they can understand: numbers.
This conversion of our text values to numbers is called encoding in ML.
There are two types of encoding in ML: Label Encoder and One Hot Encoder.
Label encoding is the simple, straightforward one. We convert categorical text data into model-understandable numerical data, in this case using the LabelEncoder class of sklearn. So all we have to do to label encode a column is import the LabelEncoder class from the sklearn library, fit and transform the first column of the data, and then replace the existing text data with the new encoded data.
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
# x is assumed to be a 2-D array whose first column holds text categories
x[:, 0] = labelencoder.fit_transform(x[:, 0])
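As a quick illustration of what label encoding produces, here is a minimal sketch on a made-up list of fruits (the list is my own example, not part of the dataset). LabelEncoder sorts the categories and numbers them from 0.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical column.
fruits = ["banana", "apple", "cherry", "apple"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(fruits)

print(encoder.classes_)  # categories in sorted order: apple, banana, cherry
print(encoded)           # one integer per row: 1, 0, 2, 0

# The mapping can be reversed to get the original text back.
decoded = encoder.inverse_transform(encoded)
```

Notice that cherry ends up as 2 and apple as 0 purely because of alphabetical order.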
Label encoding introduces a new problem. Sometimes we encode categorical data whose categories have no relation, of any kind, to each other. Since there are different numbers in the same column, the model may misunderstand the data to be in some kind of order, e.g. 3 > 2 > 1 > 0. But this isn’t the case at all. To overcome this problem, we use One Hot Encoder.
So, what it does is as follows.
First, it takes a column with categorical data that has already been label encoded. It then splits the column into multiple columns, as many as there are categories. In each of these new columns, a row gets 1 if it belongs to that column’s category and 0 otherwise.
So instead of one column answering, let’s say, what each fruit is, we have multiple columns, one for each fruit. The column for, let’s say, Apple has 1 for all rows that are apples and 0 otherwise.
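from sklearn.preprocessing import OneHotEncoder
# With no arguments, every column of x is one hot encoded. The categorical_features
# argument was removed in scikit-learn 0.22; use ColumnTransformer to encode only some columns.
onehotencoder = OneHotEncoder()
x = onehotencoder.fit_transform(x).toarray()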
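To make the fruit example concrete, here is a minimal sketch on a made-up single-column array. The data is hypothetical, and I am relying on the fact that recent scikit-learn versions accept string categories directly, so the label encoding step is skipped here.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: one categorical column, shaped as a 2-D array.
fruits = np.array([["apple"], ["banana"], ["cherry"], ["apple"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix; toarray() makes it a dense array.
one_hot = encoder.fit_transform(fruits).toarray()

print(encoder.categories_)  # the category behind each new column
print(one_hot)              # one row per fruit, a single 1 in each row
```

Each row has exactly one 1, sitting in the column of its own category, so no fruit looks "bigger" than another.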
For our case, a simple LabelEncoder would do.
# Encoding categorical data values
from sklearn.preprocessing import LabelEncoder
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
Test and Train Data Set
So, like any ML problem, we are going to have to split our dataset into two, one to learn from and the other to test. We will leave 25% of our whole data to test.
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25, random_state = 0)
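If you want to see what the split does to the shapes, here is a small sketch on made-up arrays (the toy data and variable names are my own). With 20 rows and test_size = 0.25, 5 rows go to the test set. scikit-learn rounds the test share up, so with our 569 rows the test set should hold 143 rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, 2 features, alternating labels.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0, 1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

print(X_tr.shape, X_te.shape)  # training rows vs. test rows
```

Fixing random_state means you get the same split every time you run the code, which makes results reproducible.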
Most of the time, our data contains values that vary very much in scale, anything from nanometers to kilometers, for example. The problem is that most algorithms just take the magnitude and drop the units, so features with large magnitudes will weigh more than others.
Look at our data values. The range is too large. So, we scale them to an acceptable range.
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
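Here is a minimal sketch of what StandardScaler does, using made-up numbers where the second feature is a thousand times larger than the first. The arrays are hypothetical. The scaler learns the mean and standard deviation from the training data only, and the test data is transformed with those same statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales.
train = np.array([[1.0, 1000.0],
                  [2.0, 2000.0],
                  [3.0, 3000.0]])
test = np.array([[2.0, 4000.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean/std from train, then scale it
test_scaled = scaler.transform(test)        # reuse the training mean/std

print(train_scaled)  # both columns now have mean 0 and standard deviation 1
print(test_scaled)
```

After scaling, both training columns hold identical values, so neither feature outweighs the other just because of its units.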
Now we are all set!
Train your Model
We are going to use the Random Forest Classifier. Which model to use and where is a discussion for another detailed blog in itself.
Random Forest Classifier is an ensemble algorithm, i.e. it combines multiple models of the same type, in this case decision trees, to classify objects.
In simple words, a Random Forest Classifier creates a set of decision trees from a randomly selected subset of the training set. It then aggregates the votes from different decision trees to decide the final class of the test object.
This is a good read if you want to learn more about it.
Let’s see how it’s implemented in our case.
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, Y_train)
Here n_estimators stands for the number of decision trees, and criterion determines how the quality of a split is measured: “gini” for Gini Impurity and “entropy” for Information Gain. random_state, if int, is the seed used by the random number generator. If None, the random number generator is the RandomState instance used by np.random.
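To peek inside a trained forest, here is a small sketch on synthetic data generated with make_classification. The data and variable names are my own and stand in for the real training set. One detail worth knowing: scikit-learn’s implementation averages the class probabilities from each tree rather than counting hard votes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 200 rows, 10 features, 2 classes.
X_syn, y_syn = make_classification(n_samples=200, n_features=10, random_state=0)

forest = RandomForestClassifier(n_estimators=10, criterion="entropy", random_state=0)
forest.fit(X_syn, y_syn)

n_trees = len(forest.estimators_)                # one fitted decision tree per estimator
probabilities = forest.predict_proba(X_syn[:3])  # class probabilities averaged over the trees
predictions = forest.predict(X_syn[:3])          # the most probable class per row

print(n_trees)
print(probabilities)
print(predictions)
```

Raising n_estimators gives the forest more trees to average over, at the cost of longer training time.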
Test your Model
Now that you have created and trained your model, let’s test to see how well it performs.
Let’s store all the predictions in Y_pred.
Y_pred = classifier.predict(X_test)
Now, to see how well our model has predicted, I am going to look at its confusion matrix.
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y_test, Y_pred)
It has actually performed well! 89 instances of one class and 52 of the other were correctly predicted. Just one instance of each class was predicted wrong. That’s a pretty good result.
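If you are new to reading a confusion matrix, here is a minimal sketch on made-up labels (the two lists are hypothetical). In sklearn, rows are the true classes and columns are the predicted classes, so the diagonal holds the correct predictions.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions.
y_true = [0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 1, 1, 1, 1, 0]

cm_toy = confusion_matrix(y_true, y_hat)
print(cm_toy)

# For a two-class problem, ravel() unpacks the four cells in this order.
tn, fp, fn, tp = cm_toy.ravel()
print(tn, fp, fn, tp)
```

Here two 0s and three 1s were predicted correctly, while one instance of each class landed off the diagonal.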
Let’s see the accuracy.
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(Y_test, Y_pred)
98% is a pretty great accuracy.
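The accuracy can also be worked out by hand from the confusion matrix: correct predictions divided by all predictions. Here is a small sketch using the counts reported above. Which class sits in which row is my assumption, but it does not change the result.

```python
import numpy as np

# Counts as reported in the text: 89 and 52 correct, one error per class.
# The row order (which class comes first) is an assumption.
cm_reported = np.array([[89, 1],
                        [1, 52]])

correct = np.trace(cm_reported)  # sum of the diagonal
total = cm_reported.sum()        # all test instances
accuracy_from_cm = correct / total

print(f"{accuracy_from_cm:.3f}")
```

That is 141 correct out of 143 test instances, which is where the roughly 98% comes from.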
If you found the project useful, kindly upvote the kaggle kernel
We have successfully downloaded a dataset, prepped data by encoding and feature scaling, trained a Random Forest Classifier and tested it. I hope you had a good time learning about a solution approach to this problem. Feel free to leave feedback and interesting problems you would like to see solved like this.