
Mark is a graph advocate and field engineer for Neo Technology, the company behind the Neo4j graph database. As a field engineer, Mark helps customers embrace graph data and Neo4j, building sophisticated solutions to challenging data problems. When he's not with customers, Mark is a developer on Neo4j and writes about his experiences as a graphista on his blog at http://markhneedham.com/blog. He tweets at @markhneedham.

Python: Making scikit-learn and Pandas Play Nice

11.14.2013

In my last post about Nathan's and my attempts at the Kaggle Titanic problem, I mentioned that our next step was to try out scikit-learn, so I thought I should summarize where we've got to so far.

We needed to write a classification algorithm to work out whether a person on board the Titanic survived, and luckily scikit-learn has extensive documentation on each of its algorithms.

Unfortunately, almost all of those examples use numpy data structures, and since we'd loaded our data using pandas we didn't particularly want to switch back!

Luckily it's really easy to get the data into numpy format: just call 'values' on a pandas data structure, something we learnt from a reply on Stack Overflow.
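
To see what 'values' gives us, here's a quick interpreter session using the train_df data frame we load below (the exact repr strings may vary between versions; this output is from a Python 2-era setup like the one in the tracebacks later on):

>>> type(train_df["Survived"])
<class 'pandas.core.series.Series'>
>>> type(train_df["Survived"].values)
<type 'numpy.ndarray'>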

For example, if we wanted to wire up an ExtraTreesClassifier that predicts survival based on the 'Fare' and 'Pclass' attributes, we could write the following code:

import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.cross_validation import cross_val_score

train_df = pd.read_csv("train.csv")

et = ExtraTreesClassifier(n_estimators=100, max_depth=None, min_samples_split=1, random_state=0)

columns = ["Fare", "Pclass"]

# 'values' converts the pandas structures into the numpy arrays scikit-learn expects
labels = train_df["Survived"].values
features = train_df[list(columns)].values

et_score = cross_val_score(et, features, labels, n_jobs=-1).mean()

print("{0} -> ET: {1}".format(columns, et_score))

To start with, we read in the CSV file, which looks like this:

$ head -n5 train.csv 
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S

Next we create our classifier, which "fits a number of randomized decision trees (a.k.a. extra-trees) on various sub-samples of the dataset and use averaging to improve the predictive accuracy and control over-fitting" – i.e. a close relative of the random forest that injects extra randomness into how splits are chosen.

On the next line we list the features we want the classifier to use, then we convert the labels and features into numpy format so we can pass them to the classifier.

Finally we call the cross_val_score function, which repeatedly splits our training data into training and test folds, trains the classifier against the former, and measures its accuracy using the latter; we then average the fold scores with mean.
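
To make that concrete, here's roughly what cross_val_score does for a classifier (a simplified sketch against the same sklearn.cross_validation module; the real function defaults to a stratified three-fold split and also handles custom scorers and parallelism):

from sklearn.cross_validation import StratifiedKFold

scores = []
for train_idx, test_idx in StratifiedKFold(labels, n_folds=3):
    et.fit(features[train_idx], labels[train_idx])  # train on two of the folds
    scores.append(et.score(features[test_idx], labels[test_idx]))  # score on the held-out fold
print(sum(scores) / len(scores))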

If we run this code we’ll get roughly the following output:

$ python et.py
 
['Fare', 'Pclass'] -> ET: 0.687991021324

This is actually a worse accuracy than we'd get by simply predicting that every female survived and every male didn't.
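
That baseline is easy to check against the training set (a quick sketch; at this point 'Sex' is still the raw string column):

females_survive = (train_df["Sex"] == "female").astype(int)
print((train_df["Survived"] == females_survive).mean())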

We can introduce ‘Sex’ into the classifier by adding it to the list of columns:

columns = ["Fare", "Pclass", "Sex"]

If we re-run the code we’ll get the following error:

$ python et.py
 
An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (514, 0))
...
Traceback (most recent call last):
  File "et.py", line 14, in <module>
    et_score = cross_val_score(et, features, labels, n_jobs=-1).mean()
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/cross_validation.py", line 1152, in cross_val_score
    for train, test in cv)
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/externals/joblib/parallel.py", line 519, in __call__
    self.retrieve()
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/externals/joblib/parallel.py", line 450, in retrieve
    raise exception_type(report)
sklearn.externals.joblib.my_exceptions.JoblibValueError/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/externals/joblib/my_exceptions.py:26: DeprecationWarning: BaseException.message has been deprecated as of Python 2.6
  self.message,
: JoblibValueError
___________________________________________________________________________
Multiprocessing exception:
...
ValueError: could not convert string to float: male
___________________________________________________________________________

This is a slightly verbose way of telling us that we can't pass non-numeric features to the classifier – in this case 'Sex' has the values 'female' and 'male'. We'll need to replace those values with numeric equivalents:

train_df["Sex"] = train_df["Sex"].apply(lambda sex: 0 if sex == "male" else 1)

Now if we re-run the classifier we’ll get a slightly more accurate prediction:

$ python et.py 
['Fare', 'Pclass', 'Sex'] -> ET: 0.813692480359

The next step is to use the classifier against the test data set, so let’s load the data and run the prediction:

test_df = pd.read_csv("test.csv")
 
et.fit(features, labels)
et.predict(test_df[columns].values)

Now if we run that:

$ python et.py 
['Fare', 'Pclass', 'Sex'] -> ET: 0.813692480359
Traceback (most recent call last):
  File "et.py", line 22, in <module>
    et.predict(test_df[columns].values)
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/ensemble/forest.py", line 444, in predict
    proba = self.predict_proba(X)
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/ensemble/forest.py", line 479, in predict_proba
    X = array2d(X, dtype=DTYPE)
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/utils/validation.py", line 91, in array2d
    X_2d = np.asarray(np.atleast_2d(X), dtype=dtype, order=order)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/core/numeric.py", line 235, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: could not convert string to float: male

This is the same problem we had earlier! We need to replace the 'male' and 'female' values in the test set too, so we'll pull out a function to do that:

def replace_non_numeric(df):
    # encode 'Sex' as 0/1 so scikit-learn only ever sees numeric features
    df["Sex"] = df["Sex"].apply(lambda sex: 0 if sex == "male" else 1)
    return df

Now we’ll call that function with our training and test data frames:

train_df = replace_non_numeric(pd.read_csv("train.csv"))
...
test_df = replace_non_numeric(pd.read_csv("test.csv"))

If we run the program again:

$ python et.py 
['Fare', 'Pclass', 'Sex'] -> ET: 0.813692480359
Traceback (most recent call last):
  File "et.py", line 26, in <module>
    et.predict(test_df[columns].values)
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/ensemble/forest.py", line 444, in predict
    proba = self.predict_proba(X)
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/ensemble/forest.py", line 479, in predict_proba
    X = array2d(X, dtype=DTYPE)
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/utils/validation.py", line 93, in array2d
    _assert_all_finite(X_2d)
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/utils/validation.py", line 27, in _assert_all_finite
    raise ValueError("Array contains NaN or infinity.")
ValueError: Array contains NaN or infinity.

There are missing values in the test set, so we'll replace those with average values from our training set using an Imputer:

from sklearn.preprocessing import Imputer

# learn the per-column means from the training features...
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
imp.fit(features)

test_df = replace_non_numeric(pd.read_csv("test.csv"))

et.fit(features, labels)
# ...and use them to fill any gaps in the test features before predicting
print(et.predict(imp.transform(test_df[columns].values)))

If we run that it completes successfully:

$ python et.py 
['Fare', 'Pclass', 'Sex'] -> ET: 0.813692480359
[0 1 0 0 1 0 0 1 1 0 0 0 1 0 1 1 0 0 1 1 0 0 1 0 1 0 1 0 1 0 0 0 1 0 1 0 0
 0 0 1 0 0 0 1 1 0 0 0 1 1 0 0 1 1 0 0 0 0 0 1 0 0 0 1 0 1 1 0 0 1 1 0 1 0
 1 0 0 1 0 1 1 0 0 0 0 0 1 0 1 0 1 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0
 1 1 1 1 0 0 1 1 1 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 1 0 0 1 0 0 1 0 0 1 1 1 1 0 0 1 0 0 1 0 0 0 0 0 0 1 1 1 1 1 0 0 1 0 1
 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 1 0 0 0 0 1 0 0 1 0 1 0 1 0
 1 0 1 0 0 1 0 0 0 1 0 0 0 0 1 0 1 1 1 1 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 1
 0 0 0 1 1 0 0 0 0 0 0 1 0 1 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0
 1 0 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0
 1 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0 1 0 0 0 1 1
 0 1 0 0 1 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 1 0 0 0 1 0 1 0 0 1 0 1 0 0 1 0
 0 1 1 1 1 0 0 1 0 0 0]
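
If you'd rather stay in pandas for this step, fillna can do the same gap-filling (a sketch that fills each test column with the corresponding training-set mean, which is exactly what the mean-strategy Imputer does):

print(et.predict(test_df[columns].fillna(train_df[columns].mean()).values))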

The final step is to add these values to our test data frame and then write that to a file so we can submit it to Kaggle.

The type of those values is 'numpy.ndarray', which we can convert to a pandas Series quite easily:

predictions = et.predict(imp.transform(test_df[columns].values))
test_df["Survived"] = pd.Series(predictions)

We can then write the ‘PassengerId’ and ‘Survived’ columns to a file:

test_df.to_csv("foo.csv", cols=['PassengerId', 'Survived'], index=False)
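
(If you're following along with a newer pandas release, note that the cols keyword on to_csv was later renamed to columns.)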

The output file looks like this:

$ head -n5 foo.csv 
PassengerId,Survived
892,0
893,1
894,0

The code we’ve written is on github in case it’s useful to anyone.

Published at DZone with permission of Mark Needham, author and DZone MVB. (source)
