Big Data: My First Kaggle. My First SciKit Learn.

25 Jan


Recently, I saw a tweet that linked to the Kaggle blog. I had no idea what Kaggle was. Apparently, it’s making data science a sport. The tweet caught my attention because I had started reading “Data Mining: Practical Machine Learning Tools and Techniques.” Googling “python machine learning” brought me to scikit-learn. I remembered that the PyData conference had posted several videos on machine learning, including three on scikit-learn. It looked like fun, so I entered the Titanic Kaggle competition.

The competition provides a set of training data (people on the Titanic and whether they lived or died) and a set of test data. I still don’t know much about machine learning techniques, but I was able to figure out how to run LogisticRegression and LinearSVC. Slapping together two simple Python programs, I scored 75% with SVC and 76% with Regression. I did have to fudge the training data to scrub text strings and replace missing data with 0’s. This probably affected my models, as did sticking with the default parameters, but I was happy just to get the data formatted and the code running.
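A minimal sketch of that kind of scrubbing, using only the standard library. The sample rows, the column names, and the `SEX_CODES` mapping are made up for illustration; the post doesn't say exactly how the text strings were handled, so treat the encoding step as one possible choice.

```python
import csv
import io

# Hypothetical sample rows; column names and values are invented for this sketch.
raw = io.StringIO(
    "sex,age,fare\n"
    "male,22,7.25\n"
    "female,,71.28\n"
)

SEX_CODES = {"male": 0.0, "female": 1.0}  # assumed encoding, not from the post


def scrub(row):
    """Turn one CSV row (a dict of strings) into a list of floats."""
    return [
        SEX_CODES.get(row["sex"], 0.0),             # text string -> number
        float(row["age"]) if row["age"] else 0.0,   # missing value -> 0
        float(row["fare"]) if row["fare"] else 0.0,
    ]


features = [scrub(row) for row in csv.DictReader(raw)]
print(features)
```

Filling gaps with 0 is the quick route taken here. It also tells the model that a passenger with no recorded age is a newborn, which is one reason the fudging can skew the results.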

Now I am motivated to learn more about the methods and parameters of the models. A side note: the Regression model gave a probability for whether each person lived or died, and I counted a person as a survivor if that probability was greater than 0.50.
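Here is a rough sketch of that probability cutoff with scikit-learn's `LogisticRegression`. The data is a toy stand-in (a single made-up feature), not the Titanic set, and the variable names are only for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in data, not the Titanic set: one feature, label 1 when it is large.
X_train = np.array([[0.0], [1.0], [2.0], [3.0], [7.0], [8.0], [9.0], [10.0]])
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(X_train, y_train)  # default parameters

X_test = np.array([[0.5], [9.5]])

# predict_proba returns one row per sample; columns follow clf.classes_, here [0, 1],
# so column 1 holds the probability of class 1 ("survived" in the Titanic setting).
survival_prob = clf.predict_proba(X_test)[:, 1]

# Apply the 0.50 cutoff by hand.
predicted = (survival_prob > 0.5).astype(int)
print(predicted)
```

For a two-class problem, `clf.predict` picks the more probable class, so it gives the same labels as the manual cutoff. Doing the cutoff by hand mainly helps if you later want to try a different threshold.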

Here is my code for the SVC. The Regression is almost identical.

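import csv
import numpy as np
from sklearn.svm import LinearSVC

# Training features, already scrubbed to numbers with missing values set to 0.
# The names X and y and the row handling are assumed.
with open('titanictrain.csv', newline='') as f:
    X = np.array([row for row in csv.reader(f)], dtype=float)

# Survival labels, one per training row (assumed to be in the first column).
with open('titanictarget.csv', newline='') as f:
    y = np.array([row[0] for row in csv.reader(f)], dtype=int)

clf = LinearSVC().fit(X, y)  # default parameters

# `test` is a csv.reader over the scrubbed test file, opened the same way as above.
for row in test:
    print(clf.predict(np.array([row], dtype=float))[0])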




One Response to “Big Data: My First Kaggle. My First SciKit Learn.”

  1. Gaurav Kumar September 8, 2013 at 8:06 am #

    You call 500 samples – Big Data. Try the new Facebook challenge- that’s Big Data- 9 GB files- 145 million lines – 6 million rows.

