Introduction to Kaggle and Scoring Top 7% in the Titanic Competition
Kaggle is a fun platform hosting a variety of data science and machine learning competitions – covering topics such as sports, energy or autonomous driving.
In this post we will give an introduction to Kaggle, and tackle the introductory "Titanic" challenge. We will explain how to approach and solve such a challenge, and demonstrate this with a top 7% solution for "Titanic".

You can find the full code on Github, so you can follow along while reading this article and reproduce my exact score. In it, we follow what I consider best practices for Python and use helpful tools such as mypy and poetry. With that being said, let's dive right into it.
Kaggle
Kaggle offers a wide variety of data science / machine learning competitions – see the intro for examples. It is a great way to test and improve your data science / ML knowledge and learn how to solve problems hands-on. Plus, you can even win monetary prizes! However, Kaggle is populated by some of the best data scientists and ML people out there, and prizes are only awarded to the few top solutions (out of several hundred or thousands) – thus winning here is extremely hard and rare, and should not be your main motivation when starting.
Each competition (or at least most) comes with a story – a purpose – and a dataset. You are then tasked to understand the data and solve the desired problem. If you want, you can submit your solutions to the platform and get ranked on a public leaderboard – that is, your solution is scored on a held-out test set. However, to prevent cheating or overfitting to this leaderboard by simply spamming submissions, once the competition period (usually a few weeks to months) has expired, all competitors / teams are ranked on a private test set – deciding the ultimate winners.
In the following, we will show how to understand the data, create a model, and submit to Kaggle following the introductory Titanic competition.
Titanic – Machine Learning from Disaster
Most people probably know the story of the ocean liner Titanic and its infamous demise: a long time ago now, yet still a tragedy in which many lives were lost.
Kaggle offers "tutorial" competitions, in which you can learn and train without any time constraints. One of these is the mentioned "Titanic" competition, for which you have to predict which passengers will survive.
We get two csv files – train.csv and test.csv. Both contain the same features, such as Sex and Name of the passengers, but train.csv contains an extra column, "Survived". Thus, we are tasked to learn a model which, given the test features, can predict whether a passenger survived or not.
To make a submission, we simply create our own csv file consisting of a PassengerId and a Survived column, and upload this to Kaggle (either via drag & drop, or from the command line – as we will see later).
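For illustration, the first few lines of such a file could look as follows (the survival values here are hypothetical; the test set's PassengerIds start at 892):

PassengerId,Survived
892,0
893,1
894,0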
In detail, we have the following features:
- Pclass: ticket class (1, 2 or 3) – something similar to economy or business class in today's travel
- Name: name of the passenger
- Sex: sex of the passenger (male / female)
- Age: age in years
- SibSp: number of siblings / spouses aboard
- Parch: number of parents / children aboard
- Ticket: the ticket number
- Fare: passenger fare – how much the passenger paid for the ticket
- Cabin: cabin number
- Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
Data Analysis and Feature Selection
The first step of every data science / ML problem – and every competition you tackle – should always be understanding and visualizing the data, also known as Exploratory Data Analysis (EDA). Every good ML project stands and falls with the amount and quality of available data, and engineers spend a large share of their time collecting and preparing this data – not necessarily working on complex models.
Thus, let's do exactly that with the Titanic dataset. I have downloaded the materials from the Kaggle competition into my project folder, in a subfolder called "titanic".
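Before plotting anything, it is worth taking a quick look at the raw data – a minimal sketch, assuming the folder layout described above:

import pandas as pd

df = pd.read_csv("titanic/train.csv")
# Column types and non-null counts - reveals NaNs in Age, Cabin and Embarked
print(df.info())
# First few rows to get a feel for the data
print(df.head())

This already tells us which columns contain missing values – something we will have to deal with throughout this post.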
Let's start with the discrete data. For this, we create a function plotting the survival ratio per value of a given column (reminder: you can find the full code on Github):
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_survival_ratio(df: pd.DataFrame, column: str) -> None:
    # Replace NaN tokens by "Unk(nown)"
    df[column].fillna("Unk", inplace=True)
    # Calculate survival ratio per existing value in the column
    total_counts = df[column].value_counts()
    survival_counts = (
        df.groupby([column, "Survived"]).size().unstack().fillna(0)
    )
    column_values = df[column].unique()
    survival_ratios = [
        survival_counts.loc[value, 1] / total_counts[value]
        for value in column_values
    ]
    # One distinct color per column value
    color_mapping = plt.cm.get_cmap("tab10", len(column_values))
    bar_colors = color_mapping(np.arange(len(column_values)))
    # Plot the survival ratios
    plt.bar(column_values, survival_ratios, color=bar_colors)
    plt.title(f"Survival Ratio by {column}")
    plt.xlabel(column)
    plt.ylabel("Survival Ratio")
    plt.ylim(0, 1)
    plt.show()
Now let's look at a few of the available features as examples, starting with Sex:
df = pd.read_csv("titanic/train.csv")
plot_survival_ratio(df, "Sex")

As we can see, Sex makes a big difference for survival – seemingly the famous order of "women (and children) first" was followed flawlessly.
Next, let's look at passenger class:
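The plot below can be generated with the same helper:

plot_survival_ratio(df, "Pclass")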

As one could suspect, passengers with a higher (lower in numbers) class were probably given priority treatment.
Now, let's move to some less obvious features, and actually generate new features by fusing and modifying existing columns – a process called feature engineering.
Let's begin with the name: while nowadays it might be tempting to use large language models or similar tools to process the name and extract interesting characteristics, we content ourselves with manually extracting a single piece of information, namely the title of said person (the code for this is, by the way, modified and taken from this helpful Titanic tutorial):
def extract_title(df: pd.DataFrame) -> None:
    """Extracts and adds title information / column to the data frame.
    Args:
        df: data frame to add "Title" to
    """
    # Match any consecutive string ending with a literal ".",
    # but only grab the first part without "."
    df["Title"] = df.Name.str.extract(r"([A-Za-z]+)\.", expand=False)
    # Map other titles / spellings to some fixed "common titles",
    # and map the rest to "Others"
    common_titles = ["Mr", "Miss", "Mrs", "Master"]
    df["Title"].replace(["Ms", "Mlle", "Mme"], "Miss", inplace=True)
    df["Title"].replace(["Lady"], "Mrs", inplace=True)
    df["Title"].replace(["Sir", "Rev"], "Mr", inplace=True)
    df.loc[~df["Title"].isin(common_titles), "Title"] = "Others"

Looking at this new feature gives us further interesting insights into the dataset: again, we note that female passengers are more likely to survive, but also that a higher social status, as possibly indicated by the title "Master", significantly increases the odds of survival for male passengers.
Next, we analyze the number of parents / children and siblings / spouses. Here, we hypothesize that a larger family size increases the odds of survival – family members could help each other, or potentially get priority for the lifeboats, too. Due to the limited amount of available data, and the high number of combinations of these features, we extract a single feature corresponding to our hypothesis:
family_size = df["Parch"] + df["SibSp"]
df["Alone"] = family_size == 0
I.e., we add a new binary feature indicating whether a passenger is traveling alone or not. And, looking at the resulting plot, we find our hypothesis confirmed:
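The corresponding plot can again be generated with our helper (the boolean column is simply treated as another categorical value here):

plot_survival_ratio(df, "Alone")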

With that, let's turn to the continuous features in the dataset, starting with Age. We write a function that bins the feature in question and plots, again, the survival ratio per bin:
def plot_survival_cont(df: pd.DataFrame, column: str) -> None:
    # Drop rows with NaN values in the column of interest
    df = df.dropna(subset=[column]).copy()
    # Bin values into 10 categories
    num_bins = 10
    bin_edges = np.linspace(df[column].min(), df[column].max(), num_bins + 1)
    bin_centers = [
        (bin_edges[i] + bin_edges[i + 1]) / 2
        for i in range(len(bin_edges) - 1)
    ]
    bin_width = bin_edges[1] - bin_edges[0]
    df[f"{column}_bin"] = pd.cut(
        df[column], bins=bin_edges, include_lowest=True, right=True
    )
    # Calculate the ratio of 'Survived' per 'column' bin
    ratio_survived_per_bin = df.groupby(f"{column}_bin")["Survived"].mean()
    plt.bar(
        bin_centers,
        ratio_survived_per_bin.values,
        width=bin_width * 0.8,
    )
    plt.xlabel(column)
    plt.ylabel("Ratio of Survived")
    plt.title(f"Ratio of Survived per {column} Bin")
    plt.show()
Note we drop NaN rows in this function so they do not distort our histogram – but we still need to handle these values later for the model.
For Age, we obtain the following:
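The plot is produced by our new helper:

plot_survival_cont(df, "Age")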

Although the results are not super clear, it looks like younger people had a higher chance of survival – which makes sense, as children were evacuated first along with the women. Curiously, there is also a spike in survival ratio for our seniors (I will not delve into this here and leave it for discussion – let me know in the comments what you think; maybe it goes hand-in-hand with a higher social status?).
Next, we repeat the same plot for Fare:
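Again using our helper:

plot_survival_cont(df, "Fare")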

Here we see that, on average, the lower the fare, the lower the chances of survival – which makes sense and agrees with our previous diagram displaying the survival ratio per passenger class.
For brevity, we have not discussed all features – and a good Kaggler / data scientist would dive much deeper here; as mentioned in the introduction, understanding the data is crucial. I decided not to, as this post is merely an introductory tutorial, and we already achieve good results with the presented features. Further, my main area of expertise and interest is machine learning – i.e. discussing and sharing interesting modelling ideas with you all – which we will do in the next section.
Modelling
Having gained a rough understanding of the data, let's move on to training a model and making our (potentially very first) submission to Kaggle. Since I am mainly interested in neural networks and similar models, I will use a simple MLP to fit the data. Note this is likely not the best choice, among other reasons due to the small dataset size – but it works for our purposes.
First though, let's bring the data into the right shape and revisit the features we use. To begin, we extract the additional features:
Python">def prepare_data(path: str) -> pd.DataFrame:
"""Prepares data, i.e. reads data file
and generates all required features.
Args:
path: path to data file
Returns:
resulting data frame
"""
df = pd.read_csv(path)
# Replace NaN / Unk tokens
df["Age"].fillna(0, inplace=True)
df["Embarked"].fillna("Unk", inplace=True)
# Add "Title" column
extract_title(df)
# Derive whether passenger is travelling alone
family_size = df["Parch"] + df["SibSp"]
df["Alone"] = family_size == 0
return df
For the model, we will consider the categorical features "Sex", "Pclass", "Embarked", "Title" and "Alone". Further, we will use the numerical features "Age" and "Fare". Note it might be helpful to discretize these into bins, in order for the model to avoid overfitting – but I wanted to showcase how to handle continuous data, too.
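As a small sketch of what such a discretization could look like (not used in our final model; the bin boundaries are arbitrary choices for illustration):

# Optional alternative: bin continuous features into coarse categories
df["AgeBin"] = pd.cut(
    df["Age"], bins=[0, 12, 18, 35, 60, 120], labels=False, include_lowest=True
)
# Quartile-based bins for Fare
df["FareBin"] = pd.qcut(df["Fare"], q=4, labels=False, duplicates="drop")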
Next, we process this dataset to generate normalized features, and split it into train and validation:
def prepare_train_data(
    df: pd.DataFrame,
) -> tuple[DataLoader, DataLoader, ColumnTransformer]:
    """Gets train data ready for the model.
    Extracts the columns of relevance from the dataset,
    splits into train and val, then transforms
    the features via scikit-learn functionality.
    Lastly generates data loaders for train and val.
    Args:
        df: full dataset
    Returns:
        tuple of: data loaders for train and val, fitted feature transformation pipeline
    """
    x = df[CATEGORICAL_COLUMNS + NUMERIC_COLUMNS]
    y = df[TARGET_COLUMN].values
    # Split dataset into train and val
    x_train, x_val, y_train, y_val = train_test_split(
        x, y, test_size=TEST_RATIO, random_state=RANDOM_SEED
    )
    # Data pre-processing pipeline - fit on train,
    # then apply to val / test
    pipeline = ColumnTransformer(
        [
            ("num", StandardScaler(), NUMERIC_COLUMNS),
            (
                "cat",
                # Note: for scikit-learn >= 1.2, use sparse_output=False instead
                OneHotEncoder(sparse=False, handle_unknown="ignore"),
                CATEGORICAL_COLUMNS,
            ),
        ]
    )
    x_train = pipeline.fit_transform(x_train)
    x_val = pipeline.transform(x_val)
    return (
        generate_data_loader(x_train, y_train, True),
        generate_data_loader(x_val, y_val, False),
        pipeline,
    )
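generate_data_loader lives in the repository; a minimal sketch of what it might look like (the batch size is an assumption):

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset


def generate_data_loader(x: np.ndarray, y: np.ndarray, shuffle: bool) -> DataLoader:
    # Wrap the numpy arrays into tensors and a standard PyTorch data loader
    dataset = TensorDataset(
        torch.tensor(x, dtype=torch.float32),
        torch.tensor(y, dtype=torch.float32),
    )
    return DataLoader(dataset, batch_size=32, shuffle=shuffle)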
Let's go over this line by line, but first a general comment on which datasets we will use: as stated above, the Titanic competition comes with two csv files, train and test. You work with train, and then have to submit predictions for test. We leave "test" as our test subset, but further split "train" into two subsets, namely train and validation. We will train on train, and validate our models on val – e.g. to determine when to do early-stopping during training.
Towards the beginning of the above function, we do this splitting. Then we use a neat feature from sklearn to bring our data into a format our MLP can process: continuous data is standardized, and discrete features are converted to one-hot vectors.
For the train set, this pipeline is "fitted", i.e. quantities such as the sample mean and variance are estimated. For val and test (this is why we return the pipeline, too), only the already-fitted pipeline is applied – so no information leaks into the test part.
Our model file looks as follows – a standard MLP with three layers:
import torch
import torch.nn as nn


class MLP(nn.Module):
    def __init__(self, input_size: int, output_size: int) -> None:
        """Simple MLP model with which we will fit the Titanic dataset."""
        super().__init__()
        self.fc1 = nn.Linear(input_size, 64)
        self.fc2 = nn.Linear(64, 128)
        self.fc3 = nn.Linear(128, output_size)
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        """Model forward.
        Args:
            x: input tensor [bs, input_size]
        Returns:
            tuple of logits before sigmoid [bs, 2] and resulting prediction [bs]
        """
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.relu(x)
        x = self.fc3(x)
        pred = self.sigmoid(x)
        return x, torch.argmax(pred, dim=1)
forward() returns, on the one hand, the unnormalized logits – since we will be using the BCEWithLogitsLoss loss for better numerical stability – and, on the other hand, the resulting predictions obtained by applying sigmoid and argmax, since we will need these quite often.
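As a quick sanity check of the shapes, a hypothetical snippet (the real INPUT_SIZE depends on the number of one-hot dimensions produced by the pipeline; 16 is a placeholder):

model = MLP(input_size=16, output_size=2)
dummy_batch = torch.randn(4, 16)  # batch of 4 samples
logits, predictions = model(dummy_batch)
print(logits.shape, predictions.shape)  # torch.Size([4, 2]) torch.Size([4])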
With that, we can define the training loop of our program:
def train() -> tuple[ColumnTransformer, torch.nn.Module]:
    """Main training loop.
    Returns:
        tuple of fitted transformation pipeline and trained model
    """
    df = prepare_data(PATH_TO_TRAIN_FILE)
    (
        train_loader,
        val_loader,
        pipeline,
    ) = prepare_train_data(df)
    model = MLP(INPUT_SIZE, OUTPUT_SIZE)
    criterion = nn.BCEWithLogitsLoss()
    optimizer = optim.Adam(model.parameters(), lr=LR, weight_decay=1e-3)
    scheduler = MultiStepLR(optimizer, milestones=[80], gamma=0.1)
    best_val_acc = 0
    # Training loop
    for epoch in range(NUM_EPOCHS):
        model.train()
        running_loss = 0.0
        correct = 0
        for inputs, labels in train_loader:
            optimizer.zero_grad()
            logits, predictions = model(inputs)
            loss = criterion(
                logits,
                torch.nn.functional.one_hot(
                    labels.long(), num_classes=2
                ).float(),
            )
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
            correct += (predictions == labels).float().mean().item()
        scheduler.step()
        avg_loss = running_loss / len(train_loader)
        acc = correct / len(train_loader)
        print(
            f"Epoch [{epoch + 1}/{NUM_EPOCHS}], Loss: {avg_loss:.4f}, Accuracy: {acc:.4f}"
        )
        if epoch % 10 == 9:
            # Evaluation on the validation set
            model.eval()
            with torch.no_grad():
                correct = 0
                for inputs, labels in val_loader:
                    _, predictions = model(inputs)
                    correct += (predictions == labels).float().mean().item()
                acc = correct / len(val_loader)
                print(f"Accuracy on validation set: {acc:.4f}")
                # Optional experiment tracking via wandb
                wandb.log({"val_accuracy": acc})
                if acc > best_val_acc:
                    best_val_acc = acc
                    print("New best model found")
                    torch.save(model.state_dict(), MODEL_PATH)
    model.load_state_dict(torch.load(MODEL_PATH))
    return pipeline, model
We first prepare the data as described above, then build our train / val data loaders, and create the model, loss, optimizer, and a learning rate scheduler that reduces the learning rate towards the end of training. We then run training for NUM_EPOCHS, continuously computing loss and accuracy. Every ten epochs we evaluate on the validation set, and checkpoint the model whenever the validation accuracy improves.
Lastly the main function:
def main():
    # Make experiments reproducible
    set_seeds(RANDOM_SEED)
    pipeline, model = train()
    # Load the test dataset for making predictions
    df = prepare_data(PATH_TO_TEST_FILE)
    test_loader, ids = prepare_test_data(df, pipeline)
    # Make predictions on the test set
    model.eval()
    test_predictions = []
    with torch.no_grad():
        for inputs, _ in test_loader:
            _, predictions = model(inputs)
            test_predictions.extend(predictions.numpy())
    output_df = pd.DataFrame(
        {
            "PassengerId": ids,
            "Survived": [
                int(x) if not math.isnan(x) else 0 for x in test_predictions
            ],
        }
    )
    output_df.to_csv("predictions.csv", index=False)


if __name__ == "__main__":
    main()
As you can see, we run the training loop, and then use the fitted data pipeline to pre-process the test set. Then we use the trained model to obtain predictions for this test set, and dump these to a csv file. The format for this is specified in the competition description – for this competition we just have two columns: PassengerId, and whether we believe the corresponding passenger survived or not.
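The helpers set_seeds and prepare_test_data are part of the repository; minimal sketches of what they plausibly do (the exact details are assumptions):

import random

import numpy as np
import torch


def set_seeds(seed: int) -> None:
    # Fix all relevant random number generators for reproducibility
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)


def prepare_test_data(df: pd.DataFrame, pipeline: ColumnTransformer) -> tuple[DataLoader, pd.Series]:
    # Apply the already-fitted pipeline - no re-fitting on test data
    x = pipeline.transform(df[CATEGORICAL_COLUMNS + NUMERIC_COLUMNS])
    # Dummy labels, since the test set has no "Survived" column
    y = np.zeros(len(df))
    return generate_data_loader(x, y, False), df["PassengerId"]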
Submitting to Kaggle
How to submit to a competition depends on the competition itself – but in general there are file and notebook submissions. The Titanic competition uses a file submission. Thus, as we have seen above, we need to create a file containing the expected predictions, and submit this.
For this, we can either open the competition website, click the big "Submit to Competition" button, and drag & drop our prediction file there. Or we make use of the kaggle pip package and do everything from the command line:
kaggle competitions submit -c titanic -f predictions.csv -m "DESCRIPTION"
And that's it. Congratulations! You have now completed your first Kaggle submission, and will show up on the "Titanic" leaderboard.
Results
Upon executing the previous steps, a new submission is generated in Kaggle, and we can see it, as well as the corresponding score, under the "Submissions" tab. If you executed exactly the code above, since we use a fixed random seed, you should get the exact same result as me: 0.79186 – meaning we predicted with around 79% accuracy who died and who were the lucky survivors. Not bad, huh?
Let's see how we stack up against others – for that, we jump to the "Leaderboard" tab. Here we see that we rank at spot 1160 out of 15904 (as of March 21st, 2024) – that is, top 7.2%!
Discussion
A few notes on this though, starting with what is, in my opinion, the most important message: don't stress over this. This is a great competition to get started with, but a lower leaderboard position does not mean you did not achieve anything: this competition is really tight, with a 0.1% difference in accuracy (one differently classified person) translating to hundreds or thousands of spots on the leaderboard. This post shows the exact code that gives you a top 7% ranking – but I had to try different hyperparameters, and even different random seeds, to reach this; other seeds resulted in severely lower scores.
This is essentially true for all Kaggle competitions, but amplified here due to the small scale of the competition: machine learning is essentially about finding the right posterior given all prior information – a task inherently involving uncertainty. This uncertainty is commonly split into aleatoric and epistemic uncertainty. The latter comes from the model, and we can reduce it via e.g. more powerful models or more training. The former, aleatoric uncertainty, stems from the data itself – we cannot reduce it. In our case, we simply don't have all the information (or maybe it is an impossible ask) to predict who survived and who died in a catastrophe. There are so many other relevant factors – and even plain luck and chance.
Lastly: the top solutions for this competition all score 100% – i.e. they predict everything correctly. This is most likely done via "leaderboard hacking" – due to the nature of the submission process, you can simply keep submitting, iteratively changing single predictions, until you reach 100%. And / or: maybe one can use the actual historical data to solve this "riddle".
Conclusion
In this post, we used Kaggle's Titanic competition to introduce Kaggle, as well as general tips and tricks for solving competitions and submitting to them.
We started by exploring and visualizing the given data for the "Titanic" competition, generating and selecting relevant features. Using these we demonstrated data pre-processing as well as a simple MLP model to achieve the desired task, namely predicting the survivors of the sinking of the Titanic.
Next we showed how to generate the submission file and submit it to Kaggle, scoring top 7.2% on the leaderboard. All code to reproduce this result can be found on Github.
Thanks for reading, and all the best on Kaggle!
NOTE: Kaggle's licensing policy prohibits commercial use, which conflicts with this post. Thus, I have used this version of the Titanic dataset, available under the MIT license, and distilled the same train / val / test splits. What does this mean for you? Nothing.