Spring 2025 Data Science Project: Berlin Marathon Analysis¶

Team Members and Contributions¶

Ben Talesnik & Zach Hiner

Contributions:¶

For each team member, list their contributions across the following categories:

  • A: Project idea

    • We both contributed to the project idea due to our interest in the sport of running. We are both fanscinated in marathon running and wanted to analyze the data to see how the sport has evolved over time.
  • B: Dataset Curation and Preprocessing

    • Ben was responsible for curating the dataset and preprocessing the data. He also created the initial data exploration and summary statistics.
  • C: Data Exploration and Summary Statistics

    • Zach was responsible for the gender performance difference, year-over-year performance trends, and a pace analysis. He also created the visualization and result analysis.
  • D: ML Algorithm Design/Development

    • Ben was responsible for the linear regression model and the forecast analysis. He also created the 5th percentile finish time analysis.
  • E: ML Algorithm Training and Test Data Analysis

    • Zach helped code the linear regression model and the forecast analysis. He was able to determine the best fit line for the data and the forecast for the future.
  • F: Visualization, Result Analysis, Conclusion

    • Both of us worked to create the visualization and result analysis. We both worked to create the conclusion and write the tutorial report.
  • G: Final Tutorial Report Creation

    • We both met to write the tutorial report and finalize the project.
  • H: Additional

    • We also both worked together to ensure all of our progress was stored in the GitHub repository.

1. Introduction¶

This project analyzes Berlin Marathon results from 1999 to 2019, one of the world's most prestigious marathon events. Through this analysis, we aim to:

  1. Has the average time to finish the marathon decreased over time?
  2. How has the average time to finish the marathon changed over time for different genders?
  3. How has the average time to finish the marathon changed over time for different countries?

Understanding these patterns is important because it can help us understand how the sport of marathon running has evolved over time. It demonstrates the impact of technology, training, and other factors on marathon performance. With the introduction of carbon plated shoes, new fitness tests, and other factors, it is important to understand how these changes have affected the sport.

2. Data Curation and Preprocessing¶

Data Source¶

Marathon finisher results from the Berlin marathon between 1999 and 2019. Miller, Andrew, Marathon Results, (2021), GitHub Repository,
https://github.com/AndrewMillerOnline/marathon-results

Initial Data Overview¶

The dataset contains information about the runners, their times, and other details from a few of the world's most prestigious marathons. We have chosen to focus on the Berlin Marathon due to the completeness of the dataset. The dataset includes information including:

  • Place
  • Name
  • Nationality
  • Club
  • Time
  • Gender
  • Splits for 5k, 10k, 15k, 20k, 25k, 30k, 35k, 40k
  • Year

This dataset is useful for our analysis because it contains a large amount of information about the runners, including their times, gender, nationality, and club.

In [212]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from scipy import stats
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Set style for all visualizations
sns.set_palette('husl')
plt.style.use("seaborn-v0_8")

# Configure pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 10)

Data Loading and Preprocessing¶

The dataset overall is very clean and does not require much preprocessing. With the initial import, we're focusing on ensuring the current years are imported (1999-2019). With each year added to the dataframe, a year column is created to keep track of the year of the marathon.

In [213]:
START_YEAR = 1999
END_YEAR = 2019

# Create a list of dataframes
dfs = []
for year in range(START_YEAR, END_YEAR + 1):
    df = pd.read_csv(f'./data/results-{year}.csv')
    df['year'] = year
    dfs.append(df)

# Concatenate all dataframes
results = pd.concat(dfs, ignore_index=True)

# Drop rows with missing values, even though we know there are none
results = results.dropna()

display(results)
place_overall first_name last_name nationality club time_full gender split_5k split_10k split_15k split_20k time_half split_25k split_30k split_35k split_40k Race year
0 1 Josephat Kiprono KEN Kenia 02:06:44 M 00:15:09 00:30:21 00:45:25 01:00:34 01:03:54 01:14:54 01:30:10 01:45:05 02:00:08 Berlin 1999
1 2 Takayuki Inubushi JPN Japan 02:06:57 M 00:15:09 00:30:21 00:45:27 01:00:36 01:03:54 01:14:56 01:30:18 01:45:21 02:00:18 Berlin 1999
2 3 Samson Kandie KEN Kenia 02:08:31 M 00:15:10 00:30:21 00:45:27 01:00:35 01:03:54 01:14:55 01:30:18 01:45:59 02:01:41 Berlin 1999
3 4 Hicham Chatt MAR Marokko 02:09:56 M 00:15:11 00:30:23 00:45:27 01:00:36 01:03:55 01:14:56 01:30:24 01:46:14 02:02:52 Berlin 1999
4 5 Henry Cherono KEN Kenia 02:10:37 M 00:15:10 00:30:21 00:00:00 01:00:35 01:03:55 01:14:55 01:30:18 01:46:14 02:02:58 Berlin 1999
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
678702 30739 Wolfgang Biedermann GER SC . LT SAD ,Spiridon 07:12:30 M 00:39:11 01:26:52 02:15:21 03:09:25 03:22:13 04:06:45 05:04:06 05:56:57 06:49:33 Berlin 2019
678705 13243 Elvira Ernst GER 261 Fearless 07:14:44 W 00:46:10 01:34:15 02:23:21 03:12:21 03:22:44 04:02:34 04:56:52 05:50:44 06:48:44 Berlin 2019
678706 30740 Gert Kleinschmidt GER Dessauer SV 97 07:15:28 M 00:49:05 01:42:38 02:38:02 03:27:30 03:37:40 04:16:57 05:09:49 06:01:58 06:54:25 Berlin 2019
678709 30741 Oscar Martinez USA BRIDERUNNERS 07:19:28 M 00:37:26 01:18:14 02:05:06 02:59:24 03:10:51 03:56:15 04:58:07 06:01:29 06:56:38 Berlin 2019
678710 30742 Dietmar John GER Bayerische Laufzeitung-Der Che 07:26:09 M 00:42:11 01:29:17 02:21:20 03:13:21 03:24:52 04:09:27 05:07:35 06:04:09 07:01:16 Berlin 2019

290390 rows × 18 columns

Now we need to convert the time column to a more usable format. We need to go through each of the time columns and convert them to a datetime object. Each time is converted to minutes for easier analysis.

In [214]:
# Convert time columns to datetime
time_columns = ['time_full', 'split_5k', 'split_10k', 'split_15k', 'split_20k', 
                'time_half', 'split_25k', 'split_30k', 'split_35k', 'split_40k']

for col in time_columns:
    dt = pd.to_datetime(results[col], format='%H:%M:%S')
    results[f'{col}_minutes'] = (
        dt.dt.hour * 60 +
        dt.dt.minute +
        dt.dt.second / 60
    )

# Remove any rows where the time is under 100 minutes
results = results[results['time_full_minutes'] >= 100]

3. Exploratory Data Analysis¶

In this section, we'll explore key characteristics of our dataset through summary statistics and visualizations.

Summary Statistics¶

In [215]:
# Basic information about the dataset
display("Dataset Shape:", results.shape)
display("Columns:", results.columns.tolist())
display("Data Types:", results.dtypes)
display("Missing Values:", results.isnull().sum())
display("Mean:", results['time_full_minutes'].mean())
display("Median:", results['time_full_minutes'].median())
'Dataset Shape:'
(290389, 28)
'Columns:'
['place_overall',
 'first_name',
 'last_name',
 'nationality',
 'club',
 'time_full',
 'gender',
 'split_5k',
 'split_10k',
 'split_15k',
 'split_20k',
 'time_half',
 'split_25k',
 'split_30k',
 'split_35k',
 'split_40k',
 'Race',
 'year',
 'time_full_minutes',
 'split_5k_minutes',
 'split_10k_minutes',
 'split_15k_minutes',
 'split_20k_minutes',
 'time_half_minutes',
 'split_25k_minutes',
 'split_30k_minutes',
 'split_35k_minutes',
 'split_40k_minutes']
'Data Types:'
place_overall          int64
first_name            object
last_name             object
nationality           object
club                  object
                      ...   
time_half_minutes    float64
split_25k_minutes    float64
split_30k_minutes    float64
split_35k_minutes    float64
split_40k_minutes    float64
Length: 28, dtype: object
'Missing Values:'
place_overall        0
first_name           0
last_name            0
nationality          0
club                 0
                    ..
time_half_minutes    0
split_25k_minutes    0
split_30k_minutes    0
split_35k_minutes    0
split_40k_minutes    0
Length: 28, dtype: int64
'Mean:'
np.float64(241.469336476244)
'Median:'
np.float64(237.23333333333332)

As we can see, the dataset is very clean and does not require much preprocessing due to the fact that there are no missing values. You can also see the columns that are time columns and have been converted to minutes while preserving the original time format. The mean and median are very close, which indicates that the dataset is not skewed. The average time to finish the marathon is 241.47 minutes (4 hours and 1 minute), with a median of 237.23 minutes (3 hours and 57 minutes).

We have chosen to look at a few different aspects of the dataset. We will be looking at the distribution of finish times, the gender distribution, and the correlation between splits.

With further statistical analysis, we will look at the gender performance difference, year-over-year performance trends, and a pace analysis.

Distribution of Finish Times¶

In [216]:
plt.figure(figsize=(12, 6))
sns.histplot(data=results, x='time_full_minutes', bins=50)
plt.title('Distribution of Marathon Finish Times, Berlin (1999-2019)')
plt.xlabel('Finish Time (minutes)')
plt.ylabel('Count')
plt.show()
No description has been provided for this image

This distribution of Marathon Finish Times from the Berlin Marathon 1999-2019 is right-skewed. It shows the mean of the distribution is around 4 hours. Elite runners finish quickly around 2 to 2.5 hours, creating a sharp peak at the left end. A smaller group takes 6+ hours due to exhaustion, injuries, or walking, forming the right tail.

Distribution of Genders¶

In [217]:
gender_counts = results['gender'].value_counts()

plt.figure(figsize=(8, 6))
plt.pie(gender_counts, labels=gender_counts.index, autopct='%1.1f%%')
plt.title('Gender Distribution in Marathon Participants, Berlin (1999-2019)')
plt.show()
No description has been provided for this image

The gender distribution shows that there are way more male participants than female participants. Between 1999 and 2019, 79% of participants were male while only 21% were female.

Correlation Between Splits¶

In [218]:
split_columns = [col for col in results.columns if 'minutes' in col]
correlation_matrix = results[split_columns].corr()

plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation between Marathon Splits, Berlin (1999-2019)')
plt.show()
No description has been provided for this image

The correlation heatmap shows a strong positive relationship between marathon split times, meaning runners who start fast tend to maintain their pace. Earlier splits are highly correlated, while later splits show slightly weaker correlations due to fatigue or pacing strategies. Elite runners likely have the strongest correlations, as they maintain consistent pacing. Overall, the heatmap highlights the predictable relationship between split times and overall marathon performance.

Statistical Analysis 1: Gender Performance Difference¶

In [219]:
# Compare finish times between genders
male_times = results[results['gender'] == 'M']['time_full_minutes']
female_times = results[results['gender'] == 'W']['time_full_minutes']

# Calculate basic statistics
male_mean = male_times.mean()
female_mean = female_times.mean()
male_median = male_times.median()
female_median = female_times.median()

# Perform t-test to check if the difference is statistically significant
t_stat, pvalue = stats.ttest_ind(male_times, female_times, equal_var=False)

# Visualize the distributions with a histogram
plt.figure(figsize=(12, 6))
sns.histplot(male_times, color='blue', alpha=0.5, label='Male', kde=True)
sns.histplot(female_times, color='red', alpha=0.5, label='Female', kde=True)
plt.axvline(male_mean, color='blue', linestyle='--', label=f'Male Mean: {male_mean:.2f} min')
plt.axvline(female_mean, color='red', linestyle='--', label=f'Female Mean: {female_mean:.2f} min')
plt.title('Distribution of Marathon Finish Times by Gender, Berlin (1999-2019)')
plt.xlabel('Finish Time (minutes)')
plt.ylabel('Count')
plt.legend()
plt.show()

display("Statistical Analysis 1: Gender Performance Difference")
display(f"Male Mean: {male_mean:.2f} minutes")
display(f"Female Mean: {female_mean:.2f} minutes")
display(f"Difference: {abs(male_mean - female_mean):.2f} minutes")
display(f"Male Median: {male_median:.2f} minutes")
display(f"Female Median: {female_median:.2f} minutes")
display(f"T-statistic: {t_stat:.4f}")
display(f"P-value: {pvalue:.4f}")
No description has been provided for this image
'Statistical Analysis 1: Gender Performance Difference'
'Male Mean: 235.74 minutes'
'Female Mean: 263.07 minutes'
'Difference: 27.33 minutes'
'Male Median: 232.02 minutes'
'Female Median: 260.12 minutes'
'T-statistic: -146.8780'
'P-value: 0.0000'

This statistical analysis shows a significant gender performance difference in marathon times, with males finishing faster on average. The mean male finish time is 235.73 minutes, while the female mean is 263.07 minutes, a 27.33 minute gap. The medians follow a similar pattern, reinforcing the trend. The extremely low p-value (0.0000) and large negative t-statistic (-146.88) indicate that this difference is highly statistically significant, meaning it is unlikely due to chance.

Statistical Analysis 2: Year-over-Year Performance Trend¶

In [220]:
yearly_means = results.groupby('year')['time_full_minutes'].mean()

# Visualize the trend with a line plot
plt.figure(figsize=(12, 6))
yearly_means.plot(kind='line', marker='o')
plt.title('Average Marathon Finish Time by Year, Berlin (1999-2019)')
plt.xlabel('Year')
plt.ylabel('Average Finish Time (minutes)')
plt.grid(True)
plt.show()

# Perform linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(yearly_means.index, yearly_means.values)

display("Statistical Analysis 2: Year-over-Year Performance Trend")
display(f"Slope: {slope:.4f}")
display(f"R-squared: {r_value**2:.4f}")
display(f"P-value: {p_value:.4f}")
No description has been provided for this image
'Statistical Analysis 2: Year-over-Year Performance Trend'
'Slope: 0.5230'
'R-squared: 0.3861'
'P-value: 0.0034'

This analysis examines the year-over-year trend in marathon finish times from 1999 to 2019. The positive slope (0.5229) suggests that average finish times have been gradually increasing, meaning runners are taking slightly longer to complete the marathon over time. We think this may be due to growth in marathon popularity as more casual runners and first-timers are participating, which could be pulling the average finish time higher. The R-squared value (0.3860) indicates a moderate correlation, meaning other factors also influence performance. The low p-value (0.0035) confirms that the trend is statistically significant and unlikely due to random chance. This suggests possible influences like changes in race conditions, participation demographics, or training methods over the years.

Statistical Analysis 3: Pace Analysis¶

In [221]:
# Calculate pace (minutes per kilometer) for each split
for i in range(1, len(split_columns)):
    split_distance = 5  # Each split is 5km
    results[f'pace_{i}'] = (results[split_columns[i]] - results[split_columns[i-1]]) / split_distance

# Box plot of pace distribution across splits
plt.figure(figsize=(12, 6))
pace_columns = [col for col in results.columns if 'pace_' in col]
results[pace_columns].boxplot()
plt.title('Pace Distribution Across Marathon 5k Splits, Berlin (1999-2019)')
plt.xlabel('Split Number')
plt.ylabel('Pace (minutes per kilometer)')
plt.show()

# Perform ANOVA test for pace differences across splits
pace_data = [results[col].dropna() for col in pace_columns]
f_stat, p_value = stats.f_oneway(*pace_data)

display("Statistical Analysis 3: Pace Analysis")
display(f"F-statistic: {f_stat:.4f}")
display(f"P-value: {p_value:.4f}")
No description has been provided for this image
'Statistical Analysis 3: Pace Analysis'
'F-statistic: 6325841.9236'
'P-value: 0.0000'

This analysis explores the pace distribution across different 5k splits of the marathon from 1999 to 2019. By calculating the pace for each split, it provides insight into how runners' speeds change throughout the race. The box plot visually displays the distribution of paces across splits, showing that runners start faster and gradually slow down, especially as fatigue sets in. The ANOVA test results with an F-statistic of 6325419.9672 and a p-value of 0.0000 confirms that there are significant differences in pace across the splits. This suggests that runners generally run faster in the earlier splits and slow down in the later ones, which is typical in endurance races such as marathons due to factors like fatigue, pacing strategy, and race conditions. The very low p-value indicates that these differences in pace are statistically significant and not due to random chance.

4. & 5. Primary Analysis / Visualization & Results¶

Based on the results of our exploration, we have chosen to use a linear regression model to predict how the average and 5th percentile finish times have changed over time. We believe that the 5th percentile finish time is a more accurate representation of the performance of the fastest runners, and therefore more useful for predicting the future.

Linear Regression Model¶

We will be using a linear regression model to predict the finish time of a runner based on their gender. A linear regression model is used in data science to predict the relationship between a dependent variable and one or more independent variables. In this case, the dependent variable is the finish time of a runner, and the independent variable is their gender.

Let's first create a copy of the results dataframe to work with and standardize the gender column.

In [222]:
df = results.copy()
df['gender_code'] = df['gender'].str.strip().str.upper().str[0]

Now, we will compute the average finish time per year for each gender.

In [223]:
avg_overall = (
    df
    .groupby('year', as_index=False)['time_full_minutes']
    .mean()
)
avg_male = (
    df[df['gender_code']=='M']
    .groupby('year', as_index=False)['time_full_minutes']
    .mean()
)
avg_female = (
    df[df['gender_code']=='W']
    .groupby('year', as_index=False)['time_full_minutes']
    .mean()
)

display("Average Finish Time per Year", avg_overall)
display("Average Finish Time per Year for Males", avg_male)
display("Average Finish Time per Year for Females", avg_female)
'Average Finish Time per Year'
year time_full_minutes
0 1999 230.064991
1 2000 240.524645
2 2001 233.795560
3 2002 232.986569
4 2004 235.138728
... ... ...
15 2015 240.178294
16 2016 243.868154
17 2017 244.849287
18 2018 251.155402
19 2019 243.218337

20 rows × 2 columns

'Average Finish Time per Year for Males'
year time_full_minutes
0 1999 226.954815
1 2000 236.320251
2 2001 229.583860
3 2002 229.875422
4 2004 230.189993
... ... ...
15 2015 234.356227
16 2016 237.275377
17 2017 237.856427
18 2018 243.476952
19 2019 234.839135

20 rows × 2 columns

'Average Finish Time per Year for Females'
year time_full_minutes
0 1999 248.983984
1 2000 261.797333
2 2001 256.749426
3 2002 255.008837
4 2004 256.924887
... ... ...
15 2015 260.288286
16 2016 264.953039
17 2017 264.489884
18 2018 270.961225
19 2019 265.641949

20 rows × 2 columns

Now, we will fit a linear regression model to the data. We will use the year as the independent variable and the average finish time as the dependent variable.

In [224]:
# Fit linear trend and forecast 2020–2050
def fit_and_forecast(df, start=2020, end=2050):
    if len(df) < 2:
        return pd.DataFrame({'year': [], 'pred': []})
    X = df[['year']].values
    y = df['time_full_minutes'].values
    model = LinearRegression().fit(X, y)
    future_years = np.arange(start, end + 1)
    preds = model.predict(future_years.reshape(-1, 1))
    return pd.DataFrame({'year': future_years, 'pred': preds})

fc_overall = fit_and_forecast(avg_overall)
fc_male    = fit_and_forecast(avg_male)
fc_female  = fit_and_forecast(avg_female)

Now that we have the forecasted values, we can plot them.

In [225]:
# Plot each series on its own figure
series = [
    ('Overall Average', avg_overall, fc_overall),
    ('Male Average',    avg_male,    fc_male),
    ('Female Average',  avg_female,  fc_female)
]

for title, hist, fc in series:
    plt.figure(figsize=(8, 4))
    plt.plot(hist['year'], hist['time_full_minutes'],
             marker='o', linestyle='-', label='Historical')
    if not fc.empty:
        plt.plot(fc['year'], fc['pred'],
                 marker='x', linestyle='--', label='Forecast')
    plt.title(f"{title} Finish Time")
    plt.xlabel("Year")
    plt.ylabel("Average Finish Time (min)")
    plt.legend()
    plt.grid(alpha=0.3)
    plt.tight_layout()
    plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

You can see that the forecasted values are very close to the historical values, which is a good sign. The forecasted values are a little higher than the historical values, which is expected. Let's now look at the forecasted values for the 10 years following 2019.

In [226]:
forecast_df = None
for fc, key in [(fc_overall, 'overall'),
                (fc_male,    'male'),
                (fc_female,  'female')]:
    col = f'pred_{key}'
    df_tmp = fc.rename(columns={'pred': col})
    if forecast_df is None:
        forecast_df = df_tmp
    else:
        forecast_df = forecast_df.merge(df_tmp, on='year', how='outer')

display(forecast_df.head(11))
year pred_overall pred_male pred_female
0 2020 246.700278 239.066187 267.355158
1 2021 247.223248 239.390388 267.846358
2 2022 247.746218 239.714589 268.337559
3 2023 248.269189 240.038790 268.828760
4 2024 248.792159 240.362990 269.319960
... ... ... ... ...
6 2026 249.838100 241.011392 270.302362
7 2027 250.361071 241.335593 270.793562
8 2028 250.884041 241.659793 271.284763
9 2029 251.407012 241.983994 271.775964
10 2030 251.929982 242.308195 272.267164

11 rows × 4 columns

As you can see, the predicted values are slightly increasing over time, which is not what we initially expected. As running technology has improved, we would expect the average finish time to decrease over time. However, this is not the case.

After further thought, we believe that the reason for this is the increase in popularity of the marathon. Since Berlin uses a lottery system as the main way to select participants, the times are not as competitive as a marathon like the Boston Marathon which requires a qualifying time.

With more people wanting to run the marathon, the average finish time has increased.

Now, let's look at the 5th percentile finish time. This captures the performance of the fastest runners, and therefore more useful for predicting the future of the elite runners.

In [227]:
# Compute 5th-percentile finish time per year
p5_overall = (
    df
    .groupby('year', as_index=False)['time_full_minutes']
    .quantile(0.05)
    .rename(columns={'time_full_minutes':'time_5pct'})
)
p5_male = (
    df[df['gender_code']=='M']
    .groupby('year', as_index=False)['time_full_minutes']
    .quantile(0.05)
    .rename(columns={'time_full_minutes':'time_5pct'})
)
p5_female = (
    df[df['gender_code']=='W']
    .groupby('year', as_index=False)['time_full_minutes']
    .quantile(0.05)
    .rename(columns={'time_full_minutes':'time_5pct'})
)

display("5th Percentile Finish Time per Year", p5_overall)
display("5th Percentile Finish Time per Year for Males", p5_male)
display("5th Percentile Finish Time per Year for Females", p5_female)
'5th Percentile Finish Time per Year'
year time_5pct
0 1999 174.600000
1 2000 178.503333
2 2001 176.500000
3 2002 172.883333
4 2004 175.809167
... ... ...
15 2015 174.815833
16 2016 175.596667
17 2017 176.086667
18 2018 176.410000
19 2019 171.783333

20 rows × 2 columns

'5th Percentile Finish Time per Year for Males'
year time_5pct
0 1999 173.366667
1 2000 177.485000
2 2001 175.033333
3 2002 171.716667
4 2004 173.750000
... ... ...
15 2015 171.958333
16 2016 172.193333
17 2017 172.290833
18 2018 172.530000
19 2019 167.253333

20 rows × 2 columns

'5th Percentile Finish Time per Year for Females'
year time_5pct
0 1999 198.833333
1 2000 204.840833
2 2001 204.325000
3 2002 196.910000
4 2004 201.631667
... ... ...
15 2015 199.773333
16 2016 201.045833
17 2017 200.101667
18 2018 199.150000
19 2019 197.733333

20 rows × 2 columns

Now, we will fit a linear regression model to the 5th percentile finish time.

In [228]:
# Helper to fit & forecast
def fit_and_forecast(df, label, start=2020, end=2050):
    if len(df) < 2:
        return pd.DataFrame({'year':[], f'pred_{label}':[]})
    X = df[['year']].values
    y = df['time_5pct'].values
    model = LinearRegression().fit(X, y)
    future = np.arange(start, end+1)
    preds = model.predict(future.reshape(-1,1))
    return pd.DataFrame({'year':future, f'pred_{label}':preds})

fc_overall = fit_and_forecast(p5_overall, 'overall')
fc_male    = fit_and_forecast(p5_male,    'male')
fc_female  = fit_and_forecast(p5_female,  'female')

We can now plot the fit and forecast for the 5th percentile finish time.

In [229]:
# Plot each on its own figure
series = [
    ('Overall 5th-%ile', p5_overall, fc_overall),
    ('Male 5th-%ile',    p5_male,    fc_male),
    ('Female 5th-%ile',  p5_female,  fc_female),
]

for title, hist, fc in series:
    plt.figure(figsize=(8,4))
    plt.plot(hist['year'], hist['time_5pct'],
             marker='o', linestyle='-', label='Historical')
    if not fc.empty:
        plt.plot(fc['year'], fc[f'pred_{title.split()[0].lower()}'],
                 marker='x', linestyle='--', label='Forecast')
    plt.title(f"{title} Finish Time")
    plt.xlabel("Year")
    plt.ylabel("5th Percentile Finish Time (min)")
    plt.legend()
    plt.grid(alpha=0.3)
    plt.tight_layout()
    plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

We can see that the forecasted values are now decreasing over time, which is what we initially expected.

In [230]:
forecast_df = None
for fc, key in [
    (fc_overall, 'overall'),
    (fc_male,    'male'),
    (fc_female,  'female')
]:
    if forecast_df is None:
        forecast_df = fc.copy()
    else:
        forecast_df = forecast_df.merge(fc, on='year', how='outer')

display(forecast_df.head(11))
year pred_overall pred_male pred_female
0 2020 176.409344 172.740013 201.315839
1 2021 176.341105 172.542768 201.175376
2 2022 176.272866 172.345522 201.034913
3 2023 176.204627 172.148277 200.894450
4 2024 176.136389 171.951031 200.753987
... ... ... ... ...
6 2026 175.999911 171.556540 200.473062
7 2027 175.931672 171.359294 200.332599
8 2028 175.863433 171.162048 200.192136
9 2029 175.795194 170.964803 200.051673
10 2030 175.726955 170.767557 199.911210

11 rows × 4 columns

The model predicts that the 5th percentile finish time will continue to decrease over time, which is what we initially expected. We found it most interesting that the model predicts that the 5th percentile of men could finish in just 2 hours and 10 minutes by 2030.

Next, we will use a random forest model to predict the 5th percentile finish time by nationality. First, we have to set the parameters for the model.

In [231]:
SPLIT_COLS = [
    'split_5k_minutes', 'split_10k_minutes', 'split_15k_minutes',
    'split_20k_minutes', 'time_half_minutes',  'split_25k_minutes',
    'split_30k_minutes', 'split_35k_minutes', 'split_40k_minutes'
]

Now, let's compute the 5th percentile finish time by year and nationality. After that, we will compute the mean split times by year and nationality. Lastly, we will merge the two dataframes on the year and nationality columns.

In [232]:
# Compute the 5th-percentile finish time by year & nationality
df_p5 = (
    results
    .groupby(['year','nationality'], as_index=False)['time_full_minutes']
    .quantile(0.05)
    .rename(columns={'time_full_minutes':'time_5pct'})
)

# Compute mean split times by year & nationality
mean_splits = (
    results
    .groupby(['year','nationality'], as_index=False)[SPLIT_COLS]
    .mean(numeric_only=True)
)

# Merge the two dataframes on the year and nationality columns
df = pd.merge(df_p5, mean_splits, on=['year','nationality'], how='inner')

display("5th Percentile Finish Time by Year and Nationality", df_p5)
display("Mean Split Times by Year and Nationality", mean_splits)
display("Merged Dataframe", df)
'5th Percentile Finish Time by Year and Nationality'
year nationality time_5pct
0 1999 ALG 197.746667
1 1999 ARG 181.234167
2 1999 AUS 176.785000
3 1999 AUT 168.150000
4 1999 BEL 144.116667
... ... ... ...
1613 2019 VIE 202.687500
1614 2019 WLS 238.513333
1615 2019 YEM 291.883333
1616 2019 ZAM 153.538333
1617 2019 ZIM 289.350000

1618 rows × 3 columns

'Mean Split Times by Year and Nationality'
year nationality split_5k_minutes split_10k_minutes split_15k_minutes split_20k_minutes time_half_minutes split_25k_minutes split_30k_minutes split_35k_minutes split_40k_minutes
0 1999 ALG 27.458333 53.725000 34.233333 103.650000 109.441667 128.716667 155.941667 180.675000 205.825000
1 1999 ARG 0.000000 41.858333 62.775000 83.908333 88.675000 105.308333 128.008333 151.566667 176.675000
2 1999 AUS 22.180952 50.952381 75.569048 99.783333 105.314286 124.195238 150.426190 177.683333 206.145238
3 1999 AUT 16.486777 50.064738 72.923691 98.985399 104.813912 124.610055 151.518113 176.200895 207.681612
4 1999 BEL 17.378836 47.292328 69.284392 93.743386 97.434921 117.550794 142.859524 168.600529 194.963492
... ... ... ... ... ... ... ... ... ... ... ...
1613 2019 VIE 24.350000 48.100000 72.316667 95.541667 100.666667 119.141667 142.841667 166.483333 192.325000
1614 2019 WLS 28.333333 56.900000 85.755556 114.477778 120.655556 143.761111 173.461111 202.966667 233.827778
1615 2019 YEM 31.583333 64.800000 98.333333 132.183333 139.583333 167.016667 202.716667 240.600000 276.733333
1616 2019 ZAM 25.283333 50.983333 77.038889 102.533333 108.111111 128.777778 155.600000 182.977778 212.466667
1617 2019 ZIM 31.766667 62.866667 94.200000 125.516667 132.316667 157.166667 189.900000 226.216667 269.466667

1618 rows × 11 columns

'Merged Dataframe'
year nationality time_5pct split_5k_minutes split_10k_minutes split_15k_minutes split_20k_minutes time_half_minutes split_25k_minutes split_30k_minutes split_35k_minutes split_40k_minutes
0 1999 ALG 197.746667 27.458333 53.725000 34.233333 103.650000 109.441667 128.716667 155.941667 180.675000 205.825000
1 1999 ARG 181.234167 0.000000 41.858333 62.775000 83.908333 88.675000 105.308333 128.008333 151.566667 176.675000
2 1999 AUS 176.785000 22.180952 50.952381 75.569048 99.783333 105.314286 124.195238 150.426190 177.683333 206.145238
3 1999 AUT 168.150000 16.486777 50.064738 72.923691 98.985399 104.813912 124.610055 151.518113 176.200895 207.681612
4 1999 BEL 144.116667 17.378836 47.292328 69.284392 93.743386 97.434921 117.550794 142.859524 168.600529 194.963492
... ... ... ... ... ... ... ... ... ... ... ... ...
1613 2019 VIE 202.687500 24.350000 48.100000 72.316667 95.541667 100.666667 119.141667 142.841667 166.483333 192.325000
1614 2019 WLS 238.513333 28.333333 56.900000 85.755556 114.477778 120.655556 143.761111 173.461111 202.966667 233.827778
1615 2019 YEM 291.883333 31.583333 64.800000 98.333333 132.183333 139.583333 167.016667 202.716667 240.600000 276.733333
1616 2019 ZAM 153.538333 25.283333 50.983333 77.038889 102.533333 108.111111 128.777778 155.600000 182.977778 212.466667
1617 2019 ZIM 289.350000 31.766667 62.866667 94.200000 125.516667 132.316667 157.166667 189.900000 226.216667 269.466667

1618 rows × 12 columns

We've decided to restrict the data to the 20 fastest nationalities by average 5th percentile finish time.

In [233]:
top20 = (
    df.groupby('nationality')['time_5pct']
      .mean()
      .nsmallest(20)
      .index
)
df = df[df['nationality'].isin(top20)].reset_index(drop=True)

To continue, we will one-hot encode the nationality column. This is because the random forest model requires categorical data to be encoded.

In [234]:
nat_dummies = pd.get_dummies(df['nationality'], prefix='nat', drop_first=True)

Since our data is already split into training and test sets, we can directly build the feature matrix and target vector. We can then train the model on the training set and evaluate it on the test set.

In [235]:
# Build feature matrix X and target y
X = pd.concat([ df[['year']], nat_dummies, df[SPLIT_COLS] ], axis=1)
y = df['time_5pct']

# Train/test split (keep test indices for plotting)
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X, y, df.index, test_size=0.2, random_state=42
)

We can now perform hyperparameter tuning via RandomizedSearchCV.

In [236]:
# Hyperparameter tuning via RandomizedSearchCV
param_dist = {
    'n_estimators':    [100, 200, 500],
    'max_depth':       [None, 5, 10, 20],
    'min_samples_leaf':[1, 2, 5],
    'max_features':    ['sqrt', 'log2', 0.5]
}
rf_base = RandomForestRegressor(random_state=42)

search = RandomizedSearchCV(
    rf_base,
    param_dist,
    n_iter=30,
    cv=5,
    scoring='r2',
    n_jobs=-1,
    random_state=42
)
search.fit(X_train, y_train)

display("Best hyperparameters:", search.best_params_)
'Best hyperparameters:'
{'n_estimators': 500,
 'min_samples_leaf': 1,
 'max_features': 0.5,
 'max_depth': 5}

We can now train the model on the training set and evaluate it on the test set.

In [237]:
# Fit final model with best params
rf_final = RandomForestRegressor(
    **search.best_params_, 
    random_state=42,
    n_jobs=-1
)
rf_final.fit(X_train, y_train)

# Evaluate on the test set
y_pred = rf_final.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2  = r2_score(y_test, y_pred)

display(f"Final MSE: {mse:.2f}")
display(f"Final R² : {r2:.3f}")
'Final MSE: 113.53'
'Final R² : 0.671'

We can see that the model had a final MSE of 113.53 (sqrt(113.53) = 10.66 is small relative to the 120-200 range of the 5th percentile finish time) and an R² of 0.671. Even though the model is not perfect, it is still a good model.

Now, we can plot the actual vs predicted 5th percentile finish time by nationality.

In [238]:
# Prepare DataFrame for plotting colored by nationality
df_vis = pd.DataFrame({
    'actual':      y_test,
    'predicted':   y_pred,
    'nationality': df.loc[idx_test, 'nationality']
})

# Plot Actual vs Predicted
plt.figure(figsize=(10, 6))
for nat, grp in df_vis.groupby('nationality'):
    plt.scatter(grp['actual'], grp['predicted'], alpha=0.7, label=nat)
mn, mx = df_vis[['actual','predicted']].min().min(), df_vis[['actual','predicted']].max().max()
plt.plot([mn, mx], [mn, mx], 'k--', lw=1)
plt.xlabel("Actual 5th-Percentile Time (min)")
plt.ylabel("Predicted 5th-Percentile Time (min)")
plt.title("RandomForest Predictions by Nationality (Top 20)")
plt.legend(bbox_to_anchor=(1.05,1), loc='upper left')
plt.tight_layout()
plt.show()
No description has been provided for this image

The graph has some interesting outliers, but overall, the model is able to predict the 5th percentile finish time by nationality fairly well. Nations with traditionally strong depth (notably Kenya, Ethiopia, etc.) tend to show faster mid-race splits and the model nails their 5th percentile finish time more precisely. Smaller, less developed nations like Zimbabwe and Guam show slower mid-race splits and a higher spread in predicted 5th percentile finish times but the forest still captures the general trend.

Lastly, we can plot the top 20 feature importances.

In [239]:
# Plot Top 20 Feature Importances
importances = pd.Series(
    rf_final.feature_importances_,
    index=X.columns
).nlargest(20).sort_values(ascending=True)

plt.figure(figsize=(6,8))
importances.plot(kind='barh')
plt.title("Top 20 Feature Importances")
plt.xlabel("Importance")
plt.tight_layout()
plt.show()
No description has been provided for this image

The top predictors are the mid-to late-race splits, especially the 25km and 10km marks and half-marathon time. This tells us that how a national cohort performs is highly dependent on how they perform in the second half of the race. Year is also a strong predictor, which is expected. Nationality dummies are also within the top 20, meaning that once we have accounted for split performance, simply being from a certain country can have a relative impact on performance.

The fact that splits dominate over nationality suggests that in-race pacing data is the best predictor for how deep the field will be, much more so than the nationality of the runners.

6. Insights and Conclusions¶

Our analysis of the Berlin Marathon data from 1999 to 2019 has revealed several key insights:

  1. There is a statistically significant gender difference in marathon performance. Males finish the marathon notably faster than females, with a mean time of 3.92 hours for males compared to 4.38 hours for females — a gap of 27.33 minutes. The difference is reinforced by similar disparities in the medians. A t-statistic of -146.88 and an extremely low p-value (< 0.0001) confirm that this result is highly statistically significant and unlikely due to random variation.

  2. Runners tend to slow down over the course of the marathon. Analysis of the 5k split paces from 1999 to 2019 shows a clear trend: participants start the race at a faster pace and gradually slow down in the later segments. This pacing pattern is typical in long-distance running, where early energy and adrenaline lead to quicker initial splits, followed by a steady decline in speed as fatigue sets in. The box plot of paces across the race confirms this slowdown, with later splits showing slower and more variable times. This trend highlights the physical and mental challenges of maintaining a consistent pace throughout the marathon.

  3. The 5th percentile finish time is a more accurate representation of the performance of the fastest runners, and therefore more useful for predicting the future. The split times are a better predictor of performance than the nationality of the runners.

Limitations and Future Work¶

While our analysis provides valuable insights, there are some limitations to consider:

  1. The lack of individual context or conditions. The dataset does not account for personal factors such as age, injuries, training history, weather conditions on race day, or pacing strategies, all of which can significantly impact performance and pace changes.

  2. The dataset focuses on the Berlin Marathon only, which may not fully represent broader trends across other marathons or regions. Results might vary significantly in different locations or under different race conditions.

  3. The dataset does not account for the fact that the Berlin Marathon is a lottery race, which means that the runners are not all the same. This is why we see such a large disparity in the performance of the runners from different countries.

Future work could address these limitations by:

  1. Future analyses could include personal factors such as age, health status, training regimes, and specific race day conditions to better understand their impact on performance and pacing patterns.

  2. By analyzing marathon data from multiple locations and years, it would be possible to compare trends across different environments, and analyze how similar running patterns are across multiple races and whether outside factors such as location have a significant impact.

References¶

  1. You can learn more about the Berlin Marathon here.