Statistical Machine Learning - NBA Positional Classification
NBA Player Positional Classification using Statline Data
Background (Abstract)
If any terminology or statistics are unclear, see the Reference Resources section for details on basketball statistics, game rules, positional roles, etc.
For my first machine learning project, I was eager to apply techniques to real NBA data. Since I wanted to start with something simple, I thought it would be fun to see how well a model could predict players' positions based only on a stat line from a standalone game.
The data I used was compiled with the nba_api Python client, which turned out to be the perfect resource for this project. It gives users free, public access to the NBA's live databases for a plethora of statistics.
As it turned out, one of the more substantial efforts was simply assembling and cleaning the data that I hoped to analyze, and I have provided the code that I used in the Reproducibility section of this post.
The data I assembled was filtered by individual seasons from 2017-2023 and includes only performance features for currently active players. I excluded any physically identifying features, and scaled all enumerated statistics (i.e., non-percentage statistics) by time played in the game.
After a position of interest was selected (Center, Power Forward, Small Forward, Shooting Guard, Point Guard), a response data set corresponding to the season performance data was generated. The response array was then simplified to a 0/1 binary classification problem, where 1 indicates the position of interest and 0 indicates any other position.
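As a concrete illustration, the encoding reduces to a vectorized comparison. This is a minimal sketch, assuming the compiled dataframe has a POSITION column storing abbreviations like "C" for Center (names are hypothetical); the full version, which also handles hybrid listings like Center-Forward, appears in the Training and Test Data section.

```r
# Minimal sketch: 0/1 response for one position of interest (Center here).
# Assumes player.stats$POSITION holds abbreviations like "C" (hypothetical).
y <- ifelse(player.stats$POSITION == "C", 1, 0)
table(y)  # 1 = Center, 0 = any other position
```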
This analysis was carried out for each position, and the results generally yielded good performance and intuitive results using Logistic Regression and Linear Discriminant Analysis, which we will explore in greater detail in this blog.
Data
See the Terminology section for additional details.
In the age of analytics, the NBA has embraced and invested in the acquisition of a deluge of data. Many of the advanced statistics kept by the NBA are publicly accessible. For my purposes, I was most interested in data on in-game individual player performances, particularly anything beyond the standard player boxscore.
I used nba_api, a publicly available Python package that gives access to the NBA's real-time stats databases. The package has several data sources for advanced stats, and I ultimately utilized the following endpoints to acquire the data I was after.
- PlayerGameLog
- Standard player boxscore data for all games played by an input player
- BoxScoreHustleV2
- "Hustle" stats for all players are returned for an input game
- BoxScoreAdvancedV3
- "Advanced" stats for all players are returned for an input game
- BoxScorePlayerTrackV3
- "Player tracking" stats for all players are returned for an input game
Each endpoint required different inputs, which meant I needed to be thoughtful about how I queried the data, and that I would need to run separate data collection operations. I would also need to join and clean up the data at the end of the process to consolidate it into one usable dataset; a rough sketch of that join step follows.
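The consolidation itself is language-agnostic; here is a hypothetical sketch in R (used for consistency with the rest of the post), assuming each endpoint's output was saved to CSV and shares game/player identifier columns (all file and column names are assumptions):

```r
# Hypothetical sketch: join per-endpoint CSVs on shared game/player keys
box    <- read.csv("player_gamelogs.csv")
hustle <- read.csv("hustle_stats.csv")
adv    <- read.csv("advanced_stats.csv")

merged <- merge(box, hustle, by = c("GAME_ID", "PLAYER_ID"))
merged <- merge(merged, adv, by = c("GAME_ID", "PLAYER_ID"))
```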
These endpoints returned plenty of usable statistics. Some were redundant, and others needed to be removed manually to keep the data divorced from any physical descriptors of the players themselves (height, weight, etc.). In general, though, there were plenty of statistical categories, or features, to work with.
After the collection, joining, and cleanup of the data, the final dataset was a CSV file containing all of the currently active player individual game stat lines, amounting to about 126,000 individual player game statlines and 80MB of data. The final filtered dataset can be found here.
Data Refinement
After the CSV was generated, I imported it into R, where additional data refinement was performed prior to generating the models. The block of code below details the `Import data` section of `MATH574M Final Project SALCE R Code.Rmd`, found in the R Scripts section.
```r
library(lubridate)  # used to convert time played to seconds

# OMITTED: basic filtering of non-performance features and other
# redundancies, also removing 4 seasons prior to 2017
# NOTE: 'player.stats' is the dataframe of filtered data

# convert 'minutes' column to seconds
player.stats$minutes <- period_to_seconds(ms(format(as.POSIXct(
    parse_date_time(x = player.stats$minutes, c("HMS", "MS"))),
    format = "%M:%S")))

# remove any rows with no playing time
player.stats <- player.stats[-which(player.stats$MIN == 0), ]

# drop any rows containing NaNs
player.stats <- na.omit(player.stats)
colnames(player.stats)[grep("minutes", colnames(player.stats))] <- "seconds"

# OMITTED: additional redundant data filtering

# partition enumerated stats to divide by time played in game
# ('...' stands in for the remaining headers, omitted for cleanliness)
stats.timenormal.headers <- c("FGM", "FGA", "FG3M", "...")
stats.scaleby.time <- player.stats[, which(names(player.stats) %in%
    stats.timenormal.headers)]

# stats scaled by time; this dataframe is combined with non-enumerated
# stats in the test/train dataframe generation function
player.stats.timenormal <- stats.scaleby.time /
    player.stats[, grep("seconds", colnames(player.stats))]
```
In summary, the additional refinement consists of:
- Removing data prior to 2017 (not all stats were available until then)
- Removing games in which a player logged no playing time
- Converting playing time to seconds
- Scaling enumerated (non-percentage) stats by seconds played
- Removing any obviously redundant features
The last step is an effective "normalization" of stats to a per-second format so that all model training and test data use the same scale; for example, 10 field goal attempts over 1,800 seconds of playing time becomes roughly 0.0056 attempts per second. The refined data is then stored in the dataframe `player.stats`.
Features
After all data refinement, 76 features remained.
- 19 standard boxscore stats
- 22 “advanced” stats
- 15 “hustle” stats
- 20 “player tracking” stats
Training and Test Data
To keep the model simple, I used individual full seasons of data for training. I had 6 full seasons' worth of data and one partial season (the data was collected in November 2023, so the 2023-24 season was incomplete).
This choice, I should note, may affect how the model performs when other seasons are used as inputs for prediction; the NBA's style of play has changed year by year, which may influence the model in unseen ways. Part of what I had originally hoped to accomplish was to reveal changes in style of play over time, but that level of fidelity ultimately was not well served by the models this produced. In retrospect, randomly shuffling the data may be a better approach to remove this potential bias in future efforts.
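For reference, that alternative would look something like the following sketch: an 80/20 random split pooled across seasons, which is not the procedure used in this project.

```r
# Hypothetical alternative: random 80/20 split pooled across all seasons,
# rather than the season-based splits used in this analysis
set.seed(42)
idx <- sample(nrow(player.stats), size = floor(0.8 * nrow(player.stats)))
train.pool <- player.stats[idx, ]
test.pool  <- player.stats[-idx, ]
```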
The below chunk is an excerpt of the function used to generate the dataframes that would be used for test and training data.
```r
data.fn <- function(player.stats, player.stats.timenormal, seasons,
                    year, position, binary, threeclass) {
    # filter data by year
    stats.player.year <- player.stats[player.stats$SEASON_ID ==
        seasons[(2024 - year), ], ]
    stats.player.year.timenormal <- player.stats.timenormal[player.stats$SEASON_ID ==
        seasons[(2024 - year), ], ]
    # select all non-player characteristic data
    x <- stats.player.year[, 24:ncol(stats.player.year)]
    # remove the columns that will be replaced by time scaled data
    x <- x[, -which(names(x) %in% stats.timenormal.headers)]
    # combine raw stats with time scaled stats
    x <- data.frame(x, stats.player.year.timenormal)
    # get response data
    y <- data.frame(stats.player.year$POSITION)
    # set selected position to 1 for response array
    if (binary == TRUE) {
        # Center
        if (position == 5) {
            y[y == positions[2, ]] <- 1
            y.cv[y.cv == positions[2, ]] <- 1
        }
        # Center-Forward and Forward-Center ... OMITTED: other IF statements
        # set others to 0
        y[y != 1] <- 0
        y <- mutate_all(y, function(x) as.numeric(as.character(x)))
    }
    out.data <- list(y, x)
    return(out.data)
}
```
And here is an example script to generate a training set (note, test dataframes were generated with the same function/process; see script).
```r
# select year and position
train.year <- 2020
pos <- 3
# run data gen function
train <- data.fn(player.stats, player.stats.timenormal, seasons,
                 train.year, pos, TRUE, FALSE)
# generate training data and response
x.train <- train[[2]]
y.train <- train[[1]]
# TEST data generated similarly, using a for loop to generate data
# for the other seasons
```
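The test-set loop mentioned in the last comment might look like the sketch below, assuming the 2018-2023 start-year range shown in the data summary (this is an illustration, not the project script):

```r
# Sketch of the loop generating test sets for the remaining seasons
test.years <- setdiff(2018:2023, train.year)
test.sets <- lapply(test.years, function(yr) {
    data.fn(player.stats, player.stats.timenormal, seasons,
            yr, pos, TRUE, FALSE)
})
```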
Data Summary Statistics
Dataframe samples and features
Dataframe | Samples | Features | Description
---|---|---|---
nbaplayerdata | 126742 | 122 | full dataset captured from nba_api
player.stats | 126436 | 99 | player stats post R processing
player.stats.timenormal | 126436 | 45 | filtered and time scaled stats
x.train, x.test | see season data dimensions | 76 | test and train data
Train and Test dataframes by season
Season | Samples
---|---
2023-24 | 3255
2022-23 | 25774
2021-22 | 22704
2020-21 | 18125
2019-20 | 15836
2018-19 | 15507
Features breakdown
Stat Category | Features
---|---
Traditional Boxscore | 19
Advanced Boxscore | 22
Hustle Stats | 15
Player Tracking Stats | 20
Model Selection
My original motivation was to determine how well a model could differentiate a player's position from the other positions based solely upon performance statistics from a standalone game. Additionally, I wanted the model to have interpretable results, particularly the ability to extract the features that matter to predictions (so, the model should have some feature selection capability). Since I have intuition about the sport, interpretable results would give me insight into whether the model predicts using features I believe make sense, or whether something unseen by my own intuition actually matters in describing a positional player by performance.
Since this is a classification problem, I elected to keep things simple by approaching the models as a binary classification problem rather than multiclass. In future efforts, multiclass analysis may be worthwhile, but it was important to me to start small to ensure I could make some sense of the results. Since my goal was for the model to be interpretable and return information about features as well, any candidate models should be supervised learning methods.
Given these parameters, there were some different possibilities that I explored.
Not selected
- Adaptive LASSO
- Pros: Feature Selection, Oracle Properties
- Cons: Prediction performance suffers with highly correlated predictors (exhibited in our data)
- Adaptive Elastic Net
- Pros: Handles collinearity issue better than adaptive LASSO
- Cons: Not Oracle method, computation time, nonlinear shrinkage
- SVM
- Pros: Prediction performance
- Cons: Non interpretable (hard classifier), computationally prohibitive for data size
Selected
- Linear Discriminant Analysis (LDA)
- Pros: Soft classifier, computationally favorable
- Cons: Not robust against outliers, more assumptions about distribution of data
- Logistic Regression
- Pros: Soft classifier, feature selection (significant predictors, information criteria), parameter (β) estimates are consistent, the distribution of β̂ converges (the data set is large), more robust to outliers
- Cons: possibly less computationally favorable than LDA
As an aside, it is not too much of a stretch to assume that each feature is normally distributed; each performance statistic measures outcomes of random events in the game, with some mean and variance that can be described by the population of players at each position, so both methods make sense for proceeding with the analysis.
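To make the choice concrete, here is a minimal sketch of fitting both selected models on a season of training data (assuming the `x.train`/`y.train` dataframes from the example above; this is an illustration, not the full analysis functions described in the next section):

```r
library(MASS)  # provides lda()

# assemble a single dataframe with the binary response in column 'y'
train.df <- data.frame(y = y.train[, 1], x.train)

# LDA fit (the response is coerced to the grouping factor)
fit.lda <- lda(y ~ ., data = train.df)

# logistic regression fit
fit.glm <- glm(y ~ ., data = train.df, family = binomial)
```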
Model Evaluation and Results
For the following details, please reference `MATH574M Final Project SALCE R Code.Rmd`, available here.
LDA and Logistic Regression Training and Prediction Procedure
Here we will have a look at how the models were trained and used to generate predictions, as well as iteratively refined to optimize prediction accuracy and efficiency. Keep in mind that we are not only interested in prediction and computational performance, but also feature selection, which we will analyze in results.
- Function `lda.analysis` takes the full player stats data, the desired seasons for analysis, and the desired player position (1 through 5) as inputs. The function loops through each season, generates a trained LDA object using the `lda` function, uses that object to generate a prediction array, and compares it to the true training response array `y.train` to calculate TrainErr. A 5-fold CV error is also computed on the training data using the `LinearDA` function from the `PredPsych` package. An embedded loop then runs through the other seasons, using the trained LDA object and test data to generate prediction arrays, which are compared to the true test response arrays to calculate TestErr. Results are returned in individual dataframes; each TestErr dataframe column corresponds to a training season, and each row logs the TestErr for a test season predicted with the LDA trained on that season.
- Function `glm.analysis` uses the same procedure for model fitting and train/test error as `lda.analysis`, using the `glm` function for the logistic regression fit. The function also evaluates which features are significant in the training fit and returns the features retained and dropped at an α level, as well as the top 5 features by significance.
- Function `fs.analysis` takes the same inputs as `lda.analysis` and generates both a glm and an LDA training model object, capturing the computation time of each fit. From the glm object, the features significant at an α level are stored, and the original training and test data matrices are filtered down to only those features. The logistic regression and LDA analyses are then rerun using the same procedures as `glm.analysis` and `lda.analysis`, again capturing the computation time of the refit models. The function returns test and train errors for the rerun test data as well as all computation times.
- Function `ic.analysis` takes the same inputs as `lda.analysis` and generates a glm training model object. From the glm object, the features significant at an α level are stored, and the original training and test data matrices are filtered to only those features. Using the `bestglm` function from the `bestglm` package (which utilizes the `leaps` package), the best models based on AIC and BIC are returned, and the coefficients of the best model are stored. Note, given the size of the training data, the computation time for the `bestglm` function can vary from 5 seconds to 5 minutes, depending on how many significant features from the original glm fit were retained. The AIC and BIC models could not be generated in any reasonable amount of time without some prior feature selection, hence the initial feature screening. Due to some issues with code/time, only the computation times of the glm and LDA refits using the AIC and BIC features are captured. A sketch of this screening-and-refinement step follows the list.
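The following is a rough sketch of the error bookkeeping and feature screening described above, assuming the `fit.glm` and `train.df` objects from the Model Selection example; the 0.5 threshold and α value are illustrative, not necessarily those used in the project scripts.

```r
# training error at a 0.5 probability threshold
pred.train <- ifelse(predict(fit.glm, type = "response") > 0.5, 1, 0)
TrainErr <- mean(pred.train != train.df$y)

# screen for features significant at the alpha level
alpha <- 0.05
pvals <- summary(fit.glm)$coefficients[, "Pr(>|z|)"]
sig.features <- setdiff(names(pvals)[pvals < alpha], "(Intercept)")

# AIC/BIC refinement on the screened features; bestglm expects the
# response as the last column of its input dataframe (exhaustive
# search, so this is slow when many features survive screening)
library(bestglm)
Xy <- data.frame(train.df[, sig.features], y = train.df$y)
best.bic <- bestglm(Xy, family = binomial, IC = "BIC")
```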
Results Summary
Prediction Error Comparisons
The graphs below summarize the average training and test error over all seasons by position. The training error graph includes the mean training error and a 1-standard-error bar that captures the standard error of training error for each method by position. The test error graph includes a mean test error point, a smaller set of points for the average test error of each particular training model, and 1-standard-error bars for each of those smaller points. Although a little noisy, this captures the test standard error for each training year and gives a visual for the test error bounds of the results.
Feature Selection Summary
Next, we summarize the number of features that were retained for each method by player position, as well as which features appear in the top 5 for a position class (based on model weights) by season for at least 3 seasons for a given position. The features that are selected can give some qualitative insights to how the model is making its predictions by position.
Training Computation Time Comparisons Between Methods
A quick summary of how each method compares in terms of time to train the model.
Results Analysis
Error Comparisons
Overall, both LDA and Logistic Regression perform fairly well in identifying player position class from game performance data. We can see that, in general, feature selection does not impact performance substantially, so we can benefit computationally by reducing the number of features without a significant performance impact. The standard errors for training and test are also not alarming; the results are generally stable.
There are some clear differences in the performance by position class, and I offer some of my own intuition to at least gauge whether this seems reasonable...
- Model error performance by position
Best: Centers and Shooting Guards
For a typical basketball team, a Center is easy to spot due to their large stature. But the role they play within a game is also generally easier to discern than other positions; they are typically near the basket, shooting close and contested shots at the rim, grabbing rebounds, and on defense covering the other team's center, who is often doing the same. They are generally less athletic and less quick, and often cannot shoot well from very far away from the basket. It seems very reasonable that their performance stats are rather distinct due to overall style of play.
Shooting Guards may be a little less intuitive to a casual fan, but they often execute more specialized roles within a team's strategy. Actions will often have them playing "off ball" (they don't have the ball in their hands as much as, say, the point guard), setting screens, assisting other players, and executing "catch-and-shoot" plays more than other players. They are typically not statistically dominant on defense, generally due to physical limitations like size (they aren't putting up a ton of blocks each game, for example). So, it's not surprising that their performance statistics may be more distinct than other players'.
Fair: Power Forwards and Point Guards
Point Guards tend to have the ball the most in the game, and consequently they can have a large variety of offensive statistical performance categories. They shoot, pass, and dribble more than most other positions, and often orchestrate how plays are executed. Defensively, they are often guarding other point guards, and are consequently involved in many defensive actions as well. Given the variety of statistical categories that they may perform well in for a particular game, it's reasonable to believe that it may be more difficult to differentiate them.
Power forwards typically play similar roles to Centers, but are generally more athletic. They will often be utilized to grab rebounds, to shoot near the basket, and play defense between the basket and the three point line. There is a little more variety in how power forwards can perform since they can move around the court more as compared to centers, so the classification error rate for this position makes sense.
Worst: Small Forwards
In the modern NBA, by far the most versatile position is the small forward. They generally have enough size, skill, and athleticism to do just about anything any other position can do. A great example is LeBron James, who is listed as a small forward but can play virtually any position on the court. It's unsurprising, then, that small forwards would be the most difficult to discern from the other positions, so this result is entirely reasonable (though the model still does meaningfully better than a coin flip).
Feature Selection
We can see that the features selected for each position vary by year, which may imply that some positions are played a little differently each year. The "Top Features" chart shows the features that recur across seasons (3 or more seasons) for a given position; the results gave me qualitative confidence in the model's performance, along with some interesting insight into which features may best describe each position.
The top features for centers were field goal attempts and 3 point field goal attempts; Centers take the smallest number of shots of any player overall, and especially the smallest number of 3 pointers. Guard play features were reasonable; point guards take many field goal attempts, and generally have the ball the most, and shooting guards set many screens offensively, and also shoot the ball. There was much more variety in the forward positions, so it was a little harder to discern what performance stats were really "most descriptive", but intuitively this is not an unusual finding given the versatility at those positions across the league.
Computational Comparisons
Feature selection clearly improves computational performance over using all of the original features; however, from a computational perspective there is not a huge benefit to the further AIC and BIC refinement beyond the significant features. Given the size of the data, the computation time just to acquire the best AIC and BIC models is prohibitive anyway. Statistics for these are not captured in this report, but the time could vary from a few seconds to several minutes to recover a model that might improve feature selection by only a small handful of features. LDA and logistic regression models are fairly quick to begin with, so I felt the best value procedurally was to just capture the significant features from the logistic regression model and leave it at that.
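For anyone reproducing the comparison, fit times can be captured with R's built-in timer. A minimal sketch, reusing the `train.df` object from the earlier examples (not the project's exact timing code):

```r
# elapsed fit times for each method on the same training dataframe
t.glm <- system.time(glm(y ~ ., data = train.df, family = binomial))["elapsed"]
t.lda <- system.time(lda(y ~ ., data = train.df))["elapsed"]
c(glm = t.glm, lda = t.lda)
```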
Conclusions
Overall, LDA and logistic regression were quite reasonable models to address the original objective of player classification by standalone game performance statistics. The findings were rather surprising to me in the sense that they were much more reasonable and intuitive than I had expected, based on previous experience during the semester with these and other models/data sets. Although this problem itself is fairly novel, the framework it has provided for future analysis is of great value for my interests. There is certainly plenty of potential for further analysis of these results and beyond, so I look forward to exploring this data in greater depth. Further analysis using other approaches, such as random forests or PCA, would fit some other problems using this data nicely.
Appendix
Report and Presentation
Reproducibility
Python data compilation
The process of compiling the data was rather arduous, and given my beginner/intermediate Python skills at the time, the scripts I used were a bit haphazard, and frankly not work that I am keen to distribute as something useful. Since the scripts are an eyesore, I will instead provide some example functions that you can use to query the API in the ways that I did, to hopefully build your own, better version of what I did. That said, it's important that I make the source scripts available, if for no other reason than to save you the trouble.
They are available here, and the final datasets that I generated and used for the project are available here.
Example queries
Here's a tailored example of some queries you may want to use, plus some open-ended functions that may also be of use. This script was generated much more handily with ChatGPT and modified manually in some areas.
```python
import time
import pandas as pd
from nba_api.stats.static import players
from nba_api.stats.endpoints import playercareerstats, playergamelog
from nba_api.stats.endpoints import boxscoreadvancedv3, boxscoreplayertrackv3, boxscorehustlev2
from tqdm import tqdm


def get_active_players():
    """
    Fetch all active NBA players.

    Returns:
        list: A list of dictionaries containing active player data.
    """
    try:
        all_players = players.get_players()
        active_players = [player for player in all_players if player['is_active']]
        return active_players
    except Exception as e:
        print(f"Error fetching players data: {e}")
        return []


def fetch_player_stats(player_id):
    """
    Fetch various stats for a specific player.

    Args:
        player_id (int): The player's ID.

    Returns:
        dict: A dictionary containing the player's stats.
    """
    stats = {}
    try:
        career_stats = playercareerstats.PlayerCareerStats(player_id=player_id).get_data_frames()[0]
        stats['career'] = career_stats
    except Exception as e:
        print(f"Error fetching career stats for player ID {player_id}: {e}")
    try:
        gamelog = playergamelog.PlayerGameLog(player_id=player_id).get_data_frames()[0]
        stats['gamelog'] = gamelog
    except Exception as e:
        print(f"Error fetching game log for player ID {player_id}: {e}")
    return stats


def fetch_game_stats(game_id):
    """
    Fetch various stats for a specific game. Not called in this example
    main(), but useful for the per-game advanced/hustle/tracking queries.

    Args:
        game_id (str): The unique game ID.

    Returns:
        dict: A dictionary containing the game's stats.
    """
    stats = {}
    try:
        boxscore_adv = boxscoreadvancedv3.BoxScoreAdvancedV3(game_id=game_id).get_data_frames()[0]
        stats['boxscore_adv'] = boxscore_adv
    except Exception as e:
        print(f"Error fetching advanced boxscore for game ID {game_id}: {e}")
    try:
        boxscore_track = boxscoreplayertrackv3.BoxScorePlayerTrackV3(game_id=game_id).get_data_frames()[0]
        stats['boxscore_track'] = boxscore_track
    except Exception as e:
        print(f"Error fetching player tracking boxscore for game ID {game_id}: {e}")
    try:
        boxscore_hustle = boxscorehustlev2.BoxScoreHustleV2(game_id=game_id).get_data_frames()[0]
        stats['boxscore_hustle'] = boxscore_hustle
    except Exception as e:
        print(f"Error fetching hustle boxscore for game ID {game_id}: {e}")
    return stats


def main():
    # Get list of all active players
    active_players = get_active_players()

    # Initialize lists to hold data (the boxscore lists stay empty in this
    # example; populate them with fetch_game_stats() results per game ID)
    career_data = []
    gamelog_data = []
    boxscore_adv_data = []
    boxscore_track_data = []
    boxscore_hustle_data = []

    # Fetch stats for each player
    for player in tqdm(active_players, desc="Fetching player stats"):
        player_id = player['id']
        player_name = player['full_name']
        stats = fetch_player_stats(player_id)
        if 'career' in stats:
            stats['career']['PlayerName'] = player_name
            career_data.append(stats['career'])
        if 'gamelog' in stats:
            stats['gamelog']['PlayerName'] = player_name
            gamelog_data.append(stats['gamelog'])
        time.sleep(1)  # To avoid hitting rate limits

    # Combine all data into one DataFrame
    combined_df = pd.concat(
        career_data + gamelog_data + boxscore_adv_data + boxscore_track_data + boxscore_hustle_data,
        ignore_index=True,
        sort=False
    )

    # Save to a CSV file
    output_file = "combined_nba_player_stats.csv"
    combined_df.to_csv(output_file, index=False)
    print(f"Data saved to {output_file}")


if __name__ == "__main__":
    main()
```
R Scripts
Link back up to Data Refinement, Training and Test Data, Model Evaluation and Results
All R scripts for the project are available in the source files here. The scripts in the `MATH574M Final Project SALCE R Code.Rmd` file can be run to reproduce the data from this experiment; just be sure to run the `Import data` portion of the script in its entirety before proceeding.
Reference Resources
Link back up to Background section
There are four collections of features that will be used for training and test data.
- Traditional Boxscore - Standard summary stats for each player per game (see “NBA Box Score” above)
- Hustle Stats - Stats that describe "effort" plays not captured by the traditional boxscore (see "Hustle Stats" above)
- Advanced Boxscore - “Advanced” stats that typically summarize performance calculated by other performance inputs. Note, as we will see, some of these advanced stats, among others, are highly correlated with other features.
- Player Tracking Stats - Summary statistics calculated from in-game player positional tracking data.
- NBA Seasons will be referenced by only their start year, i.e. the "2022" season refers to the "2022-2023" season.
- Player positions may be referred to by number, where 1-PG, 2-SG, 3-SF, 4-PF, 5-C.
Terminology
Link back up to Data section
Dimensions of the data will be expressed in terms of the number of features (predictors) *p* and the number of observations *n*. A training or test data matrix, for example, will have dimensions *n* × *p*, and a response array will have dimensions *n* × 1.
- "Test Error" and "Train Error" describe the misclassification rates for model predictions against the true response. Misclassification rate is defined as

$$\frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\left(y_i \neq \hat{y}_i\right)$$

where *y* is the training or test response data and *ŷ* is the predicted response data.
- "Binary Classification" and "binary classifiers" in this report imply a 0/1 binary response assignment, rather than the −1/+1 classification typical of some methods.
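In R, the misclassification rate above reduces to a one-liner; a convenience sketch, not code from the project scripts:

```r
# misclassification rate: fraction of predictions not equal to the truth
misclass.rate <- function(y, y.hat) mean(y != y.hat)
```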