Preliminary Estimates for Enplanements from TSA Checkpoint Inspections Using Correlational Models and Linear Regression
Abstract
Passenger enplanement numbers are collected on a monthly basis and published 3-4 months after the month has ended. Transportation Security Administration (TSA) checkpoint screening is closely linked to passenger enplanement, and TSA data is available a day after the month has ended, making it a suitable proxy for predicting enplanement numbers. In the 2022-2023 data, correlations between monthly enplanements and screenings are strong, and simple linear regression is an effective modeling framework for understanding the relationship between enplanements and screenings. Predictive performance is assessed on data from the first three months of 2024 and via cross-validation. Extensions to the modeling framework and future directions are considered.
Key Words: Linear Regression, Correlational, TSA, T100, Aviation
1. Introduction
Enplanement refers to an individual passenger boarding a plane at a designated airport. US passenger enplanement numbers are collected at a monthly cadence and typically released 3-4 months after the fact. Every passenger boarding a plane must pass through a Transportation Security Administration (TSA) checkpoint at the airport of entry to air travel. National TSA screening numbers are aggregated at a daily level and published the following day, hence monthly TSA checkpoint screenings may be tabulated a day after the month has ended. The close connection between screenings and enplanements alongside the early availability of screening data relative to enplanement numbers make TSA screenings a suitable predictor of passenger enplanements. This technical report covers the use of simple linear regression models to predict passenger enplanements from TSA screenings with high accuracy.
To understand the current relationship between TSA security screenings and passenger enplanements, we examine data from 2022 onward. The year 2019 is found to have similar characteristics to 2022-2023, and so it is included in the non-pandemic data analysis. The COVID-19 pandemic brought lockdowns and quarantine measures, creating major shocks to existing aviation transport patterns. The year 2020 saw massive dips in air travel and passenger enplanements. Starting in 2022, US air travel returned to near normal, with North American passenger numbers attaining 94% of 2019 ridership [1]. In the Discussion and Further Directions section, we will discuss prior years' data and its implications. The following Data section provides an overview of the data sources and characteristics. In the Linear Regression section, the linear regression model and pertinent facts are reviewed. The Model Diagnostics section provides an overview of diagnostics used to evaluate the linear regression models. The Model Fit and Prediction section covers the performance of models trained on different data. The Discussion and Further Directions section covers the full data set, seasonality, key takeaways, and extended ideas. Finally, the Gallery and Tables sections report graphs and statistical tables corresponding to the linear models tested.
* Bureau of Transportation Statistics, US Department of Transportation
2. Data
The Form-41 Air Carrier Statistics database, also referred to as the T100 data bank, collects air carrier traffic data for domestic and international flights. The database contains several tables with data on the domestic, international, and combined markets. The T100 domestic market data contains information on commercial airplane flights in the United States with domestic origin and destination points. The T100 international market data covers US flights with an international endpoint. The combined market data aggregates domestic and international numbers. Air carriers report flight data to the Office of Airline Information (OAI) at the Bureau of Transportation Statistics (BTS), which compiles and stores the data. Due to legal restrictions, data for a given month is not released until 3-4 months after collection. The T100 market data is publicly available on the BTS TranStats page [2].
The Transportation Security Administration (TSA) is a branch of the Department of Homeland Security that conducts security screenings for air travel in the United States. All air passengers traveling through a domestic airport must pass a TSA inspection checkpoint. The TSA collects data on security screenings and publishes daily tabulations of security screening volumes on their website. These numbers are available for a given day the following day and may be accessed through the “Passenger Volumes” page on the TSA website [3].
Every air passenger originating in the United States must pass through a security screening. However, TSA security inspection numbers diverge from passenger enplanements in several ways. TSA security checks include screenings of airport employees and security staff, who generally do not board planes. Flight crews are also screened, but their boarding is not counted as a passenger enplanement. Furthermore, individuals may be subject to more than one screening if they leave and come back to the airport. On the other hand, when a traveler has multiple layovers with connecting flights, they board multiple aircraft, resulting in several enplanements. But assuming they never leave the connecting airports, they will only be subject to a single screening at the first airport, where their air travel began. Hence TSA screenings and passenger enplanements are not one-to-one. Nevertheless, given that the majority of TSA inspections are generated by flying passengers, there is a close association between enplanements and inspections. The rapid release of TSA screening numbers and the delayed release of passenger enplanement data make TSA screenings a suitable proxy for the advance prediction of enplanement volumes.
3. Linear Regression
Regression analysis was first introduced by Sir Francis Galton in the late 1800s for predicting the height of offspring from the heights of parents. Simple linear regression is a technique for predicting a continuous numerical output $Y$ from a predictor variable $X$. Each individual observation is denoted by a subscript $i = 1, \dots, n$ so that each observation is a pair of data points $(X_i, Y_i)$. The functional form of the model is given as:

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i = \bar{Y}_i + \varepsilon_i \tag{1}$$

The quantity $\bar{Y}_i = \beta_0 + \beta_1 X_i$ is the predicted value of $Y_i$ given the predictor value $X_i$. Then $\varepsilon_i = Y_i - \bar{Y}_i$ is the error, a random noise term describing the discrepancy between the predicted value and the observed value.
Greek letters are conventionally used to denote the "true" model, that is, the theoretical population-level model for all $N$ data points in the population. In practice, a linear regression model is fitted to a collected sample of data, say $(X_i, Y_i)$ for $i = 1, \dots, n$ with $n < N$. This fit to sample data is usually accomplished with the ordinary least squares (OLS) algorithm. By convention, parameter estimates from fitting the regression model to the sample data are denoted $b_0$ and $b_1$. If the following conditions hold:
- (linearity) The relationship between $X$ and $Y$ is captured by equation (1), and $\beta_0$ and $\beta_1$ are constant
- (zero-mean errors) The mean of the errors is zero: $E[\varepsilon_i] = 0$ for all $i$
- (homoscedasticity) The errors have constant variance: $\mathrm{var}(\varepsilon_i) = \sigma^2$ for all $i$
- (uncorrelated errors) The errors are uncorrelated: $\mathrm{corr}(\varepsilon_i, \varepsilon_j) = 0$ for all $i \neq j$
Then we may invoke the Gauss-Markov theorem, which provides performance guarantees for sample estimators. Specifically, the Gauss-Markov theorem states that under these conditions, the OLS parameter estimates are the best linear unbiased estimators (BLUE). This means that the OLS estimators $b_0$ and $b_1$ are unbiased, $E[b_0] = \beta_0$ and $E[b_1] = \beta_1$, and have the minimum variance among all linear unbiased estimators. Normality is not necessary for this optimality but is often assumed for convenience. Under the assumption of normally distributed errors, the classical statistical tests, p-values, and confidence intervals are valid. Under the assumption of normality, the OLS, maximum likelihood estimate (MLE), and method of moments estimate (MME) approaches all yield the same regression equation.
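As a rough illustration of the OLS fit described above, the following sketch computes the closed-form estimates of the intercept and slope for simple linear regression. The function name `fit_simple_ols` and the demo arrays are hypothetical and the numbers are synthetic; none of this is taken from the TSA or T100 data.

```python
import numpy as np


def fit_simple_ols(x, y):
    """Return (b0, b1) minimizing the sum of squared residuals of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_dev = x - x.mean()
    # Slope: sum of cross-deviations over sum of squared x-deviations.
    b1 = np.sum(x_dev * (y - y.mean())) / np.sum(x_dev ** 2)
    # Intercept: the fitted line passes through the point of means.
    b0 = y.mean() - b1 * x.mean()
    return b0, b1


# Synthetic, illustrative values only (not TSA or T100 figures).
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_demo = 2.0 + 3.0 * x_demo  # exact line, so OLS should recover it
b0, b1 = fit_simple_ols(x_demo, y_demo)
```

Because the demo data lie exactly on a line, the estimates coincide with the generating intercept and slope; with noisy data they would only approximate them.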
Multiple linear regression refers to a linear regression model with more than one predictor variable. The regression model is then denoted $Y_i = \beta_0 + \beta_1 X_{1,i} + \dots + \beta_p X_{p,i} + \varepsilon_i$. Here $i$ is the observation index, the index 0 refers to the intercept, and the indices $1, \dots, p$ refer to the $p > 1$ predictors. Under the same conditions as simple linear regression, the Gauss-Markov theorem still holds and provides a suitable estimator.
A final note on regression models is relevant to this modeling exercise. When the $X$ values are not known constants fixed by the researcher but are random variables, as with the TSA screening counts, we formally have a correlational model between $X$ and $Y$ as opposed to a classical regression model setup. In such cases a linear regression model is still appropriate provided the following two conditions hold on $X$ and $Y$ [4]:
- (conditional normality) Even if $X$ and $Y$ are not bivariate normal, the conditional distribution of $Y_i$ given $X_i$ is normal with mean $\beta_0 + \beta_1 X_i$ and constant variance $\sigma^2$
- (uncorrelated predictors and parameters) The $X_i$ are independent random variables and their distribution does not depend on the regression model parameters $\beta_0$, $\beta_1$, and $\sigma^2$
4. Model Diagnostics
Using $\bar{Y}_i = \beta_0 + \beta_1 X_i$ to denote the population model estimate of $Y_i$, we denote the sample model estimate by $\hat{Y}_i = b_0 + b_1 X_i$. The corresponding sample error may be denoted $e_i = Y_i - \hat{Y}_i$. Three important quantities for model diagnostics are the sum of squared errors (SSE), the sum of squares due to regression (SSR), and the sum of squares total (SST). With $\bar{Y}$ (without a subscript) denoting the overall mean of $Y$, these quantities are defined as follows:

$$\mathrm{SSE} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2, \qquad \mathrm{SSR} = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2, \qquad \mathrm{SST} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \mathrm{SSR} + \mathrm{SSE}$$

so that these squared quantities give us a decomposition of the total variation of the $Y_i$ around their mean $\bar{Y}$ into the component explained by the model (SSR) and the component that is residual to the model (SSE).
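The decomposition can be checked numerically. The sketch below assumes the standard definitions (SSE as the sum of squared residuals, SSR as the sum of squared deviations of the fitted values from the mean, SST as the sum of squared deviations of the observations from the mean) and uses made-up numbers; the helper name `sums_of_squares` is hypothetical.

```python
import numpy as np


def sums_of_squares(x, y):
    """Return (SSE, SSR, SST) for the OLS line of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    b1, b0 = np.polyfit(x, y, 1)  # polyfit returns the slope first
    y_hat = b0 + b1 * x
    sse = np.sum((y - y_hat) ** 2)          # residual variation
    ssr = np.sum((y_hat - y.mean()) ** 2)   # variation explained by the line
    sst = np.sum((y - y.mean()) ** 2)       # total variation around the mean
    return sse, ssr, sst


# Synthetic, illustrative values only.
x_demo = [1.0, 2.0, 3.0, 4.0, 5.0]
y_demo = [2.1, 3.9, 6.2, 7.8, 10.1]
sse, ssr, sst = sums_of_squares(x_demo, y_demo)
r_squared = ssr / sst  # proportion of variation explained
```

For an OLS fit with an intercept, `sst` equals `ssr + sse` up to floating-point error, which is what makes the ratio `ssr / sst` interpretable as a proportion.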
One of the diagnostic parameters of interest calculated from the squared sums is the coefficient of determination $R^2$. It represents the proportion of variation in $Y$ that can be explained, in a linear fashion, by $X$ and is given by:

$$R^2 = \frac{\mathrm{SSR}}{\mathrm{SST}} = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}$$
As the terminology suggests, in simple linear regression the $R^2$ value is the square of $r$, the correlation between $X$ and $Y$. The Pearson correlation coefficient, or simply correlation, is a measure of linear association between two variables. It ranges from $-1$ to $1$, with a value of 0 indicating no linear relationship (there may still be nonlinear associations) and a value of $1$ ($-1$) indicating a perfect, positive (negative), deterministic linear relationship. In the simple linear regression model the correlation between $X$ and $Y$ has a direct relationship with the OLS slope coefficient of the model $b_1$:

$$b_1 = r \, \frac{\sigma_Y}{\sigma_X}$$

where $\sigma_X$ and $\sigma_Y$ denote the standard deviations of $X$ and $Y$ respectively. Hence when $r = 0$ the slope is 0 and there is no linear relationship between $X$ and $Y$; when $r = \pm 1$ there is a perfect linear relationship between $X$ and $Y$.
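The link between the correlation and the slope can be verified on a small example. This sketch assumes the standard identity that the OLS slope equals the correlation multiplied by the ratio of the standard deviation of Y to that of X, and that the coefficient of determination equals the squared correlation in simple regression. The data are synthetic.

```python
import numpy as np

# Synthetic, illustrative values only.
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y_demo = np.array([2.3, 3.8, 6.1, 8.2, 9.7, 12.4])

r = np.corrcoef(x_demo, y_demo)[0, 1]      # Pearson correlation
b1, b0 = np.polyfit(x_demo, y_demo, 1)     # OLS slope and intercept

# Sample standard deviations; the ratio is the same for ddof=0 or ddof=1.
s_x = np.std(x_demo, ddof=1)
s_y = np.std(y_demo, ddof=1)
slope_from_r = r * s_y / s_x

# Coefficient of determination computed from the residuals.
y_hat = b0 + b1 * x_demo
r_squared = 1.0 - np.sum((y_demo - y_hat) ** 2) / np.sum((y_demo - y_demo.mean()) ** 2)
```

Here `slope_from_r` matches `b1` and `r_squared` matches `r ** 2` up to rounding, which is why a strong correlation between screenings and enplanements translates directly into a well-fitting regression line.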
In the setting of multiple linear regression with $n$ observations and $p$ predictors, a modified version of $R^2$, the adjusted $R^2$, is used:

$$R^2_{\mathrm{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$$

Inspection of the formula shows that as $p$ increases, with other values held constant, the adjusted $R^2$ will decrease. Adding variables to the model will never decrease the regular $R^2$ and will almost always lead to an increase. The adjusted $R^2$ therefore penalizes the addition of predictors. Hence, if the adjusted $R^2$ increases, it signifies that the additional explanatory power overcomes the penalty for variable addition.
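A short worked example may make the penalty concrete. It assumes the usual definition of adjusted R-squared, 1 - (1 - R²)(n - 1)/(n - p - 1), with n observations and p predictors. The sample size and R² values below are hypothetical and are not taken from the models in this report.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for a model with n observations and p predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Hypothetical values: the same raw R-squared with one versus three predictors.
adj_one_predictor = adjusted_r2(0.95, n=36, p=1)
adj_three_predictors = adjusted_r2(0.95, n=36, p=3)

# A richer model must raise the raw R-squared enough to offset the penalty.
adj_three_better_fit = adjusted_r2(0.97, n=36, p=3)
```

With the raw R² held fixed, the three-predictor value is lower than the one-predictor value; only when the extra predictors raise the raw R² sufficiently (as in the last line) does the adjusted value improve.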
The $R^2$ and adjusted $R^2$ provide a sense of the strength of model fit to the sample data. In some contexts, out-of-sample prediction accuracy is the objective of regression modeling. Here another notion of $R^2$, the predicted $R^2$, is used to assess predictive performance. An out-of-sample model validation technique, leave-one-out cross-validation (LOOCV), is used to obtain the predicted $R^2$. Each individual point is removed from the dataset in turn. The model is fit to the remaining data points, then used to predict the output value for the observation that was removed. When the procedure is complete, for each point the predicted output value is subtracted from the observed output and the difference is squared. The sum of these values is the predicted residual error sum of squares (PRESS). In the predicted $R^2$, the PRESS replaces the SSE in the $R^2$ formula:

$$R^2_{\mathrm{pred}} = 1 - \frac{\mathrm{PRESS}}{\mathrm{SST}}$$

This provides a notion of the predictive explanatory power, the power to predict out-of-sample observations.
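The leave-one-out procedure can be sketched in a few lines. The version below refits a simple linear model once per held-out observation, accumulates PRESS, and reports 1 - PRESS/SST, which is assumed here to be the predicted R-squared formula. Function names and data are illustrative only.

```python
import numpy as np


def loocv_predictions(x, y):
    """Predict each y[i] from a line fitted to all other observations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    preds = np.empty(len(y))
    for i in range(len(y)):
        keep = np.arange(len(y)) != i             # drop observation i
        b1, b0 = np.polyfit(x[keep], y[keep], 1)  # refit without it
        preds[i] = b0 + b1 * x[i]                 # predict the held-out point
    return preds


def predicted_r2(x, y):
    """Return 1 - PRESS / SST using leave-one-out predictions."""
    y = np.asarray(y, dtype=float)
    press = np.sum((y - loocv_predictions(x, y)) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1.0 - press / sst


# Synthetic, illustrative values only.
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y_demo = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.7])
pred_r2 = predicted_r2(x_demo, y_demo)

# In-sample R-squared for comparison.
b1, b0 = np.polyfit(x_demo, y_demo, 1)
in_sample_r2 = 1.0 - np.sum((y_demo - (b0 + b1 * x_demo)) ** 2) / np.sum(
    (y_demo - y_demo.mean()) ** 2
)
```

For OLS the leave-one-out residuals are never smaller in magnitude than the ordinary residuals, so `pred_r2` cannot exceed `in_sample_r2`; a large gap between the two is a sign of overfitting or of influential points.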
In addition to the $R^2$ values commonly used in regression diagnostics, and given that this paper considers forecasting applications of regression, a popular forecasting metric, the mean absolute percent error (MAPE), is used to assess forecast accuracy. The absolute percentage error of a prediction $\hat{Y}_i$ for an observation $Y_i$ is:

$$\mathrm{APE}_i = \left| \frac{Y_i - \hat{Y}_i}{Y_i} \right| \times 100\%$$

The MAPE is then just the average of the absolute percentage errors over a series of $m$ predictions:

$$\mathrm{MAPE} = \frac{1}{m} \sum_{i=1}^{m} \mathrm{APE}_i$$
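A minimal sketch of the MAPE calculation follows, assuming the usual definition: the absolute error divided by the absolute observed value, averaged over the predictions and expressed in percent. The numbers are invented for illustration.

```python
import numpy as np


def mape(actual, predicted):
    """Mean absolute percent error, in percent."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return 100.0 * np.mean(np.abs((actual - predicted) / actual))


# Synthetic values: absolute percentage errors of 10%, 5%, and 0%.
demo_mape = mape([100.0, 200.0, 400.0], [110.0, 190.0, 400.0])
```

Because each error is scaled by the observed value, the metric is undefined when an observation is zero and can become very large when observations are close to zero.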
5. Model Fit and Prediction
To begin the analysis and gauge the appropriateness of simple linear regression, correlations were calculated between the national TSA inspection checks and the full, domestic, and international T100 market data. Data from 2022 and 2023 was used to avoid issues with the pandemic era data. The correlations were 99.51%, 98.53%, and 93.35% for the full, domestic, and international data respectively. Initial modeling and analysis showed that 2019 data followed a similar pattern to 2022 and 2023, so 2019 was included in the linear model training set; hence the years 2019 and 2022-2023 can be thought of as "normal" or "typical" ranges of activity. For this data, in addition to a simple linear model, a model with seasonality was also tested for all three T100 market datasets. For each data set, seasonality was assessed from the residuals of the simple linear regression model. Note that this is residual seasonality not accounted for by the linear regression; the bulk of the seasonal patterns in aviation travel is reflected in the TSA check data and hence accounted for in the regression model.
Next, the data was modeled with the 2020 and 2021 data included. The 2020 and 2021 data provide an example of how aviation data might behave during a crisis that drastically lowers air travel. Here a linear model is tested alongside a quadratic model to provide greater flexibility in modeling the larger range of data observations. However, data in the mid-lower to lower ranges of the dataset are sparse. Additionally, future crises may introduce shocks that differ in manner and degree of effect on air travel.
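The comparison between a linear and a quadratic specification can be sketched as follows. This is only an illustration with synthetic numbers that span a wide range, as the pandemic era data do; it is not the fitting code or the data used for the reported models, and the helper name `polynomial_sse` is hypothetical.

```python
import numpy as np


def polynomial_sse(x, y, degree):
    """Fit a polynomial of the given degree by least squares; return (coefs, SSE)."""
    coefs = np.polyfit(x, y, degree)      # highest-order coefficient first
    residuals = y - np.polyval(coefs, x)
    return coefs, np.sum(residuals ** 2)


# Synthetic predictor values covering low and high activity ranges.
x_demo = np.array([5.0, 10.0, 20.0, 40.0, 60.0, 70.0, 75.0, 80.0])
# Synthetic response with mild curvature (coefficients chosen arbitrarily).
y_demo = 1.0 + 0.9 * x_demo + 0.002 * x_demo ** 2

lin_coefs, lin_sse = polynomial_sse(x_demo, y_demo, 1)
quad_coefs, quad_sse = polynomial_sse(x_demo, y_demo, 2)
```

The quadratic fit can never have a larger in-sample SSE than the linear fit because the linear model is nested within it, so the adjusted and predicted diagnostics are the more informative comparison between the two.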
For each model trained, the statistical fit was assessed and the adjusted $R^2$ was examined. The fitted model was then used to predict the first 3 months of 2024, and a MAPE was calculated from these predictions. To further assess predictive potential, LOOCV was used with the 2024 data included. From this cross-validation procedure, a MAPE and a predicted $R^2$ were calculated.
For the full T100 market data, a linear model trained on 2019, 2022, and 2023 produced an adjusted $R^2$ of 98.96% and had a MAPE of 0.89%. Then including 2024 data and using cross-validation yielded a MAPE of 0.95% and a predicted $R^2$ of 98.88%. The residual plots appeared to show three distinct seasons: a high travel period June through August, a low travel period October through December, and intermediate air travel January through May and September. Incorporating these distinct seasons into the model yielded an adjusted $R^2$ of 99.32% and a MAPE of 0.81%. Cross-validation returned a MAPE of 0.82% and a predicted $R^2$ of 99.19%.
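One way to incorporate such seasons is through indicator (dummy) variables added to the regression. The sketch below uses the month grouping described for the full data, with the intermediate season as the baseline. The indicator coding, the choice of baseline, the function names, and all numbers are assumptions made for illustration, not the specification used for the reported models.

```python
import numpy as np

# Month grouping described for the full T100 data; months not listed
# fall into the intermediate season, used here as the baseline (assumption).
HIGH_MONTHS = {6, 7, 8}
LOW_MONTHS = {10, 11, 12}


def seasonal_design(x, months):
    """Design matrix with columns: intercept, x, high-season flag, low-season flag."""
    x = np.asarray(x, dtype=float)
    s_high = np.array([1.0 if m in HIGH_MONTHS else 0.0 for m in months])
    s_low = np.array([1.0 if m in LOW_MONTHS else 0.0 for m in months])
    return np.column_stack([np.ones(len(x)), x, s_high, s_low])


def fit_seasonal(x, y, months):
    """Least-squares coefficients for the seasonal model."""
    design = seasonal_design(x, months)
    coef, *_ = np.linalg.lstsq(design, np.asarray(y, dtype=float), rcond=None)
    return coef


# Two synthetic years of monthly data (not TSA or T100 figures).
months_demo = list(range(1, 13)) * 2
x_demo = np.linspace(50.0, 80.0, 24)
true_coef = np.array([5.0, 0.9, 2.0, -1.5])  # arbitrary generating values
y_demo = seasonal_design(x_demo, months_demo) @ true_coef
coef = fit_seasonal(x_demo, y_demo, months_demo)
```

Each seasonal coefficient is interpreted as a shift in the intercept relative to the baseline season, after accounting for the screening count.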
Including the pandemic era data for the full T100 market numbers, the linear model returned an adjusted $R^2$ of 99.84% and a MAPE of 0.71%. The cross-validation procedure yielded a MAPE of 2.2% and a predicted $R^2$ of 99.83%. Then the quadratic model was assessed, returning an adjusted $R^2$ of 99.87% and a MAPE of 0.9%. The cross-validation procedure had a MAPE of 1.72% and a predicted $R^2$ of 99.85%.
With the T100 domestic market data, the simple linear regression model yielded an adjusted $R^2$ of 97.09% and a MAPE of 2.74%. Cross-validation MAPE and predicted $R^2$ were 1.52% and 96.49%. Residual plots from this regression appeared to show two distinct seasons: lower travel December through February and greater travel March through November. Adding this seasonality to the model yielded an adjusted $R^2$ of 98.5% and a MAPE of 1.15%. Cross-validation produced a MAPE of 1.07% and a predicted $R^2$ of 98.26%. Incorporating the domestic enplanement data from the pandemic years, the linear regression model returned an adjusted $R^2$ of 99.03% and a MAPE of 3.49%. Cross-validation returned a MAPE of 4.88% and a predicted $R^2$ of 98.93%. The quadratic model yielded an adjusted $R^2$ of 99.29% and a MAPE of 3.38%. Cross-validation produced a MAPE of 3.01% and a predicted $R^2$ of 99.17%.
For the international enplanement data, simple linear regression yielded an adjusted $R^2$ of 83.49% and a MAPE of 6.36%. Cross-validation returned a MAPE of 6.52% and a predicted $R^2$ of 80.73%. Inspection of the residuals revealed three apparent seasons: a low travel season in March, October, and November; a high travel season in January, February, July, and August; and an intermediate season April through June and December. The incorporation of season in the model yielded an adjusted $R^2$ of 91.74% and a MAPE of 4.66%. Cross-validation produced a MAPE of 4.75% and a predicted $R^2$ of 89.81%.
The inclusion of pandemic era data in the simple linear regression model for international enplanements yielded an adjusted $R^2$ of 89.78% and a MAPE of 8.38%. Cross-validation returned a MAPE of 50.12% and a predicted $R^2$ of 89.23%. Note that the massive cross-validation MAPE for this data was largely driven by plummeting international flights during April and May of 2020 at the onset of the pandemic. Removing these two data points reduced the MAPE to about 15%. The quadratic model produced an adjusted $R^2$ of 93.55% and a MAPE of 10.95%. Cross-validation for this model produced a MAPE of 17.19% and a predicted $R^2$ of 93%.
6. Discussion and Further Directions
The results show that it is possible to forecast enplanement numbers in the normal ranges of activity with simple linear regression on TSA inspection counts. These models achieved greater than 90% accuracy as measured by MAPE (that is, a MAPE below 10%), with considerably higher accuracy for the domestic and full enplanement data. The incorporation of residual seasonality into the linear modeling produced modest albeit consistent and statistically significant improvements to model performance for all three data sets. The improvements from seasonality were stronger in the domestic model than in the full data model and stronger in the international model than in the domestic one. Ideally, more data would enable a refinement of the seasonal component; future work might, for instance, use SARIMA or other time series models to account for residual seasonal patterns.
The pandemic era data provides a unique glimpse at how a major crisis may affect air travel numbers. As with seasonality for the non-pandemic data, adding a quadratic term produced consistent and statistically significant improvements to the model. Also as with seasonality, the effect was most pronounced on the international data and least on the full data. For the full data, the quadratic term was extremely small, leaving the model close to linear; improvements were also quite modest. Improvements were greater in the domestic dataset and greater still in the international dataset. Given that international travel is more expensive, more time-consuming, and fraught with greater hassles, it stands to reason that international enplanement numbers are more sensitive to seasonal and major crisis effects.
While the pandemic era data provide a rare view of how air travel is affected by major disruptions, there is a need for more data in these lower activity ranges for greater model certainty. There may be an element of overfitting in the lower ranges with the full dataset. More data in the lower ranges of air travel activity would also enable the incorporation of seasonality in these models. Caution is advised when predicting in these non-standard ranges. Model deployment may be accompanied by change point detection to identify pattern shifts and irregularities.
Given that travel patterns will shift over time, extensions to this work may consider mechanisms to adjust for this. Weighted regression is an option, allowing for greater emphasis on more recent observations. The simplest possibility would be to use a moving window, e.g. 2-3 years of data. More complex schemes could progressively upweight more recent observations. In the current dataset, there was no indication of heteroskedasticity. Nonetheless data was sparse in certain ranges. If further data collection reveals nonconstant variance, weighted regression may also be used to deal with variable variance [4]. Another possibility is the use of Bayesian regression to incorporate past and present data. Parameter estimates from the distant past data may be incorporated into present and future estimates via prior distributions and Bayesian inference.
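The two weighting schemes mentioned above can be sketched with a weighted least squares fit for a single predictor. The geometric decay rate, the window length, the function names, and the data are all hypothetical choices made for illustration; they are not recommendations drawn from the analysis in this report.

```python
import numpy as np


def recency_weights(n, decay=0.9):
    """Weights for n observations ordered oldest to newest; the newest gets 1."""
    return decay ** np.arange(n - 1, -1, -1)


def window_weights(n, window):
    """Moving window: weight 1 for the most recent `window` points, 0 otherwise."""
    w = np.zeros(n)
    w[-window:] = 1.0
    return w


def fit_weighted_ols(x, y, w):
    """Weighted least squares intercept and slope for a single predictor."""
    x, y, w = (np.asarray(a, dtype=float) for a in (x, y, w))
    w = w / w.sum()
    x_bar, y_bar = np.sum(w * x), np.sum(w * y)  # weighted means
    b1 = np.sum(w * (x - x_bar) * (y - y_bar)) / np.sum(w * (x - x_bar) ** 2)
    return y_bar - b1 * x_bar, b1


# Synthetic series whose relationship shifts halfway through.
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y_demo = np.array([3.0, 5.0, 7.0, 8.0, 9.0, 10.0])  # y = 1 + 2x, then y = 4 + x

b0_recent, b1_recent = fit_weighted_ols(x_demo, y_demo, window_weights(6, 3))
b0_decay, b1_decay = fit_weighted_ols(x_demo, y_demo, recency_weights(6, 0.5))
b0_equal, b1_equal = fit_weighted_ols(x_demo, y_demo, np.ones(6))
```

With equal weights the fit reduces to ordinary least squares; the windowed fit recovers only the more recent relationship, and the decayed weights give a compromise that leans toward recent observations.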
With the current data, linear regression models are capable of explaining most of the variation in passenger enplanements from TSA screenings. Advanced nonlinear regression frameworks such as decision tree regression and neural network regression are likely more flexible than necessary and risk overfitting. The advent of more detailed and complex predictors could warrant the use of more flexible nonlinear regression methods. Variables such as flight cancellations and measures of airport operations may account for some of the residual discrepancy between model predictions and observed outcomes. A necessary condition for the predictive value of additional potential predictors is that they be available before the release of T100 passenger enplanement data. Ideally, such additional data would be released in tandem with TSA screenings for immediate relevance.
8. Tables
We denote each model by the data set (full, domestic, international) and by the model type (simple linear, seasonal, quadratic). The Formula column gives the functional form of the model. $Y$ denotes the passengers, $X$ represents TSA checks, and the $S$ variables are seasonal factors. The Estimates column contains point estimates and 95% confidence intervals.
The following table summarizes the fit diagnostics that also appear in the text.
The table below shows actual T100 passenger numbers for the first three months of 2024 alongside 95% confidence intervals for the predictions. This provides a sense of the uncertainty in the predictions and can be used to provide a predicted range of values in addition to a point estimate.
9. References
[1] Julie Peasley. When will air travel return to pre-pandemic levels?, 2022. URL https://www.weforum.org/agenda/2022/12/when-will-air-travel-return-to-pre-pandemic-levels/.
[2] Bureau of Transportation Statistics. Air Carriers: T-100 Domestic Market (U.S. Carriers), 2024. URL https://www.transtats.bts.gov/Fields.asp?gnoyr_VQ=FIL.
[3] Transportation Security Administration. TSA checkpoint travel numbers (current year versus prior year/same weekday), 2024. URL https://www.tsa.gov/travel/passenger-volumes.
[4] Michael H. Kutner, Christopher J. Nachtsheim, John Neter, and William Li. Applied Linear Statistical Models. Operations and Decision Sciences. McGraw-Hill Irwin, 2005. ISBN 0072386886.