Regression Estimation

Friday, November 22, 2024

Return to 2009 Local Area Transportation Characteristics by Household (LATCH) methodology

The relationship between each dependent variable and the explanatory variables was estimated using multiple linear regression^[4].

Transportation variable (PT, VT, PMT, or VMT)	∝	HH Income or Nat. Log (HH Income)
		Count of HH Vehicles
		Count of HH Members
		Homeowner
		Number of Workers
		Life Cycle (1+C<18)
		Life Cycle (1P hh<65)
		Life Cycle (2+P hh, 0 65+)
		Life Cycle (2+P hh, 1+65+)
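
Written out as an equation, the specification above might look like the following sketch. The symbols are assumptions introduced here for illustration: y_h is the travel variable for household h, g(·) is either the identity or the natural log, the β and γ terms are coefficients, and the four life-cycle terms are treated as indicator variables.

```latex
y_h = \beta_0
    + \beta_1\, g(\text{Income}_h)
    + \beta_2\, \text{Vehicles}_h
    + \beta_3\, \text{Members}_h
    + \beta_4\, \text{Homeowner}_h
    + \beta_5\, \text{Workers}_h
    + \sum_{k=1}^{4} \gamma_k\, \text{LifeCycle}_{k,h}
    + \varepsilon_h,
\qquad g(x) \in \{\, x,\ \ln x \,\}
```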

The results of those regressions are shown in Appendix A. All of the coefficients are significant at the 5 percent significance level. Household income was entered in linear or log form according to which gave the better fit (higher adjusted R² value). A comparison of the mean values for each category with the estimates from the regression equations is shown in Table 7. The regression equations use the mean values for each of the explanatory variables in the equation for that group. All regression estimates are within a 99% confidence interval of the mean values.
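
The choice between linear and log income by adjusted R² can be sketched as follows. This is a minimal illustration, not the procedure actually run in SAS: the R² values, the sample size, and the names `candidate_fits`, `N_OBS`, and `N_PREDICTORS` are hypothetical.

```python
def adjusted_r_squared(r_squared: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R-squared for a model with an intercept and n_predictors regressors."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical inputs: these numbers are NOT taken from Appendix A.
N_OBS = 5000       # assumed number of sampled households in one group
N_PREDICTORS = 9   # income term plus the eight other regressors in the specification
candidate_fits = {"linear_income": 0.310, "log_income": 0.325}  # assumed R-squared values

# Compute adjusted R-squared for each income form and keep the larger one.
adjusted = {
    form: adjusted_r_squared(r2, N_OBS, N_PREDICTORS)
    for form, r2 in candidate_fits.items()
}
best_form = max(adjusted, key=adjusted.get)
print(best_form, round(adjusted[best_form], 4))
```

Because both candidate models have the same number of regressors and the same sample, ranking by adjusted R² gives the same ordering as ranking by R² itself; the adjustment matters only when the models differ in size.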

Due to the correlation between some of the explanatory variables, the impact of multicollinearity on the regression results was examined using principal components. The results of that analysis and the potential of using principal components as an alternative estimation technique are discussed below.

Principal Component Analysis

Several of the variables chosen to estimate household travel patterns measure similar concepts. For example, the presence or absence of persons 18 or younger in the household and the number of household members are both ways of describing the composition of a household. Because these two variables measure household composition in slightly different ways, both are included in the model. However, inclusion of both introduces multicollinearity into the model. The presence of multicollinearity is suspected given the relatively high degree of correlation between the independent variables (for an example, see Table 8).

Multicollinearity decreases the reliability of the independent variables in predicting household travel by inflating the variance of their estimated coefficients, which makes it difficult to assess whether a specific independent variable is statistically significant. To more accurately capture the effect of the independent variables, principal component analysis (PCA) was explored to collapse the explanatory variables into a smaller number of artificial variables called principal components (PCs) that are uncorrelated with one another. The PCs are then used in place of the independent variables to predict household travel.
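
To make the variance inflation concrete, the sketch below computes a variance inflation factor (VIF) for the simplest case of two correlated predictors, where VIF = 1 / (1 - r²). The VIF is a standard diagnostic that the source does not report using; the data and function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x, y):
    """VIF when a model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r * r)


# Hypothetical households: size and a 0/1 indicator for presence of a child.
household_size = [1, 2, 3, 4]
has_child = [0, 0, 1, 1]

vif = vif_two_predictors(household_size, has_child)
print(round(vif, 3))
```

With more than two predictors, r² would be replaced by the R² from regressing one predictor on all the others.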

The PCs were selected using the same independent variables and the same initial sample used to create the linear regression models for each Census region/division and urban group. The selection process was performed in SAS using proc factor (principal component analysis with varimax rotation) with a user-specified correlation matrix containing the Pearson correlation between continuous independent variables, polychoric correlation between binary and continuous independent variables, and tetrachoric correlation between binary independent variables. The life-cycle indicators were not included in the analysis. Inclusion of the life-cycle indicators requires forcing the correlation amongst the indicators to be zero (since the indicators are mutually exclusive). This causes the life-cycle indicators to load on separate PCs. Since each PC theoretically measures a distinct construct, the dispersion of the life-cycle indicators makes little theoretical sense. Theoretically, the life-cycle indicators ought to belong to the same PC, because they all measure the same construct (household life-cycle). Since this cannot occur when the indicators are forced to be uncorrelated, the life-cycle indicators are excluded so as to produce interpretable PCs from the remaining independent variables.

In each Census region/division and urban group, 70 percent or more of the variation in the independent variables could be explained by two PCs. Household size and the presence of a child loaded strongly onto the first PC across all Census region/divisions and urban groups. Across all Census region/divisions and urban groups, only one variable loaded strongly onto the second PC and onto each subsequent, less significant PC. Since more than one variable loaded strongly onto only the first PC, only the first PC was used in place of the independent variables it represented. Each of the other PCs represented a single independent variable that did not load onto the first PC and therefore provided no variable-reduction advantage. The independent variables that loaded onto these PCs were used as regressors, in their original form, together with the PC representing household size and the presence of a child, to predict each household travel measure (vehicle miles traveled, person miles traveled, vehicle trips, and person trips). The results of these principal component regression (PCR) models were compared to the results from the linear regression models with all regressors in their original specification. There was no significant improvement in the fit of the models (in terms of the AIC, BIC, or adjusted R-squared) from using PCR. This was not unexpected because multicollinearity typically does not affect the fit of the model. Instead, multicollinearity affects the reliability of the predictors. Since the goal here is not to isolate and separately measure the effect of each independent variable on household travel, linear regression models without PCs were selected after verifying that they otherwise fit the data as well as the models with PCs.
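
As a simplified illustration of replacing two correlated variables with their first PC, the sketch below handles the two-variable case, where the correlation matrix [[1, r], [r, 1]] has eigenvalues 1 + r and 1 - r in closed form. It uses plain Pearson correlation and no rotation, so it is not the proc factor procedure described above; the data and function names are hypothetical.

```python
import math
import statistics


def standardize(values):
    """Convert values to z-scores using the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def first_pc_two_variables(x, y):
    """Return first-PC scores and the share of variance the first PC explains."""
    zx, zy = standardize(x), standardize(y)
    n = len(x)
    r = sum(a * b for a, b in zip(zx, zy)) / (n - 1)  # Pearson correlation
    sign = 1.0 if r >= 0 else -1.0
    # Leading eigenvector of [[1, r], [r, 1]] is (1, sign(r)) / sqrt(2).
    scores = [(a + sign * b) / math.sqrt(2.0) for a, b in zip(zx, zy)]
    explained_share = (1.0 + abs(r)) / 2.0  # leading eigenvalue divided by 2
    return scores, explained_share


# Hypothetical households: size and a 0/1 indicator for presence of a child.
household_size = [1, 2, 3, 4]
has_child = [0, 0, 1, 1]

pc1_scores, pc1_share = first_pc_two_variables(household_size, has_child)
print([round(s, 3) for s in pc1_scores], round(pc1_share, 3))
```

The resulting `pc1_scores` would then enter a regression as a single regressor alongside the remaining variables in their original form.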

Validation

The linear regression models were evaluated for their prediction accuracy at the Census tract level. This was done by comparing the mean number of vehicle miles traveled, person miles traveled, vehicle trips, and person trips to the number calculated from the corresponding regression model. The non-public 2009 NHTS files were used to calculate the mean value of the four household travel variables in each Census tract. Predicted values were calculated using both the non-public 2009 NHTS files and the 2005-2009 American Community Survey (ACS) dataset. The 2005-2009 ACS dataset was used for evaluation rather than the 2007-2011 dataset, which was used to make travel estimates, because it uses the same statistical boundaries for Census tracts as the 2009 NHTS. This means that predicted household travel can be compared to the average estimated from households in the NHTS dataset that belong to the same Census tract^[5].

Household vehicle miles traveled, person miles traveled, vehicle trips, and person trips were predicted for each Census tract in two different ways:

(1) By calculating, from the NHTS dataset, the mean household value of each independent variable^[6] in each Census tract and inserting these means into the appropriate regression equation for that Census tract, and

(2) By inserting the value extracted from the ACS dataset for each independent variable into the appropriate regression equation for each Census tract (see Table 9 for an example)^[7].
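
Both approaches amount to evaluating a fitted regression equation at tract-level inputs. A minimal sketch, assuming hypothetical coefficient values and variable names (none are taken from Appendix A):

```python
def predict_tract_travel(coefficients, tract_values):
    """Evaluate a linear regression equation at one tract's input values."""
    prediction = coefficients["intercept"]
    for name, value in tract_values.items():
        prediction += coefficients[name] * value
    return prediction


# Hypothetical coefficients for one Census region/division and urban group.
coefficients = {
    "intercept": 10.0,
    "log_hh_income": 2.0,
    "hh_vehicles": 5.0,
    "hh_members": 3.0,
}

# Hypothetical tract-level inputs: NHTS tract means (approach 1)
# or ACS tract values (approach 2).
tract_values = {"log_hh_income": 11.0, "hh_vehicles": 2.0, "hh_members": 2.5}

predicted = predict_tract_travel(coefficients, tract_values)
print(predicted)
```

Only the source of `tract_values` differs between the two approaches; the equation and its coefficients are the same.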

The predicted values were compared to the NHTS values in all Census tracts where eight or more households were surveyed for the NHTS. This size requirement, developed in conversations with some researchers of the previous NHTS study, provides greater confidence in representing a given Census tract. However, requiring eight or more households reduces the number of Census tracts that can be evaluated for their prediction accuracy. See Tables B1 to B4 in Appendix B for the count of Census tracts in each Census region/division and urban group that had eight or more households and the necessary data for making an accuracy assessment.

To aid in assessing the quality of the models, for the Census tracts where a comparison could be made, the absolute percent error between the NHTS value and the predicted value was calculated in each Census tract, and the median of these errors was taken within each Census region/division and urban group to arrive at the median absolute percent error (MAPE) (see Tables B1 to B4 in Appendix B). The models for vehicle trips and person trips tend to predict better than the models for vehicle miles and person miles, as they show much smaller MAPEs across all Census region/divisions and urban groups. In all models, the MAPE tends to be larger in Census region/divisions and urban groups where the medians of the independent variables from the NHTS dataset are significantly different from the medians from the ACS dataset for all Census tracts included in the evaluation (see the MAPE in Tables B5 to B8 in Appendix B). These differences in the median values are a major contributor to the larger MAPEs.
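
The MAPE calculation, including the eight-household minimum, can be sketched as follows. The tract records are hypothetical, and the tuple layout (households surveyed, NHTS mean, predicted value) is an assumption made for illustration.

```python
import statistics

MIN_HOUSEHOLDS = 8  # minimum NHTS households per tract, as stated in the text


def median_absolute_percent_error(tracts, min_households=MIN_HOUSEHOLDS):
    """Median of per-tract absolute percent errors for tracts with enough households."""
    errors = [
        100.0 * abs(predicted - observed) / observed
        for n_households, observed, predicted in tracts
        if n_households >= min_households and observed != 0
    ]
    return statistics.median(errors) if errors else None


# Hypothetical tracts in one Census region/division and urban group:
# (households surveyed, NHTS mean value, model-predicted value)
tracts = [
    (12, 40.0, 44.0),
    (8, 50.0, 45.0),
    (5, 30.0, 60.0),   # excluded: fewer than eight households
    (20, 60.0, 60.0),
]

mape = median_absolute_percent_error(tracts)
print(mape)
```

Using the median rather than the mean keeps a handful of tracts with very large errors from dominating the summary.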

Travel Variable Estimates by Census Tract

After confirming that the models predict household travel well, estimates for the four household travel variables were made for all Census tracts in the U.S., with the exception of Census tracts in Manhattan^[8], using the 2007-2011 ACS. The 2007-2011 dataset was selected over ACS datasets in prior years, because it is the latest ACS release with the demographic and socio-economic data needed to make travel predictions at the Census tract level. The 2007-2011 dataset uses 2010 Census tract definitions rather than the 2000 definitions used in the 2009 NHTS. This difference precluded the 2007-2011 ACS dataset from being used to evaluate the prediction accuracy of the models but does not preclude the dataset from being used to make Census tract estimates of household travel. The models for predicting household travel were developed independently from Census tract boundaries and hence can be used for predicting household travel for any geographic entity within a Census region/division where the Census region/division ID and urban group are known.

The urban group for all Census tracts in the 2007-2011 ACS dataset was identified per the method described in the above section on the development of the urbanicity index, using 2010 Census boundaries and population information^[9]. A list of the data pulled from the 2007-2011 American Community Survey 5-year estimates data files can be found in Table 10. The ACS data were used to predict household travel. The estimates of household travel were evaluated non-spatially and spatially for reasonability. Spatial evaluation was performed first per the method below.

Spatial identification of unreliable estimates

Prior to testing for spatial reasonableness, estimates of household travel were examined, since the quality of the spatial analysis can be compromised by extreme values. Extreme estimates were defined as those less than the 1st percentile or greater than the 99th percentile of the NHTS mean value in a given Census region/division and urban group. No estimates were identified as extremes per these criteria.
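
A minimal sketch of this percentile screen is shown below. The reference values and estimates are hypothetical, and the interpolation rule (`method="inclusive"` in Python's `statistics.quantiles`) is an assumption; the source does not say which percentile definition was used.

```python
import statistics


def flag_extreme_estimates(estimates, nhts_tract_means):
    """Return estimates below the 1st or above the 99th percentile of the NHTS means."""
    # quantiles(n=100) returns the 99 cut points between the 100 percentile bins.
    cuts = statistics.quantiles(nhts_tract_means, n=100, method="inclusive")
    p1, p99 = cuts[0], cuts[98]
    return [e for e in estimates if e < p1 or e > p99]


# Hypothetical NHTS tract means for one Census region/division and urban group.
nhts_tract_means = [float(v) for v in range(1, 102)]  # 1.0, 2.0, ..., 101.0

# Hypothetical model estimates to screen.
estimates = [1.5, 50.0, 100.5]

extremes = flag_extreme_estimates(estimates, nhts_tract_means)
print(extremes)
```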

Spatial reasonableness was examined by comparing household travel estimates for a Census tract to those of its neighbors. Neighboring tracts were defined as those that share an edge or a corner with the Census tract being evaluated for spatial reasonableness. If the household travel estimate for one or more of the four variables was significantly lower or higher than that of neighboring tracts, the significantly different estimate was considered spatially unreliable. This was performed by first using ArcGIS to identify the neighbors of each Census tract and then by calculating the Moran’s I statistic for each Census tract in SAS.

The Moran’s I Statistic
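
For reference, one common form of the local Moran’s I statistic is Anselin’s formulation, given here as an assumption since the exact variant computed in SAS is not specified. In this sketch, w_ij is a spatial weight that is nonzero only when tract j is a neighbor of tract i, n is the number of tracts, and the z-score is the usual standardization of I_i by its expected value and variance under spatial randomness.

```latex
I_i = \frac{x_i - \bar{x}}{m_2} \sum_{j \neq i} w_{ij}\,(x_j - \bar{x}),
\qquad
m_2 = \frac{1}{n} \sum_{k=1}^{n} (x_k - \bar{x})^2,
\qquad
z_{I_i} = \frac{I_i - \mathrm{E}[I_i]}{\sqrt{\mathrm{Var}[I_i]}}
```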

The statistic uses the overall mean (x̄) in comparing a Census tract (x_i) to its neighbors (x_j). Since the models were developed specific to a Census region/division and urban group, the mean for the Census region/division and urban group was used in the formula rather than the overall mean. The Moran’s I statistics were evaluated for statistical significance by calculating the z-score. Negative Moran’s I statistics with statistically significant z-scores at the 99% confidence level belong to estimates that are dissimilar from surrounding values. These estimates are marked as being spatially unreliable. Only a few estimates were marked as such and suppressed.
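
The per-tract calculation can be sketched as below, under stated assumptions: row-standardized weights (each neighbor weighted equally), a group mean passed in place of the overall mean, and no z-score step. The tract IDs, values, and neighbor lists are hypothetical, and this is not the SAS code used for the estimates.

```python
def local_morans_i(values, neighbors, group_mean=None):
    """Local Moran's I per tract with row-standardized weights.

    values:     dict mapping tract id -> travel estimate
    neighbors:  dict mapping tract id -> list of neighboring tract ids
    group_mean: mean for the region/division and urban group (defaults to the
                mean of `values`)
    """
    n = len(values)
    mean = sum(values.values()) / n if group_mean is None else group_mean
    deviations = {tract: x - mean for tract, x in values.items()}
    m2 = sum(d * d for d in deviations.values()) / n  # second moment about the mean
    result = {}
    for tract in values:
        nbrs = neighbors.get(tract, [])
        if not nbrs or m2 == 0:
            result[tract] = None  # undefined without neighbors or variation
            continue
        spatial_lag = sum(deviations[j] for j in nbrs) / len(nbrs)
        result[tract] = deviations[tract] / m2 * spatial_lag
    return result


# Hypothetical tracts: t3 is much higher than both of its neighbors.
values = {"t1": 1.0, "t2": 1.0, "t3": 5.0}
neighbors = {"t1": ["t2", "t3"], "t2": ["t1", "t3"], "t3": ["t1", "t2"]}

local_i = local_morans_i(values, neighbors)
print(local_i)
```

A negative value, as for `t3` here, indicates a tract whose estimate lies on the opposite side of the mean from its neighbors; deciding whether that difference is significant still requires the z-score step described above.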

Non-spatial identification of outliers

After testing for spatial reasonableness, the estimates were evaluated further for non-spatial reasonableness by examining the distribution of the estimates. With the exception of vehicle miles traveled for five distinct Census tracts, all estimates for all variables were found to be within the range of values for the same travel variable used to create the regression model. The exceptions were resolved by suppressing all household travel estimates within the bottom and top 0.5 percent of the distribution of the estimates. This tightened the range of the estimates and resulted in all estimates being within the range of values used to create the regression model (see Tables 11, 12, 13 and 14 for final counts and distribution of the estimates after completion of spatial and non-spatial reasonableness checks and see Appendix C for a description of the final dataset).

[4] Example of SAS statement;

[5] There are a few exceptions as a few Census tracts in the 2005-2009 ACS were defined by their 2010 statistical boundaries rather than their 2000 boundaries. Only Census tracts defined by their 2000 boundaries were evaluated for their prediction accuracy since the Census tract geographic identifiers in the non-public 2009 NHTS are based on the 2000 definitions.

[6] The final household weight was used in creating the linear regression models to predict household travel but was not used in calculating the mean household characteristics for a given Census tract, since the final household weight was not intended to make households within a Census tract representative of the Census tract itself.

[7] Census Tracts with a group quarters population were excluded because the number of workers per household could not be calculated, using the 2005-2009 ACS, without including workers living in group quarters (and hence not in a household) in the count of workers.

[8] Census tracts in Manhattan were identified from the New York City Department of City Planning: (BYTES of the BIG APPLE™ - Archive (nyc.gov)) and were suppressed given the significant difference in travel behavior between those living in Manhattan and those outside of Manhattan but still in the same Census region/division.

[9] Per the U.S. Census Bureau, there are a few Census tracts in the 2005-2007 ACS with geo-identifiers that are different from the 2010 geographic definitions used for all other tracts. In a few instances, the geographic definitions remained the same but the numbering of the tract changed. In all other instances, both the geographic definitions and the numbering changed. For tracts that retained their geographic definitions but changed numbering, the new numbering was replaced with the numbering used in the 2010 geographic definitions. These tracts included: '36053940101' renumbered as '36053030101'; '36053940102' renumbered as '36053030102'; '36053940103' renumbered as '36053030103'; '36053940200' renumbered as '36053030200'; '36053940300' renumbered as '36053030300'; '36053940401' renumbered as '36053030401'; '36053940403' renumbered as '36053030403'; '36053940600' renumbered as '36053030600'; '36053940700' renumbered as '36053030402'; '36065940000' renumbered as '36065024800'; '36065940100' renumbered as '36065024700'; '36065940200' renumbered as '36065024900'.