Legacy Book

You are here

Modeling School Bus Crashes Using Zero-Inflated Model


1Department of Civil Engineering
Tennessee State University
Nashville, TN 37209

2University of North Florida

3Western Michigan University

* Corresponding author.
Tel.: +1 615 963 5430
Email address: dchimba@tnstate.edu

KEYWORDS: school bus crashes; zero-inflated model, Poisson, Negative Binomial


School bus crashes are rare, but their occurrence can have devastating effects on the school children involved. Such crashes are infrequent and random, and some roadway segments may not experience any school bus related crashes for a number of years (zero crashes). Despite the fact that no crashes may have occurred along particular stretches of road, these zero-crash road segments cannot be termed as safe sites, and they cause a dual state of crash experience (no crashes, but still at risk for crashes) compared to a single state of non-zero crash prone sections where risk is confirmed. Literature indicates that for extremely rare and random count data, such as school bus crashes, Poisson and Negative Binomial (NB) distributions become more applicable for modeling. Apart from Poisson and NB, there exists an alternative discrete distributional model that is used to model extra-zero discrete data, such as school bus crashes,that allows exploration of the impact of zero segments. This alternative modeling approach called zero-inflated negative binomial (ZINB) model is introduced in this study for evaluation of variables influencing school bus crashes. Although crash data rarely reveal variability, the ZINB model provides a more flexible modeling framework for school bus crashes. The study found that, ZINB yields better prediction (tight standard errors and higher z-statistics), compared to NB model though same variable coefficient signs.

Presence of median and outside shoulders was found to have tendency of reducing school bus crashes. On the other end, wider medians, outside shoulders, inside shoulders, and lane widths were found to reduce the probability of these crashes. Presence of curb and gutter and two-way left turn lane (TWLTLL, high posted speed limits, multilane segments, and congested segments were found to increase the probability of school bus crashes.


School bus crashes are rare, but their occurrence can have devastating effects on the school children involved. Limited studies have been published on the statistical modeling of school bus related crashes. Yang et al. (2009) indicated that findings on the few available published studies on school bus crashes and injuries vary widely depending on the source of information and study population. In the same study, which was conducted in Iowa, they found that the school bus fatality rate was 0.4 per 100 million miles traveled. The study concluded by recommending that safety of school bus transportation over other vehicles should be a factor in making school transportation policies. McGeehan et al. (2006) reported that there were an estimated 51,100 school bu–-related injuries treated in U.S. emergency departments from 2001 to 2003 and that head injuries accounted for more than half (52.1%) of all injuries among children<under 10 years of age, whereas lower extremity injuries predominated among children 10 to 19 years of age (25.5%). In their study, Lapner et al. (2003) found that head, neck, and spine are the most common injuries when children are involved in rollover school bus collisions and that additional safet, changes to the current school bus design are needed.

Based on their dramatic effects, analysis of roadway, traffic, human, and environmental factors impacting school bus crashes is needed. One of the known methodologies in studying and analyzing crash data similar to those related to school buses is through statistical analysis. Though one of the popular method, statistical evaluation of factors impacting school bus crashes becomes more challenging due to the rarity of these types of crashes. In connection to crash statistical analysis, various modeling approaches have been proposed to identify the genuine relationship between crash (in general) occurrences and roadway geometrics, traffic characteristics, environmental conditions, and human factors. In contrast, not much effort in terms of modeling has been invested in school bus related crashes. This might be contributed by many factors, one of them being data availability as many school buses use local roads whose crash data may not easily be available. Other factor may be unavailability of school bus counts as an exposure variable in calculating crash rates. Furthermore, adequate modeling methodology that takes into consideration extreme rarity of these types of crashes may have hindered study progress.

However, in the past two decades, the Poisson, negative binomial (NB, and various model extensions, such as zero-inflated, have been extensively studied and applied to modeling general crash data, Chimba et al. 2010. Fairly comprehensive reviews were given by Maher and Summergill (1996) and Lord et al. (2005). To make necessary connections with the scope of work proposed in this paper, though, several important milestones in model developments and significant progresses made in recent years are relevant to the discussion. Jovanis and Chang (1986), Joshua and Garber (1990), Chimba et al. (2010) and Miaou and Lump (1993) compared model fitness using the Poisson and multivariate linear regression models, concluding that the Poisson outperforms the multivariate linear regression model due largely to its more appropriate statistical properties for describing non-negative discrete data like crashes. However, they noted that if the crash data reveal significant over-dispersion around the estimated mean, the Poisson model becomes inadequate and more general distributional models, such as the negative binomial (NB), are needed.

Over-dispersion is a phenomenon which occurs when the model is fitted using Poisson or negative binomial. Hardin and Hilbe (2001) listed the following as the source or over-dispersion in the data or model:

  • when some important independent variables are omitted from the model;
  • when the data contains a lot of outliers resulting either from unreliable data collection or mistake and errors during data recording;
  • when the model fails to include sufficient number of interaction terms;
  • when the variable by itself is not appropriate and it needs transformation;
  • if the distribution assumed is quite different from the real distribution which relates the data, e.g., using linear model instead of quadratic.

The earliest recognition of this limitation dates back to Maycock and Hall (1984) who used the NB model for analyzing crashes at junctions. The NB distribution, also known as the Poisson-Gamma distribution, is derived by conjugating the Poisson distribution with the Gamma distribution in which the true mean is assumed to account for extra data variability (over-dispersion). The over-dispersion parameter, in an earlier stage, is assumed identical across all roadway segments, Hauer et al. (1989), Bonneson and McCoy (1993), Miaou (1994) and Shankar et al. (1997). Inspired by Cameron and Trivedi (1986), Maher and Summersgill (1996) adopted the site-dependent (indexed by i) over-dispersion factor (φi) to account for extra variability around the estimated mean (μi). The over-dispersion parameter is linked with the estimated mean in the form of φi=φμid, where d is an additional parameter that can be estimated along with all other parameters. Hauer (15) demonstrated that, when roadway segments differ in length, parameters estimated by the maximum likelihood estimation (MLE) method will be unduly influenced by very short segments if the over-dispersion parameter is assumed to be equal across all roadway segments. The most plausible remedy suggested by Hauer is to set the over-dispersion parameter in proportion to the segment length (L), i.e., for segment i, φi=φLid, where d=1. When d≠1, the sum of crash estimates from su-segments based on the empirical Bayes (EB) method will not be consistent with the estimates for the roadway as a whole. As an alternative to the estimated mean function, one could also model the over-dispersion parameter as a separate function of explanatory variables different from the mean prediction function (2003). They found that the dispersion parameter varies by site (intersection) as a function of the approach flows, the ratio of the flows from minor and major approaches, and geographical locations. Ignoring this variation, namely, treating the over-dispersion factor as a fixed parameter, can significantly undermine the fitness of the estimation. However, they found the parameter estimates only change slightly with the specific crash data used.

As a plausible extension from the Poisson family, the zero-inflated has been applied to model the over-dispersion phenomenon due to an excess of zero observations, Miaou (1994), Shankar et al. (1997) and Lee and Mannering (2002). The concept originated from a mixture of distributions, where the split parameter can be modeled as a constant (Johnson and Kotz (1969) and Washington et al (2002)) or, because it is bounded between 0 and 1, a logit function of the estimated mean (Miaou (1994) and Mullahy (2001)). The relative effectiveness of these two treatments remains to be investigated although it has been demonstrated that both of them could make great contributions toward enhancing model performance.

Zero-Inflated (ZI) Distribution

Recently, zero inflated has surfaced as a plausible approach for use in crash analysis. The use of zero-inflated has been justified from the fact that Poisson and Negative Binomial (NB) models with or without their extensions as well as several variations seem to model non-negative discrete response variables, with over-dispersion and the underlying assumption that the occurrence of crashes observed at a given time and space scale follows a Poisson process. Lord et al. (2005) challenges this assumption by arguing that the occurrence of crashes is in fact a binomial process, which can be approximated by a Poisson process when the number of trials (e.g., traffic exposure) is large with a small likelihood (risk) of crashes. This argument roots from modeling crashes as a dual-state data generation process or a series of Bernoulli trials, i.e., the outcome of a school bus entering into a roadway section is either perfectly safe, involving no crashes, or unsafe, assuming crashes do take place (Lord et al. (2005), Shankar et al. (1997) and Qin et al. (2004)). If the probability of independent events is the same for all trials, the dual-state data generation process will naturally give rise to a binomial distribution for describing school bus crash frequencies. However, the equal probability assumption is questionable in reality since the crash-involved probability varies across roadway segments as a function of drivers, traffic and roadway geometric characteristics among others; for this reason, further research into modeling the unequal probability of independent events is needed.

As discussed in previous sections, zero-inflated models are mainly used for modeling excessive zero count data. Zero count may refer to the situations where the likelihood of an event occurring is extremely rare se.g., school bus crashes) in comparison to normal expectatio, (Cameron and Trivedi (1986), Lee and Mannering (2002), and Mullay (1986)). Zero-inflated (ZI) can be modeled as Zero Inflated Poisson (ZIP) or as Zero Inflated Negative Binomial (ZINB) models. Poisson regression model probability function is given in the following form (14)

(1)   P(yi) = (e μiyi) / (yi!)  yi=0,1,2... and μ>0  

The mean parameter

E[yi / xi] = μ = exp(xi β), Variance = μ,

Where yi = a random variable representing number of school bus crashes,

xi = Parameter related to the occurrence of school bus crash (Vector of explanatory variable)

ß = the coefficient of the corresponding factor (vector of estimable parameter).

For the ZIP model, it assumes that the events yi=(y1, y2…..yN) are independent and the model is

Pr [yi=0] = ϕi + (1-ϕi)ei

(2)   Pr [yi=r] = (1-ϕ) ((ei μir)/(r!)),r=1,2...n

Where φ=proportion of zeros.

Maximum likelihood estimates are used to estimate the parameters of the ZIP regression model and confidence intervals are constructed by likelihood ratio tests.

The negative binomial (NB) model can be expressed as:

(3)   p(y)=(Γ(y+α-1))/(Γ(α-1)Γ(y+1)) (1/(1+αμ))1/α (αμ/(1+αμ))y

where the mean μ = E(y)=vexp(Xβ).

The corresponding variance is Var(y)=μ+αμ2. Similar extensions to the NB model are considered, including the zero-inflated model (ZINB) with constant and mean-dependent split parameters, and the mean-dependent over-dispersion factor.

The ZINB model with constant split parameter (Shankar et al. (1997) and Washington et al. (2002)) can be expressed as:

(4) { γ+(1-φ)Py=0=φ+(1-φ)(1/(1+αμ)) Y=0
      {(1-φ)PYmy Y=1,2,3,...

Furthermore, appropriateness of using the zero inflated model rather than the traditional model, Poisson or Negative Binomial can be tested. The common known test statistic is through Vuong's Value, estimated as shown below, Washington et al. (2002);

(5)   mi= ln {(f1(yi/Xi))/(f2(yi/Xi))}

Where f1(yi/Xi) is the probability density function for one model, say Zero-Inflated Negative Binomial, ZINB) and f2(yi/Xi) is the probability density for comparison model, say Standard Negative Binomial, NB)

V=(√n(m))/Sm, where

(6) m=Mean=[(1/n) n∑i=1 mi],

Sm=Standard Deviation, n=Sample Size, and V=Vuong's Value.

If Absolute(V)<Vcritical(1.96 for 95% Confidence Interval), the test does not support the selection of one model over the other. Large Positive values of V greater than Vcritical, e.g. V> Vcritical favor first model over second model whereas large negative values support second model.

Descriptions of Data and Variables

Data for this study originated from Tennessee Department of Transportation (TDOT) through Tennessee Roadway Information Management System (TRIMS) database system. This database includes crash data comprising attributes such as harmful event, contributing causes, injury severities, traffic characteristics and geometric characteristics among others. The data have the exact log mile where the crash occurred. Furthermore, crashes are listed if they are school bus related or not. Downloaded crashes were therefore screened to identify only those that were school bus related. School bus only crashes occurring on state roadways (SR) in Davidson and Shelby counties were considered in this study (Davidson and Shelby are the most populous counties in the state of Tennessee, hosting the largest two cities of Nashville and Memphis, respectively). Local streets ann non-state roads were not included in the analysis due to unavailability of the traffic counts, an essential element in crash modeling. A total of 493 school bus crashes for the span of eight years from 2002 to 2009 with an average of 61.6 crashes per year were gathered for the study as shown below. It should be noted that the data for year 2002 were not complete; hence, if taken out of consideration, the average crash frequency per year becomes 67.7 crashes per year.

Apart from crash data, roadway geometrics of the analyzed roadway segments were also downloaded from the TRIMS database. Roadway geometry is defined by the beginning and end log miles. Utilizing the county and route number which are present in both crash and geometric data, total number of crashes for each segment were tallied utilizing a small written computer program in Stata software. The written program had the capability of searching the route ID and segment boundaries (beginning and end log miles) in the geometric data, then matching and tallying the corresponding number of crashes by counting log miles within that particular segment. It then merges the two data into a single dataset. Overall, 1903 roadway segments ranging from 53 ft (0.01 mile) to 6.508 miles were identified for the study. As expected, most of the segments had zero school bus crashes (83.8%), followed by one crash (11.8%) and 2.89% for two crashes. Only 1.54% of the segments had more than two school bus crashes for the analyzed eight years span. The presence of excessive zero crash segments supports the use of zero inflated model as a modeling distribution for this study.

Table 1 summarizes the statistics on crash, roadway features, and categorical variables created as a derivative of other variables. For estimation purposes, some variables were modeled as indicator (categorical) variables as listed in table 1-such as posted speed limit (35 mph or below, 40 mph to 45 mph, and 50 mph to 55 mph) and so forth. In addition, a new variable called directional peak hour volume (DPHV) per lane was created to represent the traffic intensity during rush hours, which is computed as the product of AADT, directional split, and K-factor divided by number of lanes. Note that DPHV might be correlated with AADT to some extent but the causal effects are different from AADT. The former characterizes the effect of congestion on crashes, while the latter represents an exposure measure for crashes. Another variable created was vehicle miles of travel (VMT) as the product of AADT and length of the segment.

Empirical Distributions of Crashes

The crash frequency distributions (line and histogram) were first analyzed and fitted with the over-dispersed distributional models, including Poisson, NB, and their zero-inflated versions, to determine if the data was over-dispersed. As can be seen in Figure 1, NB and ZINB models closely follow the observed crash distribution compared to Poisson and ZIP. The over-dispersion factor tested highly significant, indicating the school bus crash data was over-dispersed. This was furthermore confirmed by alpha log-likelihood ratio test (in table 2).

year 2002 2003 2004 2005 2006 2007 2008 2009 Total Average/Year
(without 2002 data)
Crashes 19 72 67 69 85 69 59 53 493 67.7

In comparison, data were fitted with the non-over-dispersed models, including, Poisson, and zero-inflated Poisson (ZIP). In contrast, the Poisson gave the least desirable results. The ZIP model yielded good fit to the zero observation, but fitted the rest of observations rather poorly. This indicates that mixing the nonover-dispersed model with a simple spike mass function at zero-crash observations is not sufficient to produce a satisfactory fit when the distribution is highly skewed. Apart from the probability plots, decision of whether to use a Poisson or Negative Binomial can also be based on the dispersion parameter, σd by Poisson error structure (23) given as;

σd =  Pearsonχ2/(n-p), where n is the number of observations, p is the number of model parameters, and Pearsonχ2 is defined as

(7)   Pearsonχ2 = nΣi=1 |yi-E(yi)|2 / Var(yi)

Where yi is the observed number of accidents on section i, E(yi) is the predicted accident frequency for section i, and Var(yi) is the variance of accident frequency for section i. If σd turns out to be significantly greater than 1.0, then the data has greater dispersion than is explained by Poisson distribution, and Negative Binomial regression model is fitted to the data, Hardin and Hilbe (2001). The crash over-dispersion parameter was found to be 3.06 favoring NB over Poisson regression. The following distributions were therefore retained for final model estimation:

  • Standard NB
  • Zero Inflated Negative Binomial

TABLE 1 Study Data Summary

  Numerical variables
mean min max
Number of crashes 0.2 0 7
Segment length 0.2 0.01 6.508
Daily volume (AADT) 23,062 1,190 61,230
Percent of peak hour volume 9 4 14
Directional split 61.6 50 100
VMT 4,693 24 70,265
DPHV 370 46 1,234
Number of through lanes 4 2 8
Median width 5.8 0 60
Lane width 11.6 9 15
TWLTL width 4.3 0 24
Outside shoulder width 4.3 0 32
Inside shoulder width 0.3 0 24
Posted speed limit 40 15 55
School speed limit 16.9 15 30
  Categorical variables
Street lights 0-No lights, 1-Lights present
Posted speed limit categories 0-No posted speed limit up 35 mph, 1-(40-45 mph Posted Speed) and 2-(50-55 mph posted speed)
Presence/absence of school speed limit 0-No school speed limit, 1- School speed limit
Terrain 0-Flat, 1-Rolling or Mountain
Land use 0-Commercial, 1-Mixed residential & commercial, and 2-Residential
Presence or absence of median 0-No median, 0-There is a median
Median composition 0-No median, 1-Concrete, 2-Grass plot, 3-Painted
Presence or absence of TWLTL 0-No TWLTL, 1-There is TWLTL
Presence or absence of outside shoulder 0--No outside shoulder, 1-There is outside shoulder
Outside shoulder composition 0-Ditch, 1-Asphalt, 2-Gravel and dirt, and 3-Concrete
Presence or absence of inside shoulder 0-No inside shoulder, 1-There is inside shoulder
Presence or absence of curb & gutter 0-No curb & gutter, 1-There is curve and gutter
Presence or absence of sidewalk 0-No sidewalk, 1-There is a sidewalk

FIGURE 1 Probability Histogram Plots of the Poisson, NB and Zero-Inflated Models

FIGURE 1 Probability Histogram Plots of the Poisson, NB and Zero-Inflated Models

Results on Model Estimation

Based on the findings from fitting the empirical distributions, the NB and its zero-inflated version were therefore utilized for model estimation. Estimations were performed using Stata (Stata Corp LP (2008)) software. The z-statistic was used to assess the statistical significance of the variables. As expected, not all variables were statistically significant. The adequacy of individual models and effectiveness of the model extensions were evaluated based on the log-likelihood test. Beyond that, the corrected R2, reasonableness of the fitted values, such as their mean and maximum, were also monitored. Finally, the normalized Bayesian Information Criterion (BIC) was used to assess the most appropriate model. The BIC can be defined as:

(8) Normalized BIC = (Log-likelihood) / N - P/2 In(N)/N

Where P and N are the number of parameters and samples, respectively. Although the distributional models under consideration may have different scales for the log-likelihood function value, this criterion provides a preliminary model assessment within the same distributional model family.

ZINB And NB Model Fitness Performances

Model performance results are summarized in table 2. As shown in the results

  1. Vuong test of ZINB vs. standard NB was found to be 2.31 favoring ZINB over NB
  2. Sign and magnitude of the variable coefficients were almost identical in both the NB and ZINB models, but differed in standard error and z-statistics.
  3. The ZINB standard errors were observed to be slightly lower (tighter) compared to those in NB indicating ZINB was well fitted compared to the later.

The results clearly suggested that the zero-inflated model with mean-dependent over-dispersion factor performed better for school bus crash data modeling compared to other models tested. On the other end, comparing the two models, one can argue that it requires more computational time for ZINB than the NB model primarily due to the necessary intensive computation of the proportion of zeros. In addition, the Hessian matrix for determining the z-test values is much more complicated in the ZINB model.

From the "inflate" part of the ZINB model in table 2, the probability of being in zero school bus crash segments "φ" is determined as shown in equation 9. The inflate model is used to calculate the probability of zero crashes which inflates the number of zeroes then used in equation 10.

(9)   φi=(e(-45.54 - 0.207 * MW + 3.45*PPHV - 16.9*PCC - 31.13*TWLTL))/(1+e(-45.54 - 0.207 * MW + 3.45*PPHV - 16.9*PCC - 31.13*TWLTL))

The ZINB first estimates the effects of the independent variables on the crash frequency; these coefficients are interpreted just like standard NB coefficients. Secondly it estimates the equivalent of a binary logit model where the outcome variable is the log odds of being in the zero-school bus crash segment compared to being in the non-zero crash segments. The coefficients in the inflate part; labeled "INFLATE" (table 2) correspond to the binary model predicting group which considers probability of zero crashes. It specifies the model that determines whether the observed count is zero. These coefficients are interpreted just as the coefficients for a binary logit model. A positive coefficient means the independent variable has the effect of increasing the odds that the dependent variable equals a given value, usually 1 for binary dependents. A negative coefficient means that the independent variable has the effect of decreasing the odds that the dependent variable equals the given value. Utilizing the probability of zero crash model "inflate" in equation 9, school bus crash frequency are then predicted as follows;

Analysis of the Model Results with Respect to Crash Attributes

As stated earlier, the primary objective of this modeling effort was to evaluate school bus crash frequency as a function of explanatory variables. The school bus crash frequency here is defined as the number of crashes per year. The model performs reasonably well in term of the reasonableness of the model coefficients with respect to school bus crashes. As expected, roadway segments with posted school zone speed limit (PSSL) has negative coefficient with strong z-value showing its significance in reducing school bus crashes. Some literatures have pointed out traffic volume as surrogate for congestion and characteristics of increasing crash. As a common wisdom, AADT showed positive coefficient which supports previous findings (though not necessarily for school bus crashes) by Miaou and Lump (1993), Miaou (1994), Garber and Ehrhart (2000), Chimba et al. (2010) and others. As the number of vehicles per lane increases on the highway, fewer gaps allow lane changing, turning movements, or merging, which eventually increase likelihood of crashes. The analyzed segments ranged between a minimum of 1,190 vpd to 61,230 vpd AADT. Though not directly found from this study, a slight change in traffic characteristics can have noticeable impact to school bus operations which eventually can lead to risk of accident. Most often the school buses operate during peak hours (especially morning rush hours) where interaction with congested passenger cars becomes very common and may lead to scenario or likelihood of crashes.

(10) School Bus Cashes = (1-ϕi)*e[-0.66+0.0000026AADT+0.026NL-0.146LW-0.008LW-0.004OSW-0.059ISW+0.434TWLTL+0.207SL1 +0.285SL2 + 0.249PSSL-0.367PM+0.135PCC-0.556POS]

The coefficient of the percentage of peak hour volumes (PPHV) is positive in the inflated portion, which is consistent with some previous research findings, Shankar et al. (1997) and Ivan et al. (1999). This suggests that as the percentage of peak hour volumes increases, the likelihood of school bus crashes increases too. Number of lanes (NL) appears in "crash" portion only of the ZINB with a significant positive coefficient indicating increasing likelihood of school bus crashes if the segments have more lanes. Most of the previous studies also concluded with the same findings that the higher the number of lanes, the higher the crash rate, Noland and Oh (2004), Aty and Radwan (2000), Chimba et al. (2010) and Garber and Ehrhart (2000). As a general rule more lanes roadway sections are associated with more flow per lane which can be unsuitable for school bus safety.

Posted speed limit appeared to be significant only on the crash model portion and non-significant in the "inflate" portion. Both 40–45 mph (SL1) and 50–55 mph (SL2) posted speed limit categories have positive coefficients indicating the tendency to increase the probability of school bus crashes compared to lower 15–35 mph speed segments or no speed limit sections. However, the coefficient of 50–55 mph speed limit is slightly larger than that of 40–45 mph, denoting the higher the posted speed limit, the riskier are the school buses to be involved in a crash. The conclusion which can be drawn from the speed limit to school bus safety is school buses should avoid high speed routes.

Lane width (LW) was found to be significant only to the crash part of the developed ZINB model and not to the "inflate" portion. The variable was incorporated with the intention of evaluating how lane width affects school bus crash frequency. The coefficient of lane width is negative in the model with z-value of -1.84 which approximately 93% significant level. This means wider lanes are likely to reduce school bus crashes compared to narrow lanes. One can suggest that wider the lane provide extra separation between the vehicles which can give school bus buffer deviation in case of crash leading incidents. In addition, the buffer between vehicles can provide a room for the driver to correct before the crash. Wider lanes can as well give the driver more driving comfortability then reduce delays and improve capacity. On the other part, the study used median under two scenarios; median width (MW) and presence or absence of the median (PM). Median width was found to be a significant variable in both crash and inflate portions of the ZINB. Presence of median has a negative coefficient, the depiction that the school bus passing on the segments with medians will have low probability of being involved in crash compared to "no-median" (undivided) segments. Furthermore, roadway segments with wider medians according to model outputs are shown to be safer (negative coefficient) compared to narrow median segments.

TABLE 2 Modeling Results Using the ZINB and NB Models

CRASHES ZINB Negative Binomial (NB)
Coef. Std error Z-Value Coef. Std error Z-Value
AADT 2.60E-05 6.29E-06 4.2 2.90E-05 6.38E-06 4.5
Number of lanes (NL) 0.026 0.0139 1.89 0.028 0.0175 1.59
Lane width (LW) -0.146 0.0794 -1.84 -0.144 0.0809 -1.78
Median width (MW) -0.008 0.0041 -1.89 -0.008 0.0043 -1.75
Outside shoulder width (OSW) -0.004 0.0021 -1.99 -0.002 0.0021 -1.13
Inside shoulder width (ISW) -0.059 0.0309 -1.92 -0.055 0.0314 -1.75
Presence of TWLTL 0.434 0.1497 2.9 0.484 0.1509 3.21
Speed limit 40-45 mph (SL1) 0.207 0.11 1.88 0.206 0.1115 1.85
Speed limit 50-55 mph (SL2) 0.285 0.0993 2.87 0.262 0.1323 1.98
Presence of school speed limit (PSSL) -0.249 0.064 -3.92 -0.265 0.07 -3.77
Presence of median (PM) -0.367 0.1954 -1.88 -0.344 0.2058 -1.67
Presence of curb & gutter (PCG) 0.135 0.0708 1.9 0.191 0.1028 1.86
Presence of outside shoulder (POS) -0.556 0.189 -2.94 -0.579 0.1937 -2.99
Constant -0.66 0.8916 -0.74 -0.832 0.9138 -0.91
Length (offset) (offset)
Median width (MW) -0.207 0.136 -1.52 NA
Percent of peak hour volume (PPHV) 3.454 1.229 2.81
Presence of curb & gutter (PCG) -16.901 8.326 -2.03
Presence or absence of TWLTL -31.127 15.486 -2.01
Constant -45.535 15.488 -2.94
Length (offset)      
Alpha 1.376     1.524    
Likelihood-ratio test of alpha=0     18.52     121.78
Vuong test of ZINB vs. standard NB     2.31   NA  

As stated, median (MW and PM), outside shoulder widths (OSW) and presence of outside shoulder (POS) both have negative coefficients. For the school bus safety point of view, these cross-sectional elements can be used for emergency stops, hence the wider they are the better for the driver correction and for bus safety in general. Inside shoulder width (ISW) is significant with a negative coefficient, indicating likelihood of bus crashes decreases as shoulder width increases. These findings are consistent with some previous studies on general crashes, Chimba et al. (2010), Milton and Mannering (1998), Aty and Radwan (2000), and Lee and Mannering (2002). From a highway safety standpoint, shoulder can be used for vehicle stopping in an emergency or during an incident, and drivers can take advantage of wider shoulders to avoid hitting roadside obstacles. Besides that, wider shoulders can also be used for deceleration to avoid a crash.

Median composition (no median, concrete barrier, planted grass or painted) does not appear in the final model due to insignificancy. Though not appearing in final model, they are still considered very crucial factors with respect to school bus crash. Painted and grass/ lawn median (without concrete barrier) can allow the school bus to accelerate further from the main travel way which can help avoid a crash. As shown in table 2, two-way left turn lane (TWLTL) median generated mixed results. In the crash portion of the ZINB, this variable has positive coefficient, the sign that presence of TWLTL creates safety concerns for school bus crashes. As raised median play a very vital part in separating opposing traffic streams and reduces access from the mainline, TWLTL does the opposite. The environment in which vehicles use the TWLTL for making U-turns or for left turns access is seen as a generator of conflict points especially for school buses. However TWLTL is negative in the "inflate" part of the model.

Apart from drainage purposes, presence of curb and gutter (PCG) can be considered to prevent vehicles from accelerating beyond the travel lane. Table 2 shows PCG to be significant in both "crash" and "inflate" parts of the ZINB. The variable is positive in the "crash" portion but negative in the "inflate" portion. The common wisdom could have considered the presence of curb and gutter to reduce probability of school bus crashes, but in most cases segments with curb and gutter also have sidewalks which may skew the expectations. As sidewalks accommodate pedestrians, one could expect segments with curb and gutter to be associated with minor crashes resulting from hitting the curb walls. The positive coefficient might have been caused by the effect of population which also do influence crash frequency. Roadway segments in densely populated areas and which have high density of school buses often have curb and gutter compared to low population areas. Effect of curb and gutter as well as sidewalks can furthermore be linked with adjacent land uses.

Finally, it should be noted that school buses usually run in a designed route and mostly during peak hours to pick up and drop off students. Most of the bus drivers are familiar with the hazards along the route with the exception of new drivers. That means, some of bus crashes may therefore have nothing to do with road characteristics as found in modeling results discussed.


Based on the postulation that crashes follow a Bernoulli process with an unequal probability of independent events, the authors applied the zero-inflated negative binomial (ZINB) distributional model to analyze school bus crash data. The ZINB model is derived in Bayesian statistics by conjugating the negative binomial model. Statistically, the ZINB model could explain extra zero variability beyond the standard negative binomial model and is therefore capable of modeling a wider range of data variability especially for school bus crashes which are very rare than the Poisson and NB models. Comparison of ZINB and NB variable coefficients yielded almost similar sign (negative or positive coefficients) but ZINB resulted with slightly tighter standard errors hence slightly stronger z-values compared to NB. This recaps that for crash data such as those related to school buses, the use of zero inflated models is more applicable in lieu of standard count models. The results show that the ZINB model with a mean-dependent over-dispersion factor yields better performance and is recommended for use in school bus crash data modeling, with the caveat that the model may slightly over estimate the mean of crash frequency. This implies that, in high-risk roadway segments with low traffic exposure, the occurrence of school bus crashes will possibly deviate from the Poisson or NB process. In this case, the ZINB becomes more appropriate and provide more flexible modeling framework.

During the modeling process, some of the explanatory variables showed inconsistent signs and significance levels. The study found that high traffic volume (AADT), more number of through lanes (multilane section), presence of two-way left turn lanes, high posted speed limit sections, presence of curb and gutter have positive effect in increasing the probability of school bus related crashes. On the other hand, wider lanes, median, outside shoulder, inside shoulder, presence of posted school speed limits, presence of median (divided roadways), and presence of outside shoulders both have negative coefficients describing their effect in reducing the chances of school bus related crashes.

The study analyzed factors to consider when planning for school bus routes. Minimization of school bus crashes will be achieved by avoiding routes with roadway cross-section features found to influence probability of crashes. For instance, as found in the study, a school bus route planned through a multilane roadway with congested traffic volume, higher posted speed limits, TWLTL will have higher probability of being involved in a crash compared to section with opposite characteristics.

As stated in previous sections, local streets and non-state roads were not included in the analysis due to unavailability of the traffic counts, an essential element in crash modeling. Future studies should evaluate school bus crashes on these roadway classes, the findings might vary due to low traffic volumes, low operating speed as high volume of pedestrians.


Aty, M. A. and A. E. Radwan, 2000. Modeling traffic accident occurrence and involvement. Accident Analysis and Prevention Vol. 32, No. 5, 633–642.

Bonneson, J. A. and P. T. McCoy, 1993. Estimation of safety at two-way stop-controlled intersections on rural highways. Transportation Research Record Vol. 1401, 83–89.

Cameron, A.C. and P.K. Trivedi, 1986. Econometric models based on count data: comparisons and applications of some estimators and tests. J. Appl. Econometrics Vol. 1, 29–53.

Chimba, D., T. Sando, and V. Kwigizile, 2010. Effect of bus size and operation to crash occurrences. Accident Analysis & Prevention Vol. 42, (6), 2063–2067.
Garber, N. J. and A. A. Ehrhart, 2000. The Effect of Speed, Flow, and Geometric Characteristics on Crash Rates for Different Types of Virginia Highways. Virginia Transportation Council.

Hardin, J. and J. Hilbe, 2000. Generalized Linear Models and Extension. Stata Corp.

Hauer, E. 2001. Overdispersion in modeling accidents on road sections and in empirical Bayes estimation. Accident Analysis and Prevention Vol. 33, 799–808.

Hauer, E., J.C.N. Ng, and J. Lovell, 1989. Estimation of safety at signalized intersections. Transportation Research Record Vol. 1185, 48–61.

Ivan, J. N., R. K. Pasupathy, and P. J. Ossenbruggen, 1999. Differences in causality factors for single and multi-vehicle crashes on two-lane roads. Accident Analysis and Prevention Vol. 31, No. 6, 695–704.

Johnson, N. L. and S. Kotz, 1969. Discrete Distribu-tions: Distributions in Statistics. Wiley, New York, N.Y.

Joshua, S. C. and N. J. Garber, 1990. Estimating truck accident rate and involvements using linear and Poisson regression models. Transportation Planning and Technology Vol. 15, 41–58.

Jovanis P. P., and H. L. Chang, 1986. Modeling the relationship of accidents to miles traveled. Transportation Research Record Vol. 1068, 42–51.

Lapner, P. C., D. Nguyen, and M. Letts, 2003. Analysis of a school bus collision: mechanism of injury in the unrestrained child, Can. J. Surg. Vol. 46, No. 4, 269–272.

Lee, J. and F. Mannering, 2002. Impact of Roadside Features on the Frequency and Severity of Run-off-Roadway Accidents: Empirical Analysis. Accident Analysis and Prevention Vol. 34, 149–161.

Lord, D., S.P. Washington, and J. N. Ivan, 2005. Poisson, Poisson-gamma and zero-inflated regression models of motor vehicle crashes: balancing statistical fit and theory. Accident Analysis and Prevention Vol. 37, 35–46.

Maher M. J., and I. Summersgill, 1996. A Comprehensive Methodology for the Fitting of Predictive Accident Models. Accident Analysis and Prevention Vol. 28, 281–296.

Maycock, G. and R. D. Hall, 1984. Accidents at 4-arm roundabouts. Laboratory Report LR 1120, Crowthorne, Berks, U.K., Transport Research Laboratory.

McGeehan, J., J.L. Annest, M. Vajani, M.J. Bull, P.E. Agran, and G. A. Smith, 2006. School bus-related injuries among children and teenagers in the United States 2001–2003. Pediatrics Vol. 118, No. 5, 1978–1984.

Miaou, S., 1994. The relationship Between Truck accidents and Geometric Design of Road Sections: Poisson versus negative Binomial Regressions. Accident Analysis and Prevention Vol. 26, 471–482.

Miaou, S. and D. Lord, 2003. Modeling traffic crash-flow relationships for intersections: dispersion parameter, functional form, and Bayes versus empirical Bayes methods. Transportation Research Record Vol. 1840, 31–40.

Miaou, S. and H. Lump, 1993. Modeling Vehicle Accidents and Highway Geometric Design Relationships. Accident Analysis and Prevention Vol. 25, 689–709.
Milton, J. and F. Mannering, 1998. The relationship among highway geometrics, traffic-related elements and motor-vehicle accident frequencies. Transportation Vol. 25, 395–413.

Mullahy, J., 1986. Specification and testing of some modified count data models. J. Econometrics Vol. 33, 341–365.

Noland, R.B. and L. Oh, 2004. The Effect of Infrastructure and Demographic Change on Traffic-Related Fatalities and Crashes: A Case Study of Illinois County-Level Data. Accident Analysis and Prevention Vol. 36, 525–532.

Qin, X., J. N. Ivan, and N. Ravishanker, 2004. Selecting Exposure Measures in Crash Rate Prediction for Two-Lane Highway Segments. Accident Analysis and Prevention Vol. 36, 183–191.

Sawalha, Z., 2003. Statistical Issues in Traffic Accident Modeling. In Proceedings of the 82th Annual Meeting of the Transportation Research Board, January 12–16, Washington, D.C.

Shankar, V., J. Milton, and F. L. Mannering, 1996. Modeling accident frequency as zero-altered probability processes: an empirical inquiry. Accident Analysis and Prevention Vol. 29, No. 6, 829–837.

StataCorp LP, 2008. Data Analysis and Statistical Software. College Station, Texas.

Washington, S.P., M.G. Karlaftis, and F. L. Mannering, 2002. Statistical and Econometric Methods for Transportation Data Analysis. Chapman & Hall/CRC.

Yang, J. C., Peek-Asa, G. Cheng, E. Heiden, S. Falb, and M. Ramirez, 2009. Incidence and characteristics of school bus crashes and injuries. Accident Analysis and Prevention Vol. 41, 336–341.

Updated: Friday, May 19, 2017