BARAK FISHBAIN OFFER GREMBEK

Department of Environmental, Water and Agricultural Engineering, The Faculty of Civil & Environmental Engineering, Technion-Israel Institute of Technology Haifa 32000, Israel Safe Transportation Research and Education Center, University of California, Berkeley, CA 94720-7374, USA

**KEYWORDS**: traffic safety, vehicles crashes, graph cuts, clustering, fatal accidents

### Abstract

Road fatalities are rare outcomes of events that occur in a small time-space region. Although the exact chain of events for each fatality is unique, there are inherent similarities between road fatalities. The science of road safety is dedicated to identifying such similarities, mainly using statistical analysis tools. Researchers typically analyze patterns that emerge over space, such as hot-spot studies, or patterns that emerge over time, such as before-after studies. Traffic research enumerates 84 parameters that characterize a road fatality. A vast number of papers have tried to find the correlation between one or two parameters. In those studies quite often the contribution of other factors is omitted. In this research we utilize a clustering graph theoretic method, known as graph-cuts, for segmenting a very large crash dataset (i.e., all fatal car crashes in the last 2, 5, or 10 years), while incorporating all available crash information into the process. The analysis of the clusters allows one to find subtle trends and significant causes for traffic fatalities. With this method, for example, we have found high correlation between hit-and-run and pedestrians fatalities, which was overlooked by previous studies. An additional output of the research is a full description of the typical fatality, thus all factors that characterized the representative crash in a cluster.

### Introduction

The study of road crashes and their causes affects many fields, such as vehicle design, road design, transportation planning, law enforcement, policy making, and actuarial science. The identification and analysis of factors affecting the frequency and severity of crashes is a fundamental step in all of the aforementioned applications and disciplines. Traffic crashes are expected to form clusters in the spatiotemporal space as many of the crashes' contributing factors exhibit spatial and temporal patterns (McGuigan 1981; Plug et al. 2011; Shino 2008; Steenberghen et al. 2004). For example, collision frequency is typically tied to traffic volumes, which present spatial and temporal patterns (Eleni et al. 2007). Spatial correlation is typically attributed to highway infrastructure, its access and egress points and its deficient design and maintenance (Black 1991). In this study we apply data cluster analysis for identifying the factors affecting the frequency of fatal crashes.

The spatial distribution of crashes has been investigated mainly in an attempt to identify high collision concentration locations (i.e., hot spots) (McGuigan 1981; Steenberghen et al. 2004). Many studies have utilized point pattern analysis for this task (Bailey and Gatrell 1995; Cressie 1993; Diggle 1983; Fotheringham et al. 2000). This cohort of methods aims at determining whether an observed distribution of point events results from a random pattern or whether there is an underlying mechanism affecting these crashes. These methods include quadrant analysis (Shino 2008), nearest-neighbor analysis (Getis and Franklin 2010), Kernel Density Estimation (KDE) (Xie and Yan 2008), and K-function techniques (Yamada and Thill 2004). Each of these methods produces good results, but suffers from inherent deficiencies. Most of the traditional quadrat and nearest-neighbor analyses (e.g., Bailey and Gatrell 1995; Cressie 1993; Diggle 1983; Fotheringham et al. 2000) do not regard the fact that a traffic crash is an event that is constrained by the transportation network. Shino (Shino 2008) coped with this problem by driving the transportation network into mutually exclusive subnetworks (i.e., network-based quadrats). The complexity of finding a single quadrat is a function of the entire network's size. For n_{e} representing the number of links and n_{p} representing the number of nodes in the network, the complexity of finding a single quadrat is given by O(n_{e}logn_{p}) + O(n_{e}+n_{p}) (Aho et al. 1983, Shino 2008). Furthermore, the problem of finding a set of subnetworks that covers the entire network is essentially the Set Cover Problem (SCP), which is known to be computationally intractable (i.e., NP-Hard) (Alon et al. 2006; Karp 1972; Shino 2008). Hence, the problem is computationally intense and consequentially is not suitable for applications where networks of more than a few dozens of miles are considered (Ang et al. 2012; Shino 2008). KDE methods drape a grid of equalsized cells over the transportation network and perform the analysis in each cell. However, these methods disregard the density of the network links—hence, the number of road sections in each grid cell. This in return biases the results (Steenberghen et al. 2004). The KFunction method produces a global measure, which does not reveal the clusters' locations. Steenberghen et al. (Steenberghen et al. 2010) suggest a moving segment approach for solving this problem. The moving segment measures, at each point of the road network, the level of risk over all the road sections given a specific distance. To allow for the consideration of the network constraints, the distance is measured along the network. The main criticism of spatial methods is that they consider only location-related information. Hence they omit temporal and other data from the analysis.

Of the numerous factors that play a role in collisions, the temporal factors (i.e., time of day, day of the week, and year) are significant determinants. Many different methodological approaches for modeling occurrence of traffic crashes have been developed (Lord and Mannering 2010). Loosely speaking, methods for temporal analysis of traffic crashes occurrence can be categorized into two (Abdel-Aty and Pande 2007): methods which investigate crashes' frequency over long periods of time (Golob and Recker 2003; Hauer 1986; Miaou and Lum 1993; Persaud 1991); and techniques for real-time crash analysis, which determine the probability of crash occurrence in real time (Golob and Recker 2004; Lee et al.). Temporal analysis often addresses specific questions. The distribution of fatal crashes before and after the annual daylight savings time (DST) changes in spring and fall was examined to identify the increased risk to pedestrians in darkness (Sullivan and Flannagan 2001). In this study crashes were taken from a one-hour clock window three weeks before and three weeks after the transitions in both the spring and fall. The key assumption was that traffic conditions are the same in the weeks immediately before and after the changeover to DST, as traffic is principally governed by clock time rather than sunset time. Observed differences in crash levels between these two periods were attributed to the difference in ambient light level, and therefore can be used to quantify the effect of light in fatal crashes. While the research has shown a significant increase in risk to pedestrians, the method is limited to a short span of time. A simple visual exploratory approach to examine the relationship between fatal pedestrian crashes and time of day, day of the week, and time of the year showed that the first two hours of darkness typically presented the greatest frequency of pedestrian fatal collision (Griswold et al. 2011). The temporal correlation of fatal crashes with the drivers' age and alcohol consumption were also investigated within the scope of this study. Road traffic crash study in Christchurch, New Zealand, showed a comparative increase in rates during morning rush hour and during school run, the times students are traveling to and from school - 08.00–09.00 and 15.00–15.30 in this specific study (Kingham et al. 2011). It is important to note that this study failed to find spatial patterns in the data set. However, the search for patterns was held separately for the temporal and the spatial factors.

Other studies, aimed at finding an association between one or more crash attributes, employ various statistical regression analyses. The most common models are multiple linear regression, Poisson regression and Negative Binomial (NB) regression. NB and Poisson models were used for selecting the variables with the highest impact (e.g., Wong et al. 2007) and for predicting crashes (e.g., Lord and Mannering 2010). This work is conducted under a univariate modeling regime, where each variable is a scalar value. Recent work has extended these frameworks to multivariate models, in which each variable is a vector, typically describing different crash parameters (e.g., Ma and Kockelman 2006; Miaou and Song 2005). However, Both Poisson and NB models apply strong assumption regarding the crash data: the Poisson model requires the variance-tomean ratio of the crash data to be about 1, and both Poisson and NB models require the crash data to be uncorrelated in time. As there is a correlation in time ( e.g., Griswold et al. 2011), both models are limited.

Finite mixture/Markov switching models are also a common tool for examining heterogeneous populations (Frhwirth-Schnatter 2006). In these models, the underlying assumption is that the overall data are generated from several distributions that are mixed together and that individual observations can switch among these distributions over time. In recent years, a few researchers have examined the application of finite mixture models to highway safety as they offer considerable potential for providing important new insights into the analysis of crash data. However, these models are quite complex to estimate (Lord and Mannering 2010).

Machine learning techniques were also used for analyzing contributing factors to crashes. Neural network models, such as Back Projection Neural Network (BPNN), have been recently used for modeling and predicting crash data (Abdel-Aty and Pande 2007; Mussone et al. 1999; Xie et al. 2007). These models often over-fit the data, especially when the sample size is small (Vogt and Bared 1998). Methods based on Bayesian Neural Networks (BNN) overcome the over-fitting problem (e.g., Xie et al. 2007) and were found to be more efficient than NB models for predicting crashes. Support Vector Machine (SVM) methods (Li et al. 2008; Zhang and Xie 2008) were suggested for the crash analysis task as well. SVM-based methods present less over-fitting and can be generalized in a simpler fashion than BPNN or BNN (Li et al. 2008). SVM, however, is computationally complex, and analyzing a large data set becomes impracticable.

This paper presents a two-step methodology, which avoids most of the deficiencies associated with the traditional data analysis methods. The method does not apply any assumptions on the data and does not require any tuning stage. A similar notion was recently presented by Depaire et al. (Depaire et al. 2008). In their study, the accidents were grouped by accident type, accident location type, driver age, road type, and vehicle type. While this study excluded analysis subjectivity and estimated a model stochastically, the set of parameters was preselected, which weakens the results of this set of features.

In the method presented here, at the first stage a graph-theory based method is employed to identify the set of road fatalities that are the least different (the typical fatality). By doing this we remove some of the noise originating from the most different fatalities. To obtain this we utilize clustering graph theoretic methods, known as graph-cuts (Ford and Fulkerson 1956) for segmenting crash data-sets. The segmentation is done using all available crash data. Incorporating all the information at hand allows one to find associations between the various factors, which may not have been identified otherwise. At the second stage, each group is analyzed by applying conventional statistical techniques. In this second stage both intra-group and inter-group analyses are conducted. Intra-group analysis allows for finding subtle trends and patterns as the groups are more homogeneous than the entire data set. Inter-group analysis facilitates the examination of the differences between the groups and effective identification of interactions.

### Method

#### Graph Representation of Traffic Crash Data

A graph theoretical framework is suitable for segmentation and grouping problems of multidimensional data in general, and for multivariate crashes data in particular. The segmentation problem is presented on an undirected complete graph G=(V,E), where V is the set of crashes and E are the set of edges connecting two crashes. Each edge carries a weight associated with the similarity of the two crashes it connects (See figure 1). The similarity weight w_{ij}, which is associated with the edge connecting nodes i and j increases as the two crashes i and j are perceived to be more similar. Low values of w_{ij} are interpreted as dissimilarity.

**FIGURE 1 Example of a Complete, Undirected Graph With Edge Weights**

Each node corresponds to a crash, and edge weights reflect the similarity between two crashes

The similarity between two vertices i and j in the graph is determined by a function that takes as input a feature or observation vector x_{i} (for vertex i) and another, x_{j}, for vertex j. In our application, x_{i} is the i^{th} crash's feature vector (e.g., date, time, number of involved vehicles, number of fatalities, manner of collision, driver's alcohol consumption, number of lanes, etc.). The function outputs a single realvalued number, where larger numbers indicate a higher degree of similarity between i and j. Depending on the properties of the feature vectors, a variety of similarity functions can be used. The most commonly used similarity metrics are measures of correlation, such as Pearson's correlation or Kendall's τ rank correlation (Kendall 1938). Euclidean distance or l^{2} -norm is a common measure of *dissimilarity*, where larger distances correspond to greater degrees of dissimilarity. For a, b, and c userdefined parameters, the Gaussian similarity function transforms the Euclidean distance measure to a similarity function as follows:

(1) S_{ij}=S(x_{i},x_{j})=aexp{-b||x_{i}-x_{j}||^{c}}

We opted to use this similarity function due to its intuitive behavior and flexibility, however, any alternative monotonically increasing function ||x_{i} - x_{j}|| in may be used in its place.

#### Cuts on a Graph

We now provide a formal definition for cuts or partitions on a graph, which is a major component of our proposed method.

A bipartition of a graph is called a *cut*, (S,S) = {{ [i,j] | i ∈ S,j ∈ S}} where S is the complement of S, (S=V\S)). This is illustrated in figure 2. The *capacity of a cut *(S,S) is defined as the sum of the weights of all edges between S and S, thus edges that have one endpoint in S and the other in S. In the example given in figure 2, these edges are represented in solid lines. Formally the capacity of the cut is given by:

**FIGURE 2 Example of a Cut on an Undirected, Complete Graph**

The cut is indicated by the dark black line that partitions the node set V into two disjoint sets: S and S . The capacity C(S,S) of the cut is the sum of the weights of the edges that cross the cut (i.e., the sum of the weights of all edges that have one endpoint in S and one in S.

(2) C(S,S) = Σ i∈S,j∈S^{w}ij

More generally, for any pair of sets A, B ⊆ V, we define the set of edges going between these two sets as (A,B) = {[i,j]|i ∈A,j ∈ B}, and the capacity of (A,B) is

(3) C(A,B) = Σ i∈A,j∈B^{w}ij

Consequentially, the capacity of a set, A⊂V is given by:

(4) C(A,A) = Σ i,j∈A^{w}ij

and is denoted by C(A).

Finally, the total sum of the weights of edges from nodes in S⊂V to all nodes in the graph, V, is denoted by d(S) and is referred to as the volume of the set:

(5) d(S) = C(S,V) = Σ i∈S,j∈V^{w}ij

Using the notation above, one can easily see that:

(6) d(S) = C(S,V) = C (S,S) + C(S)

and

(7) C(S) = C(V) - d(S)

where C(V) is the capacity of the graph— hence, the sum of all edge weights in G(V,E), which for a given graph, G(V,E), is constant.

The segmentation problem can be formulated in many ways. One common formulation is to find a partition into two sets which minimizes the similarity between crashes that are associated with different groups. This problem is the minimum-cut problem (Ford and Fulkerson 1956), which oftentimes results in an unbalanced partition (Shi and Malik 2000). Shi and Malik tried to overcome this unbalance problem, by maximizing the dissimilarity between groups and the similarity within a group (Shi and Malik 2000). This is achieved by finding a partition to two nonempty disjoint sets minimizing the quantity called the Normalized Cut (NC):

(8) NC(S,S) = (C(S,S)/d(S)) + (C(S,S)/d(S))

This objective function drives the segment S and its complement to be approximately of equal size. However, there is no efficient way for finding the partition S that minimizes equation (8). Hence, this objective is NP-hard (by reduction from set partitioning (Shi and Malik 2000). In this paper we use a slightly different quantity, NC ":

(9) NC"(S,S)=(C(S,S)/c(S))+(C(S,S)/C(S))

The set ⊂V which minimizes equation (8) also minimizes equation (9). As shown below:

(10) C(S,S)/C(S)+C(S,S)/C(S) = C(S,S)/(C(V)-d(S)) + C(S,S)/(C(V)-d(S))

As C(V) is constant, for a given G(V,E); and as C(V) > d(S) for all nonempty S ⊆ V equation (9) is equivalent to equation (8). This means that the set S, which minimizes equation (8) also minimizes equation (9) and finding it is also NP-hard.

The NC'' problem finds a bi-partition of the graph. The extension of this objective to partition the graph into K segments is given by:

(11) NC"_{K}=min_{S1,S2,S3,...,SK} (ΣKi=1 C(S_{i},S_{i})/C(S_{i}))

For solving the multigroup segmentation, we extend a stochastic approximation scheme, originally suggested by Karger and Stein (Karger and Stein 1996) for solving the minimum-cut problem, for solving the multi-segmentation NC''_{K} problem.

### Stochastic Approximation for the Normalized Cut Problem

This algorithm is iterative. At each step one edge e_{ij} , connecting two vertices, i and j, is randomly selected. The selected edge, e_{ij} , is removed and the vertices i and j are merged into one *meta-node k*. The edges going from any node v in the remaining of the graph to either i or j are replaced with edges connecting v and the new meta-node k. This edge's weight is then given by:

(12) W_{k,v} = W_{j,v} + W_{i,v}

The rest of the graph remains unchanged. An example of an edge contraction is given in figure 3. The algorithm ends when there are K nodes (or meta-nodes) left in the graph. Because the algorithm has a stochastic component in it, the entire process is repeated several times, and the graph partitioning that gives the minimum value of equation (11) is chosen.

**FIGURE 3 Merging Step**

Merging nodes i and j into node k

The merging algorithm is based on the idea that because the number of edges that reside on the cut is small, a randomly chosen edge is unlikely to be part of the cut. In a try to increase the chances that the optimal minimum of (11) is reached, the probability, P(e_{ij}), of choosing an edge e_{ij} is inversely proportional to the maximum value of the NC'' objective of the two nodes it connects, hence:

(13) P(e_{ij}) ∝ 1/max{C(S_{i},S_{i})/C(S_{i},S_{i}),C(S_{j},S_{j})/C(S_{j},S_{j})}

### Data

#### Crash Data

To illustrate the suggested method, we examine 1,573 fatal collisions in California during 2008. The crash data was extracted from the Fatality Analysis Reporting System (FARS), a comprehensive surveillance system of U.S. fatal collisions maintained by the National Highway Traffic Safety Administration (NHTSA). The 1,573 fatalities were chosen arbitrarily by ordering them according to the recorded time of the crash and taking every other crash.

For each fatality, the FARS system holds more than 125 data elements. These elements are reported on four standard forms (Accident, Vehicle, Driver and Person) that include detailed information about persons and vehicles involved in the crash, and the physical environment in which the crash occurred (NHTSA 2004). While the algorithm is not restricted by the number of features incorporated into the crashes' feature vectors, we focus here on all 28 features included in the Accident file (see table 8). This subset consists of the features that describe the crash and its possible contributing factors; time related factors; and road infrastructure descriptors. The objective here is to identify causal attributes of fatal crashes; post-crash attributes were omitted from the analysis. Additional administrative attributes such as case id and milepost were also removed.

#### Similarity Measures

For computing the distance of two feature vectors, as a measure for similarity (inverse proportion—see Section 2.1), we consider the nature (or type) of the different variables. For all types of variables the distance is normalized so the maximum possible distance is 1. The distance between any two crashes is computed by summing both crashes' features' distances.

The distance of time-related variables (month, day of the month, hour and minute) between two crashes is computed in a cyclic fashion. Hence, the distance between a crash that occurs on the 31st to a crash that took place on the 1st, is the same as the distance between the latter and a crash on the 2nd. Similarly, the distance between a crash that is reported at 11 pm and a crash at midnight is the same as the distance between midnight and 1am.

Out of the 28 features considered in this study, three categorical variables were grouped due to their apparent association. For these variables the possible reported values are first grouped into a small number of groups. If two crashes' values fall into the same group, then the distance is 0, otherwise it is 1. For example for manner of collision, we determine 5 categories: collision with a fixed object, collision with other road user, car malfunction, driver's fault and non-collision. For hit and run we designate two groups: hit and run has occurred (hit and run after collision with motor vehicle in transport, after collision with pedestrian, after collision with parked/stopped car, etc.) and no hit and run. Days of the week are grouped into four groups: weekdays (Tuesday through Thursday), Monday, Friday and weekend (Saturday and Sunday). The reason for this grouping is that Monday, Friday and weekends exhibit different crash patterns ( e.g., Griswold et al. 2011).

For all other variables, if they are the same then the distance is 0, if they are different then the distance is 1. By doing this we avoid hard questions such as whether the distance between 1-lane road and 2-lanes road is the same as the distance between 2-lanes to 3-lanes roads. Other similar questions are the distance between different roadway surface conditions and number of fatalities.

As described in Section 2.1 the weights on the graph represent the similarity between two crashes rather than the distance. To shift from distance to similarity, we use equation (1), where a = 1, b = 0.1 and c = 1. These values were selected after a binary search and they provide the lowest value for NC"_{k},equation (11).

### Results

Each crash in the data set is characterized by 28 features grouped into three types:

**"what" variables:** include manner of collision, harmful event, number of fatalities, number of people involved, drunk drivers, number of pedestrians, number of involved vehicles in transport, total number of vehicles and whether it was a hit and run.

**"when" variables:** include date, time, and light conditions.

**"road" variables:** speed limit, number of lanes, roadway alignment and profile, traffic flow, relation to junction and roadway, surface condition, traffic control device, and construction zone indication.

We present here four different analyses, all using the proposed method. The difference between the analyses is the features that are included in the feature vectors. We run the analysis when the feature vectors consist only of the "what", "when" and "road" variables, as well as a model incorporating all variables.

For executing the clustering one has to compute the similarity matrix and then find the minimum cut, which in the scheme presented here, is an iterative process. The runtime is on average 25 µ-seconds per sample per feature for creating the similarity matrix, and 28 seconds per iteration for computing the cut, regardless of the feature vector size. Thus, for 1,573 samples with feature vectors of length 28 (all features) it takes about 880 seconds for computing the similarity matrix and about 3 hours for 300 iterations, which were found to be sufficient for the clustering algorithm here.

#### Analysis of the "What" Variables

The segmentation of the 1,573 fatalities, considered in this study, into 12 clusters is presented in table 1. For each cluster, the table presents its size and the mean and mode (most frequent) values for each of the variables. The latter is important for descriptive features (i.e., manner of collision and harmful event) for which averaging makes little sense.

The presentation of the crash data in this fashion reveals an interesting phenomenon. For the analysis, let us consider clusters 1 and 2 of table 1 as one group, A, and clusters 3,4,5 and 6 as group B. This grouping procedure allows to pinpoint what are the features that separate the clusters of group B from the ones in A. Examining group B's characteristics shows an inverse proportional relation between the number of crashes involving pedestrians and the hit and run frequency. This is illustrated in figure 4, where each of the four clusters of group B are projected on a two-dimensional space— with the percentage in the cluster of Hit & Run accidents as one dimension and the percentage of accidents with pedestrians involved as the second. Hence, we have characterized crashes in group A by observing what makes this group different with respect to group B. Group A consists of 1,179 crashes out of 1,573 accidents that were analyzed. If we take the characteristics of the crashes in group A to represent the typical (or common) crash, we have found a behavioral relation between frequency of pedestrian involvement in crashes and the hit and run percentage.

**FIGURE 4 Trend Analysis of "What" Variables of Group B, Which Consists of the 3rd , 4th , 5th, and 6th Largest Clusters**

It would be reasonable to assume that the impact in a fatal accident when two cars are involved is much larger than a fatal crash when a vehicle and a pedestrian are involved, as in the former the passengers have the car's frame and its safety means to protect them. Therefore it is more probable that a car can flee the scene after being involved in a fatal accident with a pedestrian than after a fatal crash with another car. While intuitive, this association has not been investigated. Studies that tried to characterized hit and run crashes (e.g., Rifaat and Chin 2007; Tay et al. 2009; and Tay et al. 2008) found many other factors, such as roadway functional class, routes, traffic flow, types of roadway section, speed limit, traffic control device, functioning of traffic control device, lighting condition, roadway alignment and roadway profile, as important determinants. In these studies all accidents data were considered in the analysis. As a result crashes in group B weaken the connection found here. When applying the segmentation and effectively removing group B crashes, this connection manifests itself.

**TABLE 1 “What” Variable Segmentation**

For each cluster, the mean and the most common value (mode) of each characteristic are presented: The cluster size (size), Manner of Collision (MAN COLL), Harmful Event (HARM EV), number of fatalities (Fatals), number of persons involved (PERSONS), number of drunk drivers (DRUNK), number of pedestrians involved (PEDS), number of vehicles in transport involved (VE FORMS), number of total vehicles involved (VE TOTAL) and Hit and Run (HIT RUN)

Size | Man. coll. | Harm ev. | Fatals | Persons | Drunk | Peds. | Ve. forms | Ve. total | Hit run | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

mean | mode | mean | mode | mean | mode | mean | mode | mean | mode | mean | mode | mean | mode | mean | mode | mean | mode | ||

1 | 741 | 1.644 | 0 | 16.717 | 12 | 1.089 | 1 | 2.825 | 2 | 0.418 | 0 | 0.238 | 0 | 1.571 | 1 | 1.642 | 1 | 0.085 | 0 |

2 | 438 | 1.189 | 0 | 17.217 | 12 | 1.091 | 1 | 2.614 | 2 | 0.436 | 0 | 0.237 | 0 | 1.505 | 1 | 1.543 | 1 | 0.08 | 0 |

3 | 192 | 0.672 | 0 | 17.557 | 8 | 1.042 | 1 | 2.25 | 1 | 0.427 | 0 | 0.302 | 0 | 1.302 | 1 | 1.365 | 1 | 0.057 | 0 |

4 | 92 | 1.217 | 0 | 14.239 | 12 | 1.12 | 1 | 2.761 | 2 | 0.38 | 0 | 0.272 | 0 | 1.489 | 1 | 1.543 | 1 | 0.076 | 0 |

5 | 59 | 2.966 | 0 | 13.814 | 12 | 1.051 | 1 | 3.458 | 2 | 0.322 | 0 | 0.085 | 0 | 1.966 | 2 | 1.966 | 2 | 0.102 | 0 |

6 | 25 | 0.52 | 0 | 19.72 | 12 | 1.44 | 1 | 3.28 | 2 | 0.4 | 0 | 0.28 | 0 | 1.76 | 1 | 1.92 | 1 | 0.04 | 0 |

7 | 10 | 0 | 0 | 32 | 42 | 1 | 1 | 1.4 | 1 | 0.8 | 1 | 0.2 | 0 | 1 | 1 | 1 | 1 | 0.1 | 0 |

8 | 8 | 2.875 | 1 | 12 | 12 | 1.125 | 1 | 3.375 | 3 | 0.375 | 0 | 0 | 0 | 2.375 | 2 | 2.375 | 2 | 0.125 | 0 |

9 | 4 | 1 | 0 | 17 | 12 | 1 | 1 | 3 | 1 | 0.5 | 0 | 0.25 | 0 | 1.75 | 1 | 1.75 | 1 | 0.25 | 0 |

10 | 2 | 4.5 | 2 | 12 | 12 | 1 | 1 | 3 | 3 | 0 | 0 | 0 | 0 | 3 | 3 | 3 | 3 | 0.5 | 0 |

11 | 1 | 1 | 1 | 12 | 12 | 2 | 2 | 4 | 4 | 1 | 1 | 1 | 3 | 2 | 2 | 2 | 2 | 0 | 0 |

12 | 1 | 7 | 7 | 14 | 14 | 1 | 1 | 2 | 2 | 0 | 0 | 1 | 1 | 1 | 1 | 2 | 2 | 0 | 0 |

The results of an analysis to determine whether the link between pedestrians involvement and hit and run frequency holds outside the scope of the 1,573 fatalities investigated in our initial study are presented in figure 5. The analysis was carried out over the entire 2008 FARS data set, which holds 34,017 fatalities. Figure 5 depicts a plot of the frequency of hit and run versus the frequency of pedestrians crashes for each of the 53 states and U.S. territories. It is evident from figure 5 that this association between pedestrians and hit and run does hold.

For quantitative analysis of the results we apply analysis of variance (ANoVA) (Hogg and Ledolter 1987) on the segmentation results of the entire data-set—once for the clustering results and once for an arbitrary random clustering. For each "what" variable, ϑ_{i}, we compare the F-statistics and its corresponding p-value for the null-hypotheses between the clustering results and a random clustering assignment (with the same clusters size). The smaller the p-value for a given ϑ_{i}, the less homogeneous the different clusters with respect to this variable. In the comparison here, a feature with higher p-value in the NC'' clustering (than the random clustering) is salient in the clustering process. Hence, the clusters are characterized by different values of ϑ_{i} and are more homogeneous with respect to this feature. It is important to note that the features, which were found salient in the clustering process, came out over a large number of runs (thousands). The results of the ANoVA are presented in table 2. Hit-and-run and the number of fatalities are found to be the most salient features in the segmentation. This reinforces the findings above. An additional parameter that was found to be salient is the number of vehicles involved, which also can differentiate between fatal accident with or without pedestrians' involvement.

#### Analysis of the "When" Variables

**FIGURE 5 Frequency of Hit & Runs as a Function of the Frequency of Pedestrian Crashes**

**TABLE 2 Analysis of Variance (ANOVA) Results for “What” Variables Segmentation**

NC' | Random | |||
---|---|---|---|---|

F-stats | P-V | F-stats | P-V | |

Man. coll. | 1.02 | 0.42 | 1.1 | 0.36 |

Harm ev. | 1.03 | 0.42 | 0.74 | 0.87 |

Fatals | 0.44 | 0.78 | 1.22 | 0.3 |

Persons | 0.88 | 0.6 | 0.53 | 0.93 |

Drunk | 2.06 | 0.08 | 1.63 | 0.16 |

Peds. | 0.92 | 0.47 | 0.78 | 0.56 |

Ve. forms | 1.1 | 0.36 | 1.93 | 0.08 |

Ve. total | 0.46 | 0.8 | 1.97 | 0.07 |

Hit run | 0.14 | 0.98 | 1.23 | 0.29 |

**TABLE 3 “When” Variable Segmentation**

For each cluster, the mean and the most common value (mode) of each characteristic are presented: The cluster size (size), YEAR, MONTH, DAY of the Month, Day Of the Week (DOW), HOUR, MINUTE, light condition (LGT) and Weather (WEATHER)

Size | Year | Month | Day | Dow | Hour | Minutes | LGT | Weather | ||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

mean | mode | mean | mode | mean | mode | mean | mode | mean | mode | mean | mode | mode | mode | |

641 | 2008 | 2008 | 6.64 | 9 | 15.64 | 20 | 4.26 | 6 | 14.49 | 18 | 30.83 | 45 | 1 | 1 |

596 | 2008 | 2008 | 6.64 | 4 | 16.84 | 25 | 4.1 | 7 | 14.35 | 1 | 29.31 | 30 | 1 | 1 |

271 | 2008 | 2008 | 6.39 | 5 | 15.38 | 22 | 3.96 | 1 | 15.29 | 2 | 31.17 | 15 | 1 | 1 |

27 | 2008 | 2008 | 7.93 | 8 | 17.96 | 18 | 4.7 | 6 | 17.96 | 17 | 28.3 | 10 | 1 | 1 |

21 | 2008 | 2008 | 7.29 | 1 | 13.52 | 7 | 4.05 | 1 | 12.9 | 3 | 25.86 | 20 | 1 | 1 |

10 | 2008 | 2008 | 7.9 | 11 | 17.2 | 22 | 4.7 | 7 | 14.7 | 23 | 12.2 | 2 | 2 | 1 |

4 | 2008 | 2008 | 8.75 | 7 | 9.5 | 8 | 4 | 2 | 14.75 | 6 | 43.5 | 20 | 3 | 1 |

3 | 2008 | 2008 | 9 | 8 | 10.33 | 2 | 1.67 | 1 | 10.33 | 6 | 18.67 | 2 | 1 | 1 |

**FIGURE 6 Histograms of the When Clusters for Month, Day of the Month (Day), Day of the Week (DOW), and Hour of Crash**

Table 3 details the segmentation results of the 1,573 fatalities considered in this study by their "when" attributes. In order to allow better understanding of the results, figure 6 depicts the histograms of the month, day of the month, day of the week and the reported hour of the crash for the three largest clusters. The evaluation of the histograms suggests that the first cluster (size of 641 fatalities) is characterized by weekend crashes, mainly at night time. This is derived by the DOW and hour histograms. The second largest cluster (size 596) consists of weekday crashes. We conclude this as the third cluster (271 crashes) consists of mainly spring-summer weekend vacation crashes. The third cluster's month histogram shows a significant peak between April and August; its Day histogram has strong periodical peaks that correspond to weekends; and finally its DOW histogram clearly shows that the distribution of the day of the crash, for this cluster, is biased towards the weekend days. Another characteristic that suggest that this cluster is a vocation weekend cluster is that its hour histogram does not have peaks that correspond to the morning or afternoon rush. The hour histogram of the first (largest) two clusters does present these peaks. Another interesting phenomenon that the clusters reveal is that the second week of the month of the first two clusters has fewer numbers of crashes.

**TABLE 4 "Road" Variable Segmentation**

For each cluster, the mean and the most common value (mode) of each characteristic are presented: Cluster size (size), Speed Limit (SPEED), number of lanes (LANES), Roadway Alignment (ALIGN), Roadway Profile (Profile), Trafficway Flow (FLOW), Relation to Junction (REL JUNC), Relation to roadway (REL ROAD), Surface Type (PAVE TYP), Surface condition (SUR COND), Traffic control Device (TRA CONT), Traffic Control Device Functioning (TCF) and Construction / Maintenance Zone (Zone)

Size | Speed | Lanes | Align | Profile | Flow | Rel. junc. | ||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|

mean | mode | mean | mode | mean | mode | mean | mode | mean | mode | mean | mode | |

882 | 50.558 | 55 | 2.822 | 2 | 1.339 | 1 | 1.615 | 1 | 2.141 | 1 | 1.724 | 1 |

501 | 49.553 | 55 | 2.784 | 2 | 1.299 | 1 | 1.715 | 1 | 2.076 | 1 | 1.904 | 1 |

107 | 47.196 | 65 | 2.804 | 2 | 1.299 | 1 | 1.579 | 1 | 2.187 | 1 | 1.738 | 1 |

65 | 45.2 | 35 | 2.662 | 2 | 1.154 | 1 | 1.338 | 1 | 2.062 | 1 | 2.231 | 1 |

10 | 46 | 45 | 2.6 | 2 | 1.1 | 1 | 1.2 | 1 | 2.2 | 2 | 1.6 | 1 |

5 | 54 | 55 | 2.2 | 2 | 1.6 | 2 | 1.4 | 1 | 1.2 | 1 | 1 | 1 |

2 | 47.5 | 40 | 2 | 2 | 1.5 | 1 | 1 | 1 | 1.5 | 1 | 2 | 1 |

1 | 25 | 25 | 2 | 2 | 1 | 1 | 2 | 2 | 1 | 1 | 1 | 1 |

Size | Rel. road | Pave type | Sur. cond. | Tra. cont. | TCF | Zone | ||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|

mean | mode | mean | mode | mean | mode | mean | mode | mean | mode | mean | mode | |

882 | 2.398 | 1 | 1.906 | 2 | 1.129 | 1 | 3.582 | 0 | 0.638 | 0 | 0.023 | 0 |

501 | 2.399 | 1 | 1.982 | 2 | 1.044 | 1 | 3.307 | 0 | 0.669 | 0 | 0.03 | 0 |

107 | 2.084 | 1 | 1.841 | 2 | 1.159 | 1 | 2.037 | 0 | 0.626 | 0 | 0.084 | 0 |

65 | 1.862 | 1 | 1.908 | 2 | 1.092 | 1 | 5.569 | 0 | 1.338 | 0 | 0.015 | 0 |

10 | 1.6 | 1 | 1.9 | 2 | 1.1 | 1 | 3.7 | 0 | 1.2 | 0 | 0.1 | 0 |

5 | 2 | 1 | 2 | 2 | 1.2 | 1 | 16 | 0 | 1.2 | 0 | 0 | 0 |

2 | 1 | 1 | 2 | 2 | 1 | 1 | 1.5 | 0 | 1.5 | 0 | 0 | 0 |

1 | 1 | 1 | 2 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |

For each cluster, the mean and the most common value (mode) of each characteristic are presented: Cluster size (size), Speed Limit (SPEED), number of lanes (LANES), Roadway Alignment (ALIGN), Roadway Profile (Profile), Trafficway Flow (FLOW ), Relation to Junction (REL JUNC), Relation to roadway (REL ROAD), Surface Type (PAVE TYP), Surface condition (SUR COND), Traffic control Device (TRA CONT ), Traffic Control Device Functioning (TCF) and Construction / Maintenance Zone (Zone)

**FIGURE 7 Histograms of the Road Clusters for Speed Limit (Speed), Relation to Junction (Junc) and Control Device (ContDev)**

Having this information in hand, one can further evaluate the crashes within each cluster in different aspects. Such aspects may be the severity of the crash, the drivers' characteristics, and the type of cars involved. As the crashes within a cluster are more homogeneous than the entire data set, this analyses are expected to provide new insights.

#### Analysis of the "Road" Variables

The clustering results of the crashes by the "road" variables are given in table 4 and figure 7. The three largest clusters can be characterized as highway crashes. This notion is based on the speed histograms which, for the first three clusters, are shifted towards the higher speed limits.

The notion is also supported by the histograms of the relation to junction and control device, where the fourth cluster presents larger numbers than the first three in noninterchange-junctions or at traffic signals. Having the data clustered prior to any analysis, as illustrated here, will allow for finding more subtle trends as the crash groups are homogeneous and the noise is filtered out. Table 5 presents the ANoVA results for the road variables. The same variables that were found to play a significant role in the segmentation process in the analysis above are shown to be salient in the ANoVA results as well. The dominant features are the number of lanes, road profile, speed limit and relation to roadway. These findings support the observation that this clustering distinguishes between highway and non-highway accidents.

**TABLE 5 Analysis of Variance (ANOVA) Results for "Road" Variables Segmentation**

NC' | Random | |||
---|---|---|---|---|

F-stats | P-V | F-stats | P-V | |

Speed | 0.4 | 0.96 | 0.64 | 0.83 |

Lanes | 0.81 | 0.58 | 1.86 | 0.07 |

Align | 1.11 | 0.33 | 1.19 | 0.31 |

Profile | 0.09 | 0.96 | 1.16 | 0.33 |

Flow | 3.33 | 0 | 0.29 | 0.94 |

Rel. junc. | 1.43 | 0.15 | 0.85 | 0.58 |

Rel. road | 1.02 | 0.42 | 1.65 | 0.01 |

Pave typ. | 1.03 | 0.39 | 0.43 | 0.79 |

Sur. cond. | 0.87 | 0.51 | 0.34 | 0.91 |

Tra. cont. | 3.38 | 0.02 | 3.08 | 0.03 |

TCF | 1.84 | 0.04 | 1.79 | 0.04 |

Zone | 0.56 | 0.64 | 0.6 | 0.61 |

**TABLE 6 Typical Road Fatality**

Variable | Entire data | Segment 1 | Segment 2 | Segment 3 | Segment 4 | |
---|---|---|---|---|---|---|

Size | 1573 | 422 | 413 | 334 | 176 | |

Month | mean | 6.65 | 6.79 | 6.3 | 6.75 | 6.98 |

mode | 9 | 11 | 2 | 7 | 9 | |

std | 3.42 | 3.52 | 3.46 | 3.38 | 3.26 | |

Day | mean | 16.04 | 16.32 | 16.19 | 16.12 | 14.74 |

mode | 27 | 15 | 10 | 23 | 7 | |

std | 8.74 | 8.68 | 9.12 | 8.4 | 8.54 | |

Hour | mean | 14.66 | 15 | 14.51 | 14.74 | 15.2 |

mode | 18 | 18 | 22 | 17 | 20 | |

std | 14.71 | 14.15 | 13.9 | 15.78 | 15.96 | |

Minute | mean | 30.09 | 30.05 | 29.15 | 31.22 | 31.08 |

mode | 30 | 20 | 30 | 45 | 35 | |

std | 20.12 | 19.83 | 19.43 | 20.75 | 20.04 | |

Ve. total | mean | 1.59 | 1.56 | 1.58 | 1.7 | 1.42 |

mode | 1 | 1 | 1 | 1 | 1 | |

std | 0.89 | 0.82 | 0.85 | 1.03 | 0.69 | |

Persons | mean | 2.72 | 2.63 | 2.7 | 3.07 | 2.38 |

mode | 2 | 2 | 2 | 2 | 2 | |

std | 2.2 | 1.63 | 2.07 | 3.18 | 1.46 | |

Peds. | mean | 0.24 | 0.27 | 0.22 | 0.25 | 0.23 |

mode | 0 | 0 | 0 | 0 | 0 | |

std | 0.53 | 0.54 | 0.5 | 0.63 | 0.46 | |

Harm ev. | mean | 16.82 | 16.2 | 17.35 | 15.95 | 17.7 |

mode | 12 | 12 | 12 | 12 | 12 | |

std | 12.4 | 11.47 | 12.69 | 12 | 13.18 | |

Man. coll. | mean | 1.41 | 1.46 | 1.2 | 1.66 | 1.02 |

mode | 0 | 0 | 0 | 0 | 0 | |

std | 3.31 | 2.28 | 2.08 | 5.78 | 2.05 | |

Rel. junc. | mean | 1.8 | 1.71 | 1.64 | 1.87 | 1.79 |

mode | 1 | 1 | 1 | 1 | 1 | |

std | 3.22 | 1.88 | 1.99 | 2.37 | 2.2 | |

Rel. road | mean | 2.35 | 2.24 | 2.68 | 1.99 | 2.14 |

mode | 1 | 1 | 1 | 1 | 1 | |

std | 5.14 | 4.96 | 6.94 | 1.53 | 1.52 | |

Flow | mean | 2.12 | 2.09 | 2.18 | 2.12 | 2.11 |

mode | 1 | 1 | 1 | 1 | 1 | |

std | 1.55 | 1.67 | 1.39 | 1.57 | 1.6 | |

Lanes | mean | 2.8 | 2.84 | 2.78 | 2.85 | 2.8 |

mode | 2 | 2 | 2 | 2 | 2 | |

std | 1.35 | 1.43 | 1.28 | 1.37 | 1.45 | |

Speed | mean | 49.75 | 48.49 | 50.54 | 49.75 | 49.68 |

mode | 55 | 55 | 55 | 55 | 55 | |

std | 14.68 | 14.52 | 14.62 | 15.43 | 14.78 | |

Align | mean | 1.32 | 1.33 | 1.27 | 1.31 | 1.28 |

mode | 1 | 1 | 1 | 1 | 1 | |

std | 0.94 | 0.94 | 0.78 | 0.94 | 0.92 | |

Profile | mean | 1.63 | 1.54 | 1.77 | 1.56 | 1.65 |

mode | 1 | 1 | 1 | 1 | 1 | |

std | 1.85 | 1.67 | 2.07 | 1.77 | 1.94 | |

Pave typ. | mean | 1.93 | 1.97 | 1.92 | 1.88 | 1.99 |

mode | 2 | 2 | 2 | 2 | 2 | |

std | 0.87 | 0.89 | 1 | 0.59 | 1.12 | |

Sur. cond. | mean | 1.1 | 1.12 | 1.09 | 1.07 | 1.19 |

mode | 1 | 1 | 1 | 1 | 1 | |

std | 0.47 | 0.5 | 0.33 | 0.3 | 0.94 | |

Tra. cont. | mean | 3.51 | 3.25 | 2.75 | 4.13 | 3.6 |

mode | 0 | 0 | 0 | 0 | 0 | |

std | 9.39 | 8.3 | 8.23 | 10.61 | 9.09 | |

TCF | mean | 0.68 | 0.75 | 0.55 | 0.79 | 0.65 |

mode | 0 | 0 | 0 | 0 | 0 | |

std | 1.28 | 1.35 | 1.16 | 1.32 | 1.24 | |

Hit run | mean | 0.08 | 0.09 | 0.07 | 0.07 | 0.1 |

mode | 0 | 0 | 0 | 0 | 0 | |

std | 0.58 | 0.61 | 0.56 | 0.48 | 0.7 | |

LGT | mean | 1.92 | 1.88 | 1.96 | 1.94 | 1.95 |

mode | 1 | 1 | 1 | 1 | 1 | |

std | 1.01 | 1.02 | 0.96 | 1 | 1.2 | |

Zone | mean | 0.03 | 0.03 | 0.04 | 0.01 | 0.04 |

mode | 0 | 0 | 0 | 0 | 0 | |

std | 0.25 | 0.24 | 0.32 | 0.08 | 0.2 | |

Fatals | mean | 1.09 | 1.05 | 1.1 | 1.12 | 1.1 |

mode | 1 | 1 | 1 | 1 | 1 | |

std | 0.41 | 0.23 | 0.33 | 0.58 | 0.54 | |

DOW | mean | 4.15 | 4.27 | 4.08 | 4.21 | 3.88 |

mode | 7 | 6 | 6 | 7 | 1 | |

std | 2.14 | 2.1 | 2.16 | 2.17 | 2.17 | |

Drunk | mean | 0.42 | 0.43 | 0.45 | 0.43 | 0.37 |

mode | 0 | 0 | 0 | 0 | 0 | |

std | 0.64 | 0.65 | 0.64 | 0.61 | 0.65 | |

Ve. forms | mean | 1.54 | 1.52 | 1.52 | 1.62 | 1.37 |

mode | 1 | 1 | 1 | 1 | 1 | |

std | 0.84 | 0.8 | 0.82 | 0.91 | 0.59 | |

Weather | mean | 1.11 | 1.08 | 1.11 | 1.11 | 1.15 |

mode | 1 | 1 | 1 | 1 | 1 | |

std | 0.57 | 0.36 | 0.59 | 0.59 | 0.82 |

**TABLE 7 Analysis of Variance (ANOVA) results for all variables**

NC | Random | |||
---|---|---|---|---|

F-stats | P-V | F-stats | P-V | |

Month | 0.82 | 0.62 | 1.07 | 0.38 |

Day | 0.86 | 0.69 | 0.89 | 0.64 |

Hour | 1.37 | 0.11 | 1.01 | 0.44 |

Minute | 0.94 | 0.6 | 0.99 | 0.49 |

Ve. total | 0.47 | 0.8 | 0.81 | 0.54 |

Persons | 1.05 | 0.4 | 1.2 | 0.26 |

Peds | 1.39 | 0.22 | 1.64 | 0.15 |

Harm ev. | 0.9 | 0.64 | 1.03 | 0.42 |

Man. coll. | 1.14 | 0.33 | 1.29 | 0.24 |

Rel. junc. | 1.14 | 0.33 | 1.83 | 0.05 |

Rel. road | 0.95 | 0.48 | 1.61 | 0.11 |

Flow | 0.4 | 0.88 | 1.1 | 0.36 |

Lanes | 1.09 | 0.37 | 0.72 | 0.65 |

Speed | 1.95 | 0.02 | 0.71 | 0.75 |

Align | 2.73 | 0.07 | 0.88 | 0.41 |

Profile | 0.47 | 0.7 | 0.57 | 0.63 |

Pave typ. | 0.7 | 0.59 | 1.45 | 0.22 |

Sur. cond | 0.52 | 0.8 | 1.11 | 0.35 |

Tra. cont. | 1.66 | 0.07 | 0.46 | 0.94 |

TCF | 4.23 | 0.01 | 0.65 | 0.58 |

Hit run | 0.99 | 0.42 | 1.25 | 0.29 |

LGT | 0.67 | 0.65 | 0.59 | 0.71 |

Zone | 0.12 | 0.95 | 0.32 | 0.81 |

Fatals | 1.41 | 0.23 | 0.66 | 0.62 |

DOW | 0.99 | 0.43 | 1.02 | 0.41 |

Drunk | 0.78 | 0.55 | 0.61 | 0.66 |

Ve. forms | 0.66 | 0.65 | 0.64 | 0.67 |

Weather | 0.6 | 0.7 | 1.65 | 0.15 |

#### Analysis of All Variables

The segmentation procedure creates homogeneous clusters of crashes. Inter- and intra- analysis of these clusters allows for observing subtle trends and small changes. In previous sections the analysis was carried out when different groups of features out of the 28 crash characteristics were considered. In this section we provide the analysis when all 28 parameters are incorporated into the process. Table 6 details the mean, mode and standard deviation (std) of each of the 28 parameter for the entire data set as well as for each of the 4 largest clusters. Evaluating the standard deviations shows that there are several features that exhibit significantly lower values for one or more clusters than the standard deviation of the entire data set. Examples are number of persons for the 4th segment, manner of collision in segments 1, 2 and 4, relation to the junction in all segments, road profile (segments 1 and 3), number of fatalities (segments 1 and 2), number of vehicles involved (segment 4) and weather (Segment 1). Following the statistical law of large numbers, had the clusters were assembled randomly the standard deviation would have been higher for the smaller subsets. Thus a lower standard deviation value for a specific feature of a certain cluster (with respect to the entire data set) signifies that this feature is more homogenous in the cluster. The same findings are obtained through the analysis of variance (ANoVA), which is presented in table 7. Hence the same set of features presents higher P-values for the NC'' clustering comparing to the random-clustering values. These are the features that characterized the cluster and form its common fatality. This information can be then used for further study for extracting new connections between features or to single out a cluster which is more suitable for investigation of a certain phenomenon. Using these lines of research for analyzing crash data is expected to result in new insights and observations.

### Conclusions

Road fatality analysis pose a great challenge as traffic crashes are rare outcomes of events that are confined to a small time-space region. This problem increases when a subset of crashes is considered (e.g., pedestrians or fatal crashes). In this paper we propose to apply a cluster analysis of the crashes prior to any other analysis. Then, each cluster is separately analyzed both internally and with respect to the other clusters. Those inter- intra-relations of the cluster point out trends phenomena and characteristics that are less prominent in the entire data set, and therefore are often overlooked. This procedure is known as mixed models and several such analyses were presented in traffic accident analysis (e.g., Depaire et al. 2008; Lord and Mannering 2010).

For the cluster analysis, we employ a graph- theory-based segmentation algorithm. The algorithm is based on the well-known normalized-cut optimization criterion. The solution is sought through a stochastic approximation scheme. This scheme is a novel extension of a method for solving the minimum-cut problem.

The cluster analysis often results in finding subtle trends and significant causes for traffic fatalities. For example, the method has found a correlation between hit-and-run and pedestrians fatalities, which was not identified by previous studies. An additional output of the research is a description of the typical fatality, which is a result of the segmentation analysis done when all factors that characterized a crash are considered.

Future research may expand the analysis presented here so other features that are recorded in the FARS data set are considered and naturally for analyzing other data sets. An emerging research field in the traffic safety arena is naturalistic driving studies (Guo and Fang 2012; Klauer et al. 2006). These studies are based on collecting continuous streams of different data. The data collected sums up to a sheer amount, which calls for exploratory tools such as the scheme presented here. Due to the efficiency of the presented mechanism, it likely to highlight new insights in the examined big-data and may lead to new findings in highway safety.

**TABLE 8 Accident File Data Fields (C indicates a categorical variable)**

Short name | Description |
---|---|

Man. coll. (C) | Manner Of Collision |

Harm ev. (C) | First Harmful Event |

Fatals | Number of Fatalities In Crash |

Persons | Number of Person Forms Submitted |

Drunk | Number of Drunk Drivers in Crash |

Peds. | Number of Non-Motorist Forms Submitted |

Ve. forms | Number of Vehicle Forms Submitted, Motor Vehicles in Transport |

Ve. total | Number of Vehicle Forms Submitted, Total-Includes Motor Vehicles Not in Transport |

Hit run | Hit-And Run |

Year | Crash Year |

Month | Crash Month |

Day | Crash Day of the Month |

DOW | Day of Week |

Hour | Crash Hour |

Minute | Crash Minute |

LGT (C) | Light Condition |

Weather(C) | Atmospheric Conditions |

Speed | Speed Limit |

Lanes | Number of Travel Lanes Align (C) |

Flow | Trafficway Flow |

Rel. junc. (C) | Relation To Junction |

Rel. road (C) | Relation To Roadway PAVE TYP (C) |

Sur. cond. (C) | Roadway Surface Condition |

TCF (C) | Traffic Control Device Functioning |

Tra. cont. (C) | Traffic Control Device |

Zone (C) | Construction/Maintenance Zone |

### Acknowledgments

The authors would like to thank John E. Baumler for his great help and contribution in devising and implementing the stochastic approximation scheme for normalized cut.

The first author was partially funded by the New-York Metropolitan and the Technion's Security Science and Technology research funds, The German-Israeli Foundation for Scientific Research and Development (GIF) Young Scientist Program, the Technion Center of Excellence in Exposure Science and Environmental Health and the CITI-SENSE project of the 7th European Framework Program (FP7/2007-2013), under grant agreement No.30824.F. The second author acknowledges the support of the UC Berkeley Safe Transportation Research and Education Center (SafeT- REC).

### References

Abdel-Aty, M., and A. Pande, *Crash data analysis: Collective vs. individual crash level approach*, Journal of Safety Research 38 (2007), no. 5, 581 – 587.

Aho, A., J. Hopcroft, and J. Ullman, *Data structure and algorithms*, Boston: Addison-Wesley, 1983.

Alon, N., D. Moshkovitz, and S. Safra, *Algorithmic construction of sets for k-restrictions*, ACM Trans. Algorithms (2006), no. 2, 153–177.

Ang, Q., A. Baddeley, and G. Nair, *Geometrically corrected second order analysis of events on a linear network, with applications to ecology and criminology*, Scandinavian Journal of Statistics 39 (2012), no.4, 591-617.

Bailey, T.C., and A.C. Gatrell, *Interactive spatial data analysis*, Longman, Harlow, 1995.

Black, W.R. *Highway accidents: a spatial and temporal analysis*, Transportation Research Record 1318 (1991), 75–82.

Cressie, N., *Statistics for spatial data*, John Wiley & Sons, New York., 1993.

Depaire, B., G. Wets, and K. Vanhoof, *Traffic accident segmentation by means of latent class clustering*, Accident Analysis & Prevention 40 (2008), no. 4, 1257 – 1266.

Diggle, P.J. *Spatial analysis of spatial point patterns*, Academic Press, New York., 1983.

Ford, L.R., and D.R. Fulkerson, *Maximal flow through a network*, Canadian Journal of Math. 8 (1956), no. 3, 339 – 404.

Fotheringham, A., C. Brunsdon, and M. Charlton, *Quantitative geography: Perspectives on spatial data analysis, *Sage Publication, London., 2000.

Frhwirth-Schnatter, S., *Finite mixture and markov switching models*, Springer Series in Statistics, Springer, New York, 2006.

Getis, A., and J. Franklin, *Second-order neighborhood analysis of mapped point patterns, Perspectives on Spatial Data Analysis *(Luc Anselin and Sergio J. Rey, eds.), Advances in Spatial Science, Springer Berlin Heidelberg, 2010, pp. 93–100.

Golob, T.F., and W.W. Recker, *Relationships among urban freeway accidents, traffic flow, weather, and lighting conditions*, Journal of Transportation Engineering 129 (2003), no. 4, 342–353.

Golob, T.F., and W.W. Recker, *A method for relating type of crash to traffic flow characteristics on urban freeways*, Transportation Research Part A: Policy and Practice 38 (2004), no. 1, 53 – 80.

Griswold, J., B. Fishbain, S. Washington, and D.R. Ragland, *Visual assessment of pedestrian crashes*, Accident Analysis & Prevention 43 (2011), no. 1, 301 – 306.

Guo, F., and Y. Fang, *Individual driver risk assessment using naturalistic driving data*, Accident Analysis & Prevention (2012).

Hauer, E. *On the estimation of the expected number of accidents*, Accident Analysis & Prevention 18 (1986), no. 1, 1– 12.

Hogg, R. V., and J. Ledolter, *Engineering statistics*, New York: MacMillan, 1987.

Karger, D.R., and C. Stein, *A new approach to the mini**mum cut problem*, J. ACM 43 (1996), no. 4, 601–640.

Karp, R. M., *Complexity of computer computations*, Reducibility Among Combinatorial Problems, p. 85103, New York: Plenum, 1972.

Kendall, M.G., *A new measure of rank correlation*, Biometrika 30 (1938), no. 1-2, 81.

Kingham, S., C.E. Sabel, and P. Bartie, *The impact of the school run on road traffic accidents: A spatio*-*temporal analysis*, Journal of Transport Geography 19 (2011), no. 4, 705 – 711.

Klauer, S.G., T.A. Dingus, V.L. Neale, JD Sudweeks, and DJ Ramsey, *The impact of driver inattention on near crash/crash risk: An analysis using the 100-car naturalistic driving study data*, Tech. report, 2006.

Lee, C., B. Hellinga, and F. Saccomanno, *Real-time crash prediction model for application to crash prevention in freeway traffic*, Transportation Research Record 1840 (2007), 67–77.

Li, X., D. Lord, Y. Zhang, and Y. Xie, *Predicting motor vehicle crashes using support vector machine models*, Accident Analysis & Prevention 40 (2008), no. 4, 1611– 1618.

Lord, D., and F. Mannering, *The statistical analysis of crash-frequency data: A review and assessment of methodological alternatives*, Transportation Research Part A: Policy and Practice 44 (2010), no. 5, 291 – 305.

Ma, J., and K.M. Kockelman, *Bayesian multivariate Poisson regression for models of injury count, by severity*, Transportation Research Record: Journal of the Transportation Research Board 1950 (2006), no. 1, 24–34.

McGuigan, D.R.D. *The use of relationships between road accidents and traffic flow in black-spot identification*, Traffic Engineering and Control 22 (1981), 448 – 453.

Miaou, S., and H. Lum, *Modeling vehicle accidents and highway geometric design relationships*, Accident Analysis & Prevention 25 (1993), no. 6, 689 – 709.

Miaou, S., and J.J. Song, *Bayesian ranking of sites for engineering safety improvements: Decision parameter, treatability concept, statistical criterion, and spatial dependence*, Accident Analysis & Prevention 37 (2005), no. 4, 699 – 720.

Mussone, L., A. Ferrari, and M. Oneta, *An analysis of urban collisions using an artificial intelligence model*, Accident Analysis & Prevention 31 (1999), no. 6, 705 – 718.

National Highway Traffic Safety Administration, Fatality analysis reporting system: Fatal crash data overview, U.S. Department of Transportation Publication Washington, DC, 2004, DOT HS 809726.

Persaud, B.N., *Estimating accident potential of Ontario road sections*, Transportation Research Record 1327 (1991), 4753.

Plug, C., J.C. Xia, and C. Caulfield, *Spatial and tempo**ral visualization techniques for crash analysis*, Accident Analysis & Prevention 43 (2011), no. 6, 1937 – 1946.

Rifaat, S. M., and H. C. Chin, *Accident severity analysis using ordered Probit *model, Journal of Advanced Transportation 41 (2007), no. 1, 91–114.

Shi, J.B., and J. Malik, *Normalized cuts and image segmentation*, IEEE Transactions On Pattern Analysis and Machine Intelligence 22 (2000), no. 8, 888–905.

Shino, S., *Analysis of a distribution of point events using the network-based quadrat method*, Geographical Analysis 40 (2008), no. 4, 380–400.

Steenberghen, T., K. Aerts, and I. Thomas, *Spatial clustering of events on a network, *Journal of Transport Geography 18 (2010), no. 3, 411 – 418

Steenberghen, T., T. Dufays, I. Thomas, and B. Flahaut, *Intra-urban location and clustering of road accidents using GIS: a Belgian example*, International Journal of Geographical Information Science 18 (2004), no. 2, 169-181.

Sullivan, J.M., and M.J. Flannagan, *Characteristics of pedestrian risk in darkness*, http://hdl.handle.net/2027.42/49450, 2001, UMTRI-2001-33, PB2002-101216.

Tay, R., U. Barua, and L. Kattan, *Factors contributing to hit-and-run in fatal crashes*, Accident Analysis & Prevention 41 (2009), no. 2, 227 – 233.

Tay, R., S. M. Rifaat, and H. C. Chin, *A logistic model of the effects of roadway, environmental, vehicle, crash and driver characteristics on hit-and-run crashes*, Accident Analysis & Prevention 40 (2008), no. 4, 1330 – 1336.

Vlahogianni, E. I., M. G. Karlaftis, and J. C. Golias, *Spatio-temporal short-term urban traffic volume forecasting using genetically optimized modular networks*, Computer-Aided Civil and Infrastructure Engineering 22 (2007), no. 5, 317–325.

Vogt, A., and J. Bared, *Accident models for two-lane rural segments and intersections*, Transportation Research Record: Journal of the Transportation Research Board 1635 (1998), no. 1, 18–29.

Wong, S.C., N.N. Sze, and Y.C. Li, *Contributing factors to traffic crashes at signalized intersections in Hong Kong*, Accident Analysis & Prevention 39 (2007), no. 6, 11071113.

Xie, Y., D. Lord, and Y. Zhang, *Predicting motor vehicle collisions using bayesian neural network models: An empirical analysis*, Accident Analysis & Prevention 39 (2007), no. 5, 922 – 933.

Xie, Z., and J. Yan, *Kernel density estimation of traffic accidents in a network space*, Computers, Environment and Urban Systems 32 (2008), no. 5, 396 – 406.

Yamada, I., and J. Thill, *Comparison of planar and network K-functions in traffic accident analysis*, Journal of Transport Geography 12 (2004), no. 2, 149 – 158.

Zhang, Y., and Y. Xie, *Forecasting of short-term freeway volume with <i>v</i>-support vector machines*, Transportation Research Record: Journal of the Transportation Research Board 2024 (2008), no. 1, 92–9