Probe Data and Transportation Statistics
Probe data are generated by a technological device that is either carried by individuals or located on or in vehicles, vessels, or other conveyances. The devices that produce these data actively determine and emit information about their locations. That information—when responsibly monitored, collected, stored, and analyzed—can reveal the movement of the probe device through space and across time, thereby providing innovative opportunities to measure the transportation system and generate national-level statistics on the mobility of people, vehicles, and goods in the United States.
BTS, in response, explored the use, quality, and availability of these data, serving its mission to promote innovative methods of data collection, analysis, visualization, and dissemination to provide timely, accurate, and credible information on the U.S. transportation system.
The primary output from this project is a set of six short-form user guides, each on a specific form of probe data available to the transportation sector: disaggregate location-based services (LBS) data; aggregate LBS data, including point-of-interest (POI) data; cell phone call records; connected vehicles (CVs); commercial vehicle electronic logging devices (ELDs); and freight truck fleet GPS.
Each guide has the same sections, each covering a different facet of the data type:
- Advantages
- Limitations
- Vendors
- Audiences
- Scale of availability
- Technical details
- Generation methods and triggers
- Origins and background
- Contents, sample penetration, and quality
- Spatial and temporal coverage and resolution
- Use cases
The findings of this project, templated to facilitate quick comparisons across data types, are intended for transportation planners, researchers, policy makers, and other public officials (including those at BTS and any other public agency) interested in using or procuring probe-derived statistical products. Links to additional research on each data type are included in each guide for users who want to learn more.
Location-based services (LBS) data comprise location sightings that are derived from the usage of location-enabled mobile applications (apps) on smartphones and/or cellular-enabled tablets. In other words, LBS data encompass the geographic information generated when mobile apps request position updates. These data are collected through software development kits (SDKs) embedded in mobile apps, creating an array of location pings that can be triggered by user interactions, background processes, or automated system functions. Unlike data sources that are purpose-built for transportation, LBS data are a byproduct of broader commercial mobile app functionality.
A fundamental limitation of LBS data is their opportunistic nature—location information is captured whenever apps require positioning for their core functionality, whether for navigation, advertising, social media check-ins, or weather services. This behavior creates highly irregular spatiotemporal patterns that vary dramatically based on individual user behavior, device settings, and app-usage patterns.
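The irregularity described above can be made concrete by measuring the time between consecutive pings for each device. The following is a minimal sketch, not a vendor's method: the `(device_id, unix_timestamp)` input layout, the function name, and the sample values are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median


def ping_gaps_by_device(pings):
    """Return {device_id: [seconds between consecutive pings]}.

    `pings` is an iterable of (device_id, unix_timestamp) tuples in any
    order. This input layout is hypothetical, chosen for illustration.
    """
    times = defaultdict(list)
    for device_id, ts in pings:
        times[device_id].append(ts)
    gaps = {}
    for device_id, stamps in times.items():
        stamps.sort()
        gaps[device_id] = [b - a for a, b in zip(stamps, stamps[1:])]
    return gaps


# Hypothetical sample: one device pings every few seconds during navigation
# and then goes silent for two hours; another pings only twice all day.
sample = [
    ("dev_a", 0), ("dev_a", 5), ("dev_a", 10), ("dev_a", 7210),
    ("dev_b", 100), ("dev_b", 40000),
]
gaps = ping_gaps_by_device(sample)
for device_id, device_gaps in sorted(gaps.items()):
    print(device_id, "median gap (s):", median(device_gaps),
          "max gap (s):", max(device_gaps))
```

A wide spread between the median and maximum gap for the same device is one simple signal of the opportunistic sampling that makes LBS traces hard to compare across users.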
The LBS data marketplace has experienced significant volatility due to evolving privacy regulations and platform (i.e., software ecosystem) restrictions. Major changes to smartphone operating systems' privacy controls have substantially reduced data availability. Additionally, high-profile privacy concerns and regulatory scrutiny have led some major apps to discontinue reselling location data entirely. As a result, data availability varies by app, original equipment manufacturer (OEM), and carrier.
Aggregate location-based services (LBS), navigation, and point-of-interest (POI) probe data products (“aggregate products”) summarize temporal–location information that is originally collected from a probe at a finer, disaggregate spatial scale. These products are only available to end users in an aggregate format. They can rely on any form of probe source data as long as the data come with spatiotemporal information that can be meaningfully summarized vis-à-vis a larger study area or site. As such, they depend on whichever location-detection technology was used to generate their source of probe data (e.g., LBS, connected-vehicle [CV], and Global Positioning System [GPS] data from navigation applications [apps] and platforms; synthetic data; and transaction counts tied to specific business locations).
For these products, aggregation is performed against sightings (e.g., GPS, LBS, financial or commercial transactions), which are summarized against provided or known areas of interest, whether they be polygonal geographic areas (e.g., states, traffic analysis zones [TAZs], metropolitan planning organizations), linear features (e.g., road networks, rail networks), or targeted points (e.g., businesses, addresses). Examples of aggregate outputs include device or vehicle counts associated with highway network links, origin–destination (OD) matrices mapped to geographic polygons, or business visitation counts tied to building footprints.
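As a rough illustration of aggregating sightings against polygonal areas, the sketch below counts unique devices per zone. It assumes rectangular zones defined by bounding boxes, which real products would replace with true polygon geometry; the zone names, coordinates, and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical zones as (min_lon, min_lat, max_lon, max_lat) rectangles.
# Real zones (e.g., TAZs) are polygons and need a point-in-polygon test.
ZONES = {
    "zone_1": (-77.10, 38.80, -77.00, 38.90),
    "zone_2": (-77.00, 38.80, -76.90, 38.90),
}


def zone_for(lon, lat, zones=ZONES):
    """Return the ID of the first zone containing the point, else None."""
    for zone_id, (min_lon, min_lat, max_lon, max_lat) in zones.items():
        if min_lon <= lon < max_lon and min_lat <= lat < max_lat:
            return zone_id
    return None


def unique_devices_per_zone(sightings, zones=ZONES):
    """Count distinct devices sighted in each zone.

    `sightings` is an iterable of (device_id, lon, lat) tuples.
    """
    seen = defaultdict(set)
    for device_id, lon, lat in sightings:
        zone_id = zone_for(lon, lat, zones)
        if zone_id is not None:
            seen[zone_id].add(device_id)
    return {zone_id: len(devices) for zone_id, devices in seen.items()}


sightings = [
    ("dev_a", -77.05, 38.85),
    ("dev_a", -77.04, 38.86),  # same device, same zone: counted once
    ("dev_b", -77.05, 38.81),
    ("dev_b", -76.95, 38.85),
    ("dev_c", -75.00, 40.00),  # outside every zone: dropped
]
counts = unique_devices_per_zone(sightings)
print(counts)
```

Only the per-zone counts leave this step, which is why an end user of an aggregate product never sees the individual sightings.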
The following are predominant examples of aggregate probe data products:
- LBS data are aggregated to a geographic polygon or developed into an OD matrix within a specified date or time range.
- Navigation data are aggregated to roadway segments and include metrics about speed by mode (truck or car).
- POI data include activity patterns within preset business locations and/or parking lots as well as breakouts of the places each observed device visited immediately before and after that location.
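An OD matrix of the kind listed above can be sketched by ordering each device's zone visits in time and counting transitions between distinct zones. This is a simplified illustration under stated assumptions: the `(device_id, timestamp, zone_id)` layout is hypothetical, and every zone change is treated as a trip, whereas vendors typically apply dwell-time and trip-end rules that are not described here.

```python
from collections import Counter
from itertools import groupby


def od_matrix(visits, start=None, end=None):
    """Count zone-to-zone transitions per device within [start, end).

    `visits` is an iterable of (device_id, timestamp, zone_id) tuples.
    Repeated sightings in the same zone are collapsed before counting.
    """
    in_range = [
        v for v in visits
        if (start is None or v[1] >= start) and (end is None or v[1] < end)
    ]
    ordered = sorted(in_range, key=lambda v: (v[0], v[1]))
    od = Counter()
    for _, group in groupby(ordered, key=lambda v: v[0]):
        zones = [zone for _, _, zone in group]
        collapsed = [z for i, z in enumerate(zones) if i == 0 or z != zones[i - 1]]
        od.update(zip(collapsed, collapsed[1:]))
    return od


visits = [
    ("dev_a", 1, "A"), ("dev_a", 2, "A"), ("dev_a", 3, "B"), ("dev_a", 4, "A"),
    ("dev_b", 1, "A"), ("dev_b", 5, "B"),
]
od = od_matrix(visits)
for (origin, dest), trips in sorted(od.items()):
    print(origin, "->", dest, trips)
```

Passing `start` and `end` restricts the matrix to a date or time window, mirroring how these products are scoped.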
The primary advantage of aggregate products is that they do not require an end user to acquire and process the enormous amounts of disaggregate probe data that would otherwise be needed to produce commensurate outputs. Vendors of these products can customize outputs to the specific needs of customers, who get to use probe data-derived insights without needing data science resources. Aggregate products sold by vendors may also incorporate disaggregate data not otherwise available to the public since their end products are aggregate and will inherently mask any private or sensitive information.
The main downside to these products, however, is that their vendors do not have to disclose anything about the underlying disaggregate data to their buyers, an inherent constraint on the usability of their products when compared with analyses done directly on or with disaggregate data. The acquisition of aggregate products, which may significantly reduce computational burden on a customer, inherently comes with less control over the generation of probe-based statistics. This trade-off remains for vendors that aggregate the data underlying their products themselves as well as those that resell or repurpose other vendors’ products.
Because end users of aggregate products are not guaranteed access to the raw or disaggregate data points that contributed to the aggregate outputs that they purchased, they are beholden to whatever documentation on processing methods, sample sizes, or quality metrics the vendor makes available. Vendors typically provide limited transparency into how they process, aggregate, or derive analytics from the underlying data sources, and little is known about sample sizes over time or descriptive statistics, such as data composition.
Cell phones directly produce two types of probe data, both of which leverage the spatiotemporal information derivable from the relative strength of a phone’s connection to cell towers owned by its corresponding mobile phone carrier: event-driven call detail records (CDRs) and network-driven passive-signaling data. Both data types are available from any type of mobile phone, a side effect of mobile phone carriers’ need to implement usage-based customer billing, optimize networks, and improve 911 emergency response rates. While these location-detection methods bear some resemblance to the Global Positioning System (GPS)-based procedures underlying location-based services (LBS) sightings from smart phones, they are performed by the phone itself and do not depend on software-based applications (apps).
Developed in the 2000s and 2010s, cellular probe data, a collective term for both CDRs and passive-signaling data, were one of the earliest examples of nontransportation-specific big spatial data being harnessed for transportation-planning and ‑modeling purposes. As they proxy human movement vis-à-vis the traces left by people as they use their cell phones over the course of their daily lives, transportation specialists saw them as a novel, nonsurvey-based method to measure mobility patterns and flows. Cell phone companies, who primarily harvest these data from their devices to track their customers as part of their standard business practices, began selling them to transportation-oriented analytics firms, who converted, enriched, and sold them as transportation statistical products for planners and other mobility professionals.
Such work continued until 2018, when major mobile phone carriers abruptly stopped sharing data with analytics vendors due to privacy, policy, and public relations concerns. The analysts previously using these cell network–derived inputs subsequently switched to LBS location data that, while still sourced from cell phones, rely on more high-precision, GPS-derived position data emitted from the apps running on these devices. While the former inputs were provided by mobile phone carriers, the latter are typically sourced from smart phone app developers, who can theoretically pull data from devices from multiple carriers.
Despite these developments, mobile phone carriers are still presumed to produce and log cellularly derived probe data for internal purposes. As such, they remain a hypothetical alternative data source for regional and national trip analytics, an inference supported by the continued existence of transportation-oriented statistical products and research using these data in international markets, where they remain available. In fact, a return to cellular probe data in the United States is made more attractive as LBS data become less available, constrained by stricter privacy settings and a growing number of app developers opting not to resell user location data.
Extracting location information from these forms of cellular probe data remains technically complex and involves working with data forms and standards unfamiliar to those outside the telecommunications industry. The only people with direct access to cellular probe data messages in their original form are data analysts at mobile companies and employees at analytics firms. Unaltered cellular probe data are not available to the public. The methodology for extracting location information from these messages, a task performed by the analytics companies, is also unknown to the public.
These challenges have been exacerbated by the fact that U.S.-based transportation planners have not worked extensively with CDRs or passive-signaling data since 2018, meaning that assumptions about the data, as well as the corresponding techniques for analyzing them, are potentially out of date and not usable with newer call records. For example, call records now usually incorporate satellites for location detection in ways not done before.
Fortuitously, standards for CDRs, the more widely known of the two forms of cellular location data, are still maintained and developed, and some technology vendors still implement them. As such, information on cellular location data is still knowable, even if the data are primarily used outside the United States; some literature, for instance, finds that CDRs, which describe interactions between a phone and a cellular network for a call, text, or data request, can be generated via network-management events.
Connected vehicle (CV) data are generated, collected, and transmitted by vehicles equipped with onboard systems and sensors that enable them to exchange data with other vehicles, infrastructure, cloud services, and mobile devices. These communications produce data with records of events that occur while the car is in use (e.g., tire pressure warnings, vehicle on/off events, airbag inflations, impacts, lane deviation warnings).
Car manufacturers play a pivotal role in the CV data ecosystem. They install small computers in vehicles that collect and send out data about the vehicle in standardized formats. The data in these messages are then ingested and interpreted by (1) other vehicles and (2) devices embedded in roadside infrastructure (i.e., roadside units [RSUs] attached to infrastructure, such as traffic signals).
The data produced through these interactions, in turn, contain information that can be used to better understand and illustrate many important facets of the transportation system (vehicle trajectory; vehicle operations; make and model; and safety-related items, such as tire pressure warnings and hard braking events) that are largely unavailable from other data sources at the same scale and precision.
When a CV is on and operating, for example, it continuously generates data packets containing the following information:
- Vehicle dynamics (speed, heading, acceleration)
- Vehicle safety status (brake status, stability control)
- Precise positioning information
- System operational parameters
The contents of these data packets are easily digestible and, therefore, transparently understood by planners and statisticians with access to the data, an outcome of standardized CV data architectures and methodologies developed and documented jointly by the U.S. Government and private companies.
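The packet contents listed above can be sketched as a simple record type, along with one safety-related filter. The field names, units, and the deceleration threshold below are assumptions for illustration only and do not reproduce any published CV message standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CVPacket:
    """Hypothetical CV data packet; field names and units are illustrative."""
    timestamp_s: float
    lat: float                      # precise positioning
    lon: float
    speed_mps: float                # vehicle dynamics
    heading_deg: float
    accel_mps2: float
    brake_applied: bool             # vehicle safety status
    stability_control_active: bool


def hard_braking_events(packets, threshold_mps2=-3.0):
    """Return packets with brakes applied and deceleration at or beyond the threshold.

    The -3.0 m/s^2 default is an assumed cutoff, not a regulatory value.
    """
    return [
        p for p in packets
        if p.brake_applied and p.accel_mps2 <= threshold_mps2
    ]


packets = [
    CVPacket(0.0, 38.9000, -77.0300, 20.0, 90.0, 0.2, False, False),
    CVPacket(0.1, 38.9000, -77.0299, 19.5, 90.0, -4.5, True, False),
    CVPacket(0.2, 38.9000, -77.0298, 19.0, 90.0, -1.0, True, False),
]
events = hard_braking_events(packets)
print(len(events), "hard-braking packet(s)")
```

Because each packet carries its own position and timestamp, a flagged event can be mapped directly to a location on the network.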
While CV data have existed since the first vehicles equipped with the requisite technology went into operation in 2015, they remain an emerging probe data source, especially because only one known vehicle maker provides the disaggregate data packets to vendors, agencies, and research organizations for development into public-facing products. While alternative CV data-collection methods exist for technology developers with access to proprietary devices installed in CVs, these sources are not as readily accessible to practitioners. Likewise, CVs are not a representative sample of all vehicles in every region. Overrepresentation occurs in regions with a greater preference for CV-enabled vehicle makes and models.
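One common way to reason about uneven representation is to compare observed CVs with a known vehicle total in each region and derive an expansion weight. The sketch below is a generic illustration, not a method described in the guides: the region names and counts are made up, and a simple ratio weight ignores differences in who drives CV-enabled vehicles.

```python
def expansion_weights(cv_counts, registered_vehicles):
    """Return {region: registered vehicles per observed CV}.

    Regions with no observed CVs are skipped because no weight can be formed.
    Both inputs are hypothetical dictionaries keyed by region name.
    """
    weights = {}
    for region, observed in cv_counts.items():
        if observed > 0:
            weights[region] = registered_vehicles[region] / observed
    return weights


# Hypothetical counts: the same fleet size, very different CV penetration.
cv_counts = {"region_x": 50, "region_y": 10, "region_z": 0}
registered = {"region_x": 1000, "region_y": 1000, "region_z": 1000}

weights = expansion_weights(cv_counts, registered)
for region, weight in sorted(weights.items()):
    print(region, "each observed CV stands in for", weight, "vehicles")
```

A region with a large weight rests on few observations, so statistics for it deserve more caution than those for a well-covered region.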
An Electronic Logging Device (ELD) is a piece of hardware that records a commercial motor vehicle (CMV) driver’s driving time to automatically maintain their hours-of-service (HOS) records. The Federal Motor Carrier Safety Administration (FMCSA) mandates ELDs for most motor carriers and drivers, including those operating commercial buses, who are required to maintain Records of Duty Status (RODS), the logs that record their HOS. Intended to promote driver safety and ensure compliance with regulations, an ELD automatically records a driver’s RODS, eliminating the need for manual paper logs and promoting easier, more accurate HOS recordkeeping. These records, which form a type of probe data, are conceptually rich with information for transportation planners.
CMVs, the focus of the ELD requirement, are defined as vehicles used as part of a business that is involved in interstate commerce and that fit any of the following descriptions:
- Weighs 10,001 pounds or more
- Has a gross vehicle weight rating or gross combination weight rating of 10,001 pounds or more
- Is designed or used to transport 16 or more passengers (including the driver) not for compensation
- Is designed or used to transport 9 or more passengers (including the driver) for compensation
- Is transporting hazardous materials in a quantity requiring placards
The HOS regulations that necessitate ELDs define the maximum amount of time drivers are permitted to be on duty (including driving time) and specify the number and length of rest periods necessary to help ensure drivers stay awake and alert. In general, all motor carriers and drivers operating CMVs must comply with the HOS regulations found in Title 49 Code of Federal Regulations Part 395.
ELDs automate and enhance the accuracy of HOS recordkeeping, replacing traditional, driver-kept paper logs. Digitizing these records has improved safety, prevented fatigue-related accidents, and streamlined the tracking and management of driver hours. In generating accurate and tamper-proof records for a driver, ELDs reduce the risk of falsified logs and simplify compliance for drivers and motor carriers (enforced via spot and regular audits), ultimately creating a safer road environment for all users.
Critical to transportation planners and statisticians, each ELD is embedded with a Global Positioning System (GPS) unit, which enriches individual records with location information. In fact, nearly all events the ELD records (e.g., trips and stop events) inherently have a geographic component, so the spatial data produced by the unit (and appended to all messages it generates) contextualize and confirm the logged events and even note other events that might have been forgotten or missed by the driver when relying on memory alone. While these methods are a digital evolution of earlier, paper-based record keeping, drivers still retain access to the records produced by their ELDs.
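To show how location-stamped duty-status events could feed an HOS-style summary, the sketch below totals driving time from a sequence of status changes. It is a conceptual illustration only: the event layout and the status codes ("ON", "D", "OFF") are assumed, and disaggregate ELD messages are not publicly available to confirm their actual structure.

```python
from datetime import datetime


def driving_hours(events):
    """Total hours spent in driving status.

    `events` is a list of dicts with hypothetical keys "time" (datetime),
    "status" (str), "lat", and "lon". Each status lasts until the next
    event; the final event has no end time, so it contributes nothing.
    """
    ordered = sorted(events, key=lambda e: e["time"])
    total_s = 0.0
    for current, nxt in zip(ordered, ordered[1:]):
        if current["status"] == "D":
            total_s += (nxt["time"] - current["time"]).total_seconds()
    return total_s / 3600.0


# Hypothetical one-day log with a GPS fix attached to every status change.
log = [
    {"time": datetime(2024, 1, 8, 6, 0), "status": "ON", "lat": 39.29, "lon": -76.61},
    {"time": datetime(2024, 1, 8, 6, 30), "status": "D", "lat": 39.29, "lon": -76.61},
    {"time": datetime(2024, 1, 8, 10, 30), "status": "OFF", "lat": 40.27, "lon": -76.88},
    {"time": datetime(2024, 1, 8, 11, 0), "status": "D", "lat": 40.27, "lon": -76.88},
    {"time": datetime(2024, 1, 8, 14, 0), "status": "OFF", "lat": 40.44, "lon": -79.99},
]
print("Driving hours:", driving_hours(log))
```

The same events, read for their coordinates rather than their durations, yield trip origins, stop locations, and rest locations of interest to planners.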
However, FMCSA’s ELD rules exempt some drivers in some situations, meaning the events that would have been logged from these types of trucks are not available for analysis.
ELD-generated probe data, in their disaggregate, original forms, are not readily available to public consumers, although some are available through aggregate products created by the ELD vendors and analytics companies with which they have data-sharing agreements. The underlying disaggregate ELD messages, which are collected and stored by the company that owns the trucks in which the devices are installed (in collaboration with the vendor of the device), are currently only available to outside entities in response to an FMCSA- or a state-directed enforcement of HOS laws. As such, this discussion of ELD-based probe data as an input for transportation statistics is primarily conceptual, especially for metrics that would rely on ELD message attributes available only from data in their disaggregate form.
Fleet GPS (Global Positioning System) data are a form of probe data generated by systems of hardware and software installed on and connected to a trucking fleet’s tractors. Fleet owners rely on these data, which are generated via onboard diagnostics (OBD) units that can be embedded in their trucks at manufacture or installed in cabs already on the road, to provide near-real-time information to track, monitor, and manage their vehicles and assets. The primary output of OBD units is GPS-derived location information, produced as a series of messages (also commonly known as “pings” or “traces”) indicating the precise location of the truck at a given timestamp. These messages are often accompanied by information uniquely identifying the truck or driver and by vehicle-sourced telematics data summarizing other aspects of the truck tractor’s performance (e.g., speed, harsh driving events, fuel consumption, and engine diagnostics).
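When pings carry only a position and a timestamp, an average speed between consecutive pings can be estimated from the great-circle distance. The sketch below uses the haversine formula with a spherical Earth; the `(unix_timestamp, lat, lon)` tuple layout and the function names are assumptions, and straight-line distance understates travel along curved roads.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def segment_speeds_mps(pings):
    """Average speed (m/s) between consecutive pings of one truck.

    `pings` is a list of (unix_timestamp, lat, lon) tuples. Pairs with a
    non-positive time difference are skipped.
    """
    ordered = sorted(pings)
    speeds = []
    for (t1, lat1, lon1), (t2, lat2, lon2) in zip(ordered, ordered[1:]):
        dt = t2 - t1
        if dt > 0:
            speeds.append(haversine_m(lat1, lon1, lat2, lon2) / dt)
    return speeds


# Hypothetical pings: one degree of latitude covered in one hour.
speeds = segment_speeds_mps([(0, 0.0, 0.0), (3600, 1.0, 0.0)])
print([round(s, 2) for s in speeds])
```

Where the OBD unit already reports speed, this calculation serves mainly as a consistency check on the reported values.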
Third-party manufacturers produce and sell OBD units, which contain the hardware and software needed to transmit data from a vehicle to its corresponding receiver. The same manufacturer may additionally maintain a secure, cloud-based fleet-management platform to store, analyze, and display the transmitted information, thereby supplying fleet owners with real-time and historic spatiotemporal information on each of their drivers and vehicles. Uses for these data vary by owner but may include enhancing operational efficiency, reducing costs through fuel and maintenance optimization, improving driver safety, ensuring regulatory compliance, and bolstering security against theft and unauthorized use.
While these data are primarily for internal consumption at each fleet, many companies realize their value for transportation planners and engineers (e.g., observed truck speeds, stop locations, and trip patterns) and have sought to capitalize on this market. As a result, fleet owners work with their OBD unit provider to anonymize their data (i.e., remove business-sensitive elements in their data related to their trucks and drivers, including any personally identifiable information [PII]) and sell them to transportation analytics companies and planners, who use them to build statistical products for public consumption. While the end user sees less information on each truck than the corresponding fleet owner, the remaining GPS points and attributes are a reservoir of data on freight mobility and fluidity for modeling, visualizations, policies, planning, and operations that are entirely unavailable from any other source.
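The masking step described above might look something like the following sketch, which drops business-sensitive fields and replaces the truck identifier with a salted hash. The field names, the salt handling, and the function name are hypothetical, and the actual procedures used by fleet owners and OBD providers are not documented here.

```python
import hashlib

# Hypothetical fields treated as business-sensitive or personally identifiable.
SENSITIVE_FIELDS = {"driver_name", "carrier_name", "vin"}


def anonymize_ping(ping, salt):
    """Return a copy of `ping` without sensitive fields or the raw truck ID.

    The truck ID is replaced by a truncated SHA-256 token so that pings from
    the same truck can still be linked into a trace.
    """
    digest = hashlib.sha256((salt + ping["truck_id"]).encode("utf-8")).hexdigest()
    cleaned = {
        key: value for key, value in ping.items()
        if key not in SENSITIVE_FIELDS and key != "truck_id"
    }
    cleaned["truck_token"] = digest[:16]
    return cleaned


ping = {
    "truck_id": "T-100",
    "driver_name": "Jane Doe",
    "timestamp": 1700000000,
    "lat": 39.0,
    "lon": -76.9,
    "speed_mph": 58,
}
masked = anonymize_ping(ping, salt="demo-salt")
print(sorted(masked))
```

Replacing identifiers in this way is pseudonymization rather than full anonymization, since a sufficiently detailed trace can still reveal a depot or a home base.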
Both flavors of fleet GPS probe data are summarized here: the more detailed, business-sensitive information used internally by trucking fleets and the anonymized and masked data points available to external consumers. The origins, uses, and applications of fleet GPS data are much broader than those for Electronic Logging Devices (ELDs), which—despite having a similar reliance on GPS-enabled devices embedded into trucks—singularly produce data to ensure truckers’ compliance with hours-of-service (HOS) regulations.
BTS truck fleet GPS data guide (ROSA-P)
Example statistics:
BTS uses freight truck GPS probe data for one experimental, statistical program: