Chapter 2 - Planning Data Systems

Wednesday, December 21, 2011

Data systems produced within a DOT agency are created to fulfill user needs. Users may be inside or outside DOT. The data are compiled to satisfy an external user need, to measure success toward a strategic goal (internal user), or to serve as a tool needed to perform work toward a goal (internal user). Data system planning consists of four stages: collecting user needs, developing objectives for the system, translating those objectives into data requirements, and planning the top-level methods that will be used to acquire the data.

Figure 1

2.1 Data System Objectives

Principles

A "data system" is any collection of information that is used as a source by any Government entity to disseminate information to the public, along with the planning, collection, processing, and evaluation of that information. A data system can cover any combination of information treated as a single system for the sake of documentation and other guideline issues.

The "system owner" as used in these guidelines is the organizational entity whose strategic plan and budget will guide the creation or continued maintenance of the data system.

"Users" of a data system are people or organizations who use information products that incorporate data from the system, either in raw form or in statistics. "Major Users" of the data system are system users identified as such in strategic plans and legislation supporting the creation and maintenance of the data system. "User needs" should be in the form of questions that specific users want to be answered.

"Objectives" of the data system describe what federal programs and external users will accomplish with the information.

System objectives stated in clear, specific terms, identifying the data users and the key questions to be answered by the data system, will help guide system development to produce the results required.

Just as user needs change over time, the objectives of the data system will need to change over time to meet new requirements.

Users will benefit from knowing the objectives that guided the system design.

Guidelines

Data system objectives should be written in terms of the questions that need to be answered by the data, not in terms of the data itself.

Every data system objective should be traceable to user needs.

For example, NHTSA, as an internal user of the Fatality Analysis Reporting System (FARS), has a primary goal to improve traffic safety and a need for information related to that goal. So, one objective for FARS could be to provide an overall measure of highway safety to evaluate the effectiveness of highway safety improvement efforts.

The system owner should develop and update the data system objectives in partnership with critical users and stakeholders. The owner should have a process to regularly update the system as user needs change.

For example, for the Highway Performance Monitoring System (HPMS), one of the objectives may be to provide state and national level measures of the overall condition of the nation's public roads for Congress, condition and performance information for the traveling public, and information necessary to make equitable apportionments of highway funds to the states. The specific needs of major users have to be monitored and continuously updated.

Objectives should include timeliness of the data related to user needs.

The current data system objectives should be documented and made available to the public, unless restricted.

The updating process should be documented and include how user information is collected.

References

Huang, K., Y.W. Lee, and R.Y. Wang. 1999. Quality Information and Knowledge. Saddle River, NJ: Prentice Hall.

2.2 Data Requirements

Figure 2

Principles

An "empirical indicator" is a characteristic of people, businesses, objects, or events (e.g., people or businesses in a city or state, cars or trains in the United States, actions at airports, incidents on highways).

Examples: The level of success in stopping illicit drug smuggling into the U.S. over maritime routes. The level of use of public transit in a metropolitan area.

Before deciding on what data should be in a data system or how to acquire them, the data system objectives need to be linked to more specific "empirical indicators," from which data requirements will be derived.

Example: For FARS, the objective "To provide an overall measure of highway safety" leads to an empirical indicator of "Injury or death of people on the highways of the U.S."

Empirical indicators related to objectives can be outcomes that change as objectives are achieved, outputs from agency accomplishments related to an objective, efficiency concepts, inputs, and quality of work.

From the empirical indicators, data requirements are created for possible measurement of each empirical indicator.

Maintaining the link from data system objectives to empirical indicators to data requirements will help to ensure "relevance" of the data to users.
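
The link from objectives to empirical indicators to data requirements can be kept as a simple traceability record. The following is a minimal sketch, not a prescribed format: the class names, field names, and the "Vehicle color" requirement are hypothetical, and the objective and indicator text is adapted from the FARS example above.

```python
from dataclasses import dataclass, field


@dataclass
class EmpiricalIndicator:
    name: str
    objective: str  # the data system objective this indicator supports


@dataclass
class DataRequirement:
    name: str
    indicator: str  # the empirical indicator this requirement quantifies


@dataclass
class DataSystemPlan:
    objectives: list = field(default_factory=list)
    indicators: list = field(default_factory=list)
    requirements: list = field(default_factory=list)

    def unmeasured_objectives(self):
        """Objectives that no empirical indicator is linked to."""
        covered = {ind.objective for ind in self.indicators}
        return [obj for obj in self.objectives if obj not in covered]

    def untraceable_requirements(self):
        """Requirements that cannot be traced back to a stated objective."""
        linked = {
            ind.name for ind in self.indicators if ind.objective in self.objectives
        }
        return [req.name for req in self.requirements if req.indicator not in linked]


objective = "Provide an overall measure of highway safety"
indicator = "Injury or death of people on the highways of the U.S."

plan = DataSystemPlan(
    objectives=[objective],
    indicators=[EmpiricalIndicator(indicator, objective)],
    requirements=[
        DataRequirement("Annual count of fatalities", indicator),
        # Hypothetical requirement with no supporting indicator.
        DataRequirement("Vehicle color", "No matching indicator"),
    ],
)

print(plan.unmeasured_objectives())
print(plan.untraceable_requirements())
```

A requirement reported by `untraceable_requirements` is a candidate for removal or for a new indicator, which is one way to check "relevance" before any data are collected.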

In the data requirements, the use of standard names, variables, numerical units, codes, and definitions allows data comparisons across databases.

Besides data that are directly related to strategic plans, additional data may be required for possible cause and effect analysis.

For example, data collected for traffic crashes may include weather data for causal analysis.

Guidelines

Each data system objective should have one or more "indicators" that need to be measured. Characteristics or attributes of the target group that are the focus of the objective should be covered by one or more empirical indicators.

For HPMS, the objective "to provide a measure of highway road use" can lead to the empirical indicator of "the annual vehicle miles of travel on the interstate system and other principal arteries."

The empirical indicators should be those characteristics which, when changing in a favorable way, indicate progress toward achievement of an objective.

Note: Exceptions to this description are measures of magnitude, such as a total population or total vehicle miles traveled. These are "denominator measures" used to allow comparisons over time.

Once the empirical indicators are chosen, develop data requirements needed to quantify them.

Example: For HPMS, the empirical indicator "the annual vehicle miles of travel on the interstate system and other principal arteries" can lead to a data requirement for state-level measures of annual vehicle-miles traveled accurate to within 10 percent at 80 percent confidence.
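
A requirement stated as "within 10 percent at 80 percent confidence" can be checked against an estimate once its standard error is known. The sketch below assumes the estimate is approximately normally distributed; the function names and the VMT figures are hypothetical and are not HPMS values.

```python
from statistics import NormalDist


def relative_margin_of_error(estimate, standard_error, confidence):
    """Half-width of a two-sided normal confidence interval,
    expressed as a fraction of the estimate."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return z * standard_error / abs(estimate)


def meets_requirement(estimate, standard_error,
                      max_relative_error=0.10, confidence=0.80):
    """True if the estimate satisfies the accuracy requirement."""
    margin = relative_margin_of_error(estimate, standard_error, confidence)
    return margin <= max_relative_error


# Hypothetical state-level estimate, in millions of vehicle-miles.
vmt_estimate = 50_000.0
vmt_standard_error = 3_000.0

margin = relative_margin_of_error(vmt_estimate, vmt_standard_error, 0.80)
print(f"relative margin of error: {margin:.3f}")
print("requirement met:", meets_requirement(vmt_estimate, vmt_standard_error))
```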

There is usually more than one way to quantify an empirical indicator. All reasonable measures should be considered without regard to source or availability of data. The final data choices will be made in the "methods" phase based on ease of acquisition, constraining factors (e.g., cost, time, legal factors), and accuracy of available data.

Example: A concept of commercial airline travel "delay" can be measured as a percent of flights on-time in accordance with schedule, or a measure of average time a passenger must be in the airport including check in, security, and flight delay (feasibility of measure is not considered at this stage).

In the data requirements, each type of data should be described in detail. Key variables should include requirements for accuracy, timeliness, and completeness. The accuracy should be based on how the measure will be used for decision-making.

Example: For FARS, the concept, "The safety of people and pedestrians on the highways of the U.S." can lead to data requirements for counts of fatalities, injuries, and motor vehicle crashes on U.S. highways and streets. The fatalities for a fiscal year should be as accurate as possible (100% data collection), available within three months after the end of the fiscal year, and as complete as possible. The injury counts in traffic crashes for the fiscal year totals should have a standard error of no more than 6 percent, be available within three months after the end of the fiscal year, and have an accident coverage rate of at least 90 percent.

When selecting possible data, consider standardization with other databases. First, consider measures used for similar concepts in other DOT databases. Second, consider measures for similar concepts in databases outside DOT (e.g., The Census). Coding standards should be used where coding is used and made part of the data requirements. Such standardization leads to "coherence" across datasets.

Examples: the North American Industry Classification System (NAICS) codes, the Federal Information Processing Standards (FIPS) for geographic codes (country, state, county, etc.), the Standard Occupational Classification (SOC) codes, and International Organization for Standardization (ISO) codes (money, countries, containers).

The current data system empirical indicators and data requirements should be documented and clearly posted with the data.

References

The Federal Information Processing Standards (FIPS) home page, http://www.itl.nist.gov/fipspubs/.
The North American Industry Classification System, http://www.census.gov/epcd/www/naics.html.
OMB Primer on Performance Management dated 2/28/1995.
American Association for Public Opinion Research. 1998. "Standard Definitions Final Dispositions of Case Codes and Outcome Codes for RDD Telephone Surveys and In-Person Household Surveys," http://www.aapor.org/ethics/stddef.html.

2.3 Methods to Acquire Data

Given data requirements for a wide range of possible measures, the next phase is to consider the realities associated with gathering the data to construct estimates and perform analysis. After looking at the ease of data acquisition, complexity of possible acquisition approaches, budget restrictions, and time considerations, the list of possible measures is likely to be reduced to a more reasonable level. First, consider possible sources of data and then the process of acquiring it.

Figure 3

The more critical data needs invariably require greater accuracy. This in turn usually leads to a more complex data collection process. As the process gets more complex, there is no substitute for expertise. If the expertise for a complex design is not available in-house, consider acquiring the expertise by either contacting an agency specializing in statistical data collection like the Bureau of Transportation Statistics or by getting contractual support.

2.4 Sources of Data

Principles

A common arrangement in transportation is a reporting collection in which the target group automatically sends data. Most of these are dictated by law or regulation. That limits the collection planning to working out the physical details.

For example: 46 USC Chapter 61 specifies a marine casualty reporting collection, while 46 CFR 4.05 specifies details.

If existing data can be found that address the data requirements, using them is by far the most efficient (i.e., cheapest) approach to data acquisition. Sources of existing data can be current data systems or administrative records.

"Administrative records" are data that are created by government agencies to perform facilitative functions, but do not directly document the performance of mission functions (National Archives definition). In addition to providing a source for the data itself, administrative records may also provide information helpful in the design of the data collection process (e.g., sampling lists, stratification information).

For example: state driver's license records, Social Security records, IRS records, boat registration records, and mariner license records.

Another method, less costly than developing a new data collection system, is to use existing data collections tailored to your needs. The owner of such a system may be willing to add additional data collection or otherwise alter the collection process to gather data that will meet data requirements.

For example, the Bureau of Transportation Statistics Omnibus Survey is a monthly transportation survey that will add transportation-related questions for special collections of data from several thousand households. This method could be used if the process is accurate enough for the data system's needs.

The "target group" is the group of all people, businesses, objects, or events about which information is required.

For example, the following could be target groups: all active natural gas pipelines in the U.S. on a specific day, traffic crashes in FY2000 involving large trucks, empty seat-miles on the MARTA rail network in Atlanta on a given day, hazardous material incidents involving radioactive material in FY2001, mariners in distress on a given day, and U.S. automobile drivers.

One possible approach is to go directly to the "target group," either all of them (100%) or a sample of them. This would work with people or businesses.

Another method frequently necessary with transportation data is the use of third party sources. Third party sources are people, businesses, or even government entities that have knowledge about the target group or collect information for other purposes, such as investigators, observers, or service providers (e.g., doctors).

Examples: traffic observers, police observers, investigators, bus drivers counting passengers, state data collectors.

Guidelines

Research whether government and private data collections already have data that meet the data requirements. Consider surveys, reporting collections, and administrative records.

If existing data meet some but not all of the data requirements, determine whether the existing data systems can be altered to meet the data needs.

For example, another agency may be willing to add to or alter their process in exchange for financial support.

A primary consideration in whether to gather data from the target group or from an indirect source is access to the entire group. A 100% collection obviously needs access to the entire target group. A sample will not include the entire target group, but every member should have a known, non-zero probability of selection; otherwise the sample will not necessarily be representative of the target group.

Consider getting information directly from the target group (if they are people or businesses), having the target group observed (events as they occur), or getting information about the target group from another source (third party source discussed above).

In some situations, the information desired is not directly available. In this case, consider collecting related information that can be used to derive or estimate the information required.

For example: collecting the number of people getting on and off a bus at each stop, combined with a separate estimate of trip length between stops, to estimate passenger miles.
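
The bus example can be worked through as a small calculation: the passenger load on each segment is the running total of boardings minus alightings, and passenger miles are the sum of load times segment length. This is a sketch under the assumption that counts and distances are available for every stop on a single trip; the function name and all figures are hypothetical.

```python
def passenger_miles(boardings, alightings, segment_miles):
    """Estimate passenger miles for one bus trip.

    boardings[i], alightings[i]: counts observed at stop i.
    segment_miles[i]: estimated distance from stop i to stop i + 1.
    """
    if len(boardings) != len(alightings):
        raise ValueError("need one boarding and one alighting count per stop")
    if len(segment_miles) != len(boardings) - 1:
        raise ValueError("need one distance per segment between stops")

    load = 0       # passengers on board after leaving the current stop
    total = 0.0    # accumulated passenger miles
    for stop, miles in enumerate(segment_miles):
        load += boardings[stop] - alightings[stop]
        if load < 0:
            raise ValueError(f"negative passenger load after stop {stop}")
        total += load * miles
    return total


# Hypothetical counts for a four-stop trip.
ons = [10, 5, 2, 0]
offs = [0, 3, 6, 8]
miles_between_stops = [1.5, 2.0, 0.5]

trip_passenger_miles = passenger_miles(ons, offs, miles_between_stops)
print(trip_passenger_miles)
```

Here the loads on the three segments are 10, 12, and 8 passengers, so the estimate is 10 x 1.5 + 12 x 2.0 + 8 x 0.5 = 43 passenger miles.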

When using third-party data for a data system, ensure that the data from the third party meets data requirements. If the third party source is mandated or a sole source for the data, gather information on each data requirement, as available.

The choices made for sources and their connection to the data requirements should be documented and clearly posted with the data, or with disseminated output from the data.

References

Electronic Records Work Group Report to the National Archives and Records Administration dated September 14, 1998.

2.5 Data Collection Design

Principles

The design of data collection is one of the most critical phases in developing a data system. The accuracy of the data and of estimates derived from the data depends heavily on the design of data collection.

For example, accuracy depends on a proper sample design that makes use of complex sample design features to minimize variance. The data collection process itself will also determine the accuracy and completeness of the raw data.

Data collection from 100% of the target group is usually the most accurate approach, but is not always feasible due to cost, time, and other resource restrictions. It also is often far more accurate than the data requirements demand and can be a waste of resources.

A "probability sample" is an efficient way to automatically select a data source representative of the target group with the accuracy determined by the size of the sample.

When sampling people, businesses, and/or things, sampling lists (also known as frames) of the target group are required to select the sample. Availability of such lists is often a restriction to the method used in data collection.

For most statistical situations, it is usually important to be able to estimate the variance along with estimating the mean or total.

Sample designs should be based on established sampling theory, making use of multi-staging, stratification, and clustering to enhance efficiency and accuracy.

Sample sizes should be determined based on the data requirements for key data, taking into account the sample design and missing data.
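
As an illustration of the last principle, a sample size can be derived from an accuracy requirement and then adjusted for the sample design and for missing data. The sketch below uses the standard simple-random-sampling formula for a mean or total with a finite population correction, scaled by a design effect and an expected response rate. The function name and every input value are assumptions for illustration; a real design would take them from prior data or a pilot study.

```python
import math
from statistics import NormalDist


def required_sample_size(cv, rel_error, confidence,
                         population_size=None,
                         design_effect=1.0,
                         response_rate=1.0):
    """Sample size needed to estimate a mean or total.

    cv: assumed coefficient of variation of the measured values.
    rel_error: target margin of error as a fraction of the estimate.
    confidence: two-sided confidence level (e.g., 0.80).
    population_size: frame size, for the finite population correction.
    design_effect: variance inflation relative to simple random sampling.
    response_rate: expected fraction of sampled units yielding usable data.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    n = (z * cv / rel_error) ** 2
    if population_size is not None:
        n = n / (1.0 + n / population_size)   # finite population correction
    n = n * design_effect / response_rate     # design and missing-data adjustments
    return math.ceil(n)


# Hypothetical planning inputs.
n_simple = required_sample_size(cv=1.0, rel_error=0.10, confidence=0.80)
n_adjusted = required_sample_size(cv=1.0, rel_error=0.10, confidence=0.80,
                                  population_size=2000,
                                  design_effect=1.5,
                                  response_rate=0.8)
print(n_simple, n_adjusted)
```

The adjusted size is larger than the simple one even after the finite population correction, because clustering (design effect above 1) and nonresponse both reduce the effective sample.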

Guidelines

The data collection designer should use a probability sample, unless a 100% collection is required by law, necessitated by accuracy requirements, or turns out to be inexpensive (e.g., data readily available).

For example, a system that collects data to estimate the total vehicle miles traveled (VMT) for a state of the U.S. cannot possibly collect 100 percent of all trips on every road, so a sampling approach is necessary. However, when it comes to collecting passenger miles for a large transit system, it may be possible with fare cards and computer networks to collect 100% of passenger miles.

The sample design should give all members of the target group a non-zero (and known) probability of being represented in the sample.

Danger: Samples of convenience, such as collecting transportation counts at an opportune location, will produce data, but the data will almost always be biased. In contrast, randomly selecting counting sites from all possible locations is statistically sound (with allowances for correlations between locations).
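
Random selection of counting sites can be sketched as a simple random sample drawn from a list of all candidate locations, so that every site has the same known, non-zero probability of selection. This is only one possible design; the site identifiers, the seed, and the function name below are hypothetical.

```python
import random


def select_counting_sites(candidate_sites, sample_size, seed=None):
    """Draw a simple random sample of sites without replacement.

    Returns the selected sites and the inclusion probability n / N
    that applies to every site on the list.
    """
    sites = list(candidate_sites)
    if not 0 < sample_size <= len(sites):
        raise ValueError("sample size must be between 1 and the frame size")
    rng = random.Random(seed)            # a fixed seed makes the draw reproducible
    selected = rng.sample(sites, sample_size)
    inclusion_probability = sample_size / len(sites)
    return selected, inclusion_probability


# Hypothetical frame of 200 candidate counting locations.
frame = [f"site-{i:03d}" for i in range(1, 201)]
selected_sites, probability = select_counting_sites(frame, 20, seed=2011)
print(probability)
print(sorted(selected_sites)[:5])
```

Recording the seed and the inclusion probability alongside the selected sites lets the selection be documented and the resulting counts be weighted when estimates are produced.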

The design of any samples should be based on established sampling theory. Determine sample size using appropriate formulas to ensure that data requirements for accuracy are met, with adjustments for sample design and missing data. Use an appropriate random method to select the sample according to the design.

If some form of sampling is used, design the data collection to collect sufficient information to estimate the variance of each estimate to be produced.

The collection design and its connection to the data requirements should be documented and clearly posted with the data, or with disseminated output from the data. The documentation should include references for the sampling theory used.

If the data collection process performed by DOT uses sampling, a statistician or other sampling expert should develop or review the design.

If the data system uses third-party data collected using sampling, sample design information should be collected and provided with the collection design documentation, when available.
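
The guideline on collecting enough information to estimate variance can be illustrated for the simplest case. The sketch below estimates a population total and its standard error from a simple random sample drawn without replacement, using the textbook formulas for that design. It does not apply to stratified, clustered, or multi-stage designs, which need design-specific variance estimators. The function name and the counts are hypothetical.

```python
import math
from statistics import mean, variance


def estimate_total_srs(sample_values, population_size):
    """Estimate a population total and its standard error,
    assuming simple random sampling without replacement."""
    n = len(sample_values)
    if n < 2:
        raise ValueError("at least two observations are needed for a variance")
    if n > population_size:
        raise ValueError("sample cannot be larger than the population")

    total = population_size * mean(sample_values)
    s2 = variance(sample_values)              # sample variance (n - 1 denominator)
    fpc = 1.0 - n / population_size           # finite population correction
    var_total = population_size ** 2 * fpc * s2 / n
    return total, math.sqrt(var_total)


# Hypothetical daily counts at 5 sites sampled from a frame of 50 sites.
counts = [120, 95, 140, 110, 135]
estimated_total, standard_error = estimate_total_srs(counts, population_size=50)
print(estimated_total, round(standard_error, 1))
```

Note that the calculation needs the frame size and the individual sample values, not just the sample mean, which is why the collection design has to retain that information.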

References

Cochran, William G., Sampling Techniques (3rd Ed.), New York: Wiley, 1977.
