Statistical Policy and Research BTS Guide to Good Statistical Practice 2. Planning Data Systems
A data system produced within a DOT agency is linked to that organization's strategic planning. The data are compiled to measure success toward a goal, to satisfy an external user need (which should also be a goal), or to serve as a tool necessary to perform work toward a goal. Data system planning consists of three stages: development of objectives for the system, translation of those objectives into data requirements, and planning of the top-level methods that will be used to acquire the data.
Figure 1
2.1 Data System Objectives
Principles
- The terms "system sponsor" and "sponsoring organization," as used in these guidelines, refer to the organizational entity whose strategic plan and budget will guide the creation of the data system. It is usually at the agency level.
- These guidelines assume that the sponsoring organization's strategic plan is current and contains all of its goals and objectives, including those relative to the creation of the data system.
- "Objectives" of the data system describe what federal programs and external users will accomplish with the information. They should be traceable to the strategic plan goals.
- System objectives stated in clear, specific terms, identifying data users and key questions to be answered by the data system, will help guide the system development to produce the results required.
- Just as strategic plans change over time, the objectives of the data system will need to change over time to meet new requirements.
- Users will benefit from knowing the objectives that guided the system design.
Guidelines
- Every data system objective should be traceable to the goals and objectives in the sponsoring organization's strategic plan.
For example, NHTSA's primary goal is to improve traffic safety, so one initial objective for the Fatality Analysis Reporting System (FARS) could be to provide an overall measure of highway safety as an objective basis to evaluate the effectiveness of highway safety improvement efforts.
- The system sponsor should develop and update the data system objectives in partnership with critical users and stakeholders. The sponsor should have a system in place to regularly update the objectives as user needs change.
- The objectives should indicate each major need that will be fulfilled by the system and the data users associated with that need, and the key questions that will be answered by the data.
For example, for the Highway Performance Monitoring System (HPMS), one of the objectives may be: to provide state and national level measures of the overall condition of the nation's public road systems as investment information for Congress, condition and performance information for the traveling public, and information necessary to make equitable apportionments of highway funds to the states.
- Objectives should include timeliness of the data.
- The current data system objectives should be documented and clearly posted with the data, or with the disseminated output from the data.
- The updating system should be documented and include how user information is collected.
References
- Huang, K., Y.W. Lee, and R.Y. Wang. 1999. Quality Information and Knowledge. Saddle River, NJ: Prentice Hall.
2.2 Data Requirements
Figure 2
Principles
- A "measurement concept" is a characteristic of people, businesses, objects, or events (e.g., people or businesses in a city or state, cars or trains in the United States, actions at airports, incidents on highways).
Examples: The level of success stopping illicit drug smuggling into the U.S. over maritime routes. The level of use of public transit in a metropolitan area.
- Before deciding on what data should be in a data system or how to acquire them, the data system objectives need to be linked to more specific "measurement concepts," from which data requirements will be derived.
Example: For FARS, the objective "To provide an overall measure of highway safety" leads to the measurement concept of "The safety of people and pedestrians on the highways of the U.S."
- Measurement concepts related to objectives can be outcomes that change as objectives are achieved, outputs from agency accomplishments related to an objective, efficiency concepts, inputs, and quality of work.
- From the measurement concepts, data requirements are created for possible measurement of each measurement concept.
- Maintaining the link from data system objectives to measurement concepts to data requirements will help to ensure "relevance" of the data to users.
- In the data requirements, the use of standard names, variables, numerical units, and codes allows data comparisons across databases.
- Besides data that are directly related to strategic plans, additional data may be required for possible cause and effect analysis.
For example, data collected for traffic accidents may include weather data for causal analysis.
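The chain from objective to measurement concept to data requirement described in the principles above can be kept as a small data structure and checked for broken links. The following is only a minimal sketch: the class names, field names, and the `untraceable` helper are hypothetical, and the FARS entries are illustrative values taken from the examples in this guide, not an actual FARS specification.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataRequirement:
    name: str
    units: str


@dataclass
class MeasurementConcept:
    description: str
    requirements: List[DataRequirement] = field(default_factory=list)


@dataclass
class Objective:
    statement: str
    strategic_goal: str  # the strategic plan goal this objective traces to
    concepts: List[MeasurementConcept] = field(default_factory=list)


def untraceable(objectives):
    """Return a description of every missing link in the chain."""
    gaps = []
    for obj in objectives:
        if not obj.strategic_goal:
            gaps.append(f"objective without strategic goal: {obj.statement}")
        if not obj.concepts:
            gaps.append(f"objective without measurement concept: {obj.statement}")
        for concept in obj.concepts:
            if not concept.requirements:
                gaps.append(f"concept without data requirement: {concept.description}")
    return gaps


# Illustrative entries only (hypothetical record layout).
fars = Objective(
    statement="To provide an overall measure of highway safety",
    strategic_goal="Improve traffic safety",
    concepts=[
        MeasurementConcept(
            description="The safety of people and pedestrians on the highways of the U.S.",
            requirements=[DataRequirement("fatalities", "count per fiscal year")],
        ),
        # Hypothetical supporting concept that has no data requirement yet.
        MeasurementConcept(description="Weather at time of accident"),
    ],
)

gaps = untraceable([fars])
for gap in gaps:
    print(gap)
```

A check of this kind flags concepts that were added without a corresponding data requirement, which is one way the link to "relevance" can be lost.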
Guidelines
- Each data system objective should have one or more "concepts" that need to be measured. Characteristics or attributes of the target group that are the focus of the objective should be covered by one or more measurement concepts.
For HPMS, the objective "to provide a measure of highway road use" can lead to the measurement concept of "the annual volume of vehicles on state and interstate roads."
- The measurement concepts should be those characteristics which, when changing in a favorable way, indicate progress toward achievement of an objective.
Note: Exceptions to this description are measures of magnitude, such as a total population or total vehicle miles traveled. These are "denominator measures" used to allow comparisons over time.
- Once the measurement concepts are chosen, develop data requirements needed to quantify them.
Example: For HPMS, the measurement concept, "the annual volume of vehicles on state and interstate roads" can lead to a data requirement for state-level measures of annual vehicle-miles traveled accurate to within 10 percent at 80 percent confidence.
- There is usually more than one way to quantify a measurement concept. All reasonable measures should be considered without regard to source or availability of data. The final data choices will be made in the "methods" phase based on ease of acquisition, constraining factors (e.g., cost, time, legal factors), and accuracy of available data.
Example: A concept of commercial airline travel "delay" can be measured as a percent of flights on-time in accordance with schedule, or a measure of average time a passenger must be in the airport including check in, security, and flight delay (feasibility of measure is not considered at this stage).
- The data requirements for each type of data should include required accuracy, timeliness, and completeness. The accuracy should be based on how the measure will be used for decision-making.
Example: For FARS, the concept, "The safety of people and pedestrians on the highways of the U.S." can lead to data requirements for counts of fatalities, injuries, and motor vehicle crashes on U.S. highways and streets. The fatalities for a fiscal year should be as accurate as possible (100% data collection), available within three months after the end of the fiscal year, and as complete as possible. The injury counts in traffic accidents for the fiscal year totals should have a standard error of no more than 6 percent, be available within three months after the end of the fiscal year, and have an accident coverage rate of at least 90 percent.
- When selecting possible data, consider standardization with other databases. First, consider measures used for similar concepts in other DOT databases. Second, consider measures for similar concepts in databases outside DOT (e.g., The Census). Coding standards should be used where coding is used and made part of the data requirements. Such standardization leads to "coherence" across datasets.
Examples: the North American Industry Classification System (NAICS) codes, the Federal Information Processing Standards (FIPS) for geographic codes (country, state, county, etc.), the Standard Occupational Classification (SOC) codes, International Organization for Standardization (ISO) codes (money, countries, containers).
- The current data system measurement concepts and data requirements should be documented and clearly posted with the data.
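An accuracy requirement such as the HPMS example above ("accurate to within 10 percent at 80 percent confidence") can be restated as a maximum relative standard error, which is often the more convenient form when designing a collection. The sketch below assumes the estimate is approximately normally distributed and that the confidence interval is two-sided; the function name is hypothetical.

```python
from statistics import NormalDist


def max_relative_standard_error(relative_margin, confidence):
    """Largest relative standard error for which a two-sided normal
    confidence interval stays within +/- relative_margin of the estimate."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return relative_margin / z


# "Within 10 percent at 80 percent confidence"
rse = max_relative_standard_error(0.10, 0.80)
print(f"maximum relative standard error: {rse:.3f}")
```

Under these assumptions the requirement corresponds to a relative standard error of roughly 8 percent. A different distributional assumption would give a different figure.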
References
- The Federal Information Processing Standards (FIPS) home page, http://www.itl.nist.gov/fipspubs/
- The North American Industry Classification System, http://www.census.gov/epcd/www/naics.html
- OMB Primer on Performance Management dated 2/28/1995
- American Association for Public Opinion Research. 1998. "Standard Definitions Final Dispositions of Case Codes and Outcome Codes for RDD Telephone Surveys and In-Person Household Surveys." http://www.aapor.org/ethics/stddef.html.
2.3 Methods to Acquire Data
Given data requirements for a wide range of possible measures, the next phase is to consider the realities associated with gathering the data to construct estimates and perform analysis. After looking at the ease of data acquisition, complexity of possible acquisition approaches, budget restrictions, and time considerations, the list of possible measures is likely to be reduced to a more reasonable level. First, consider possible sources of data and then the process of acquiring it.
Figure 3
The more critical data needs invariably require greater accuracy. This in turn usually leads to a more complex data collection process. As the process gets more complex, there is no substitute for expertise. If the expertise for a complex design is not available in-house, consider acquiring it either by contacting an agency that specializes in statistical data collection, like the Bureau of Transportation Statistics, or by getting contractual support.
2.4 Sources of Data
Principles
- A common arrangement in transportation is a reporting system in which the target group automatically sends data. Most of these are dictated by law or regulation. That limits the collection planning to working out the physical details.
For example: 46 USC Chapter 61 specifies a marine casualty reporting system, while 46 CFR 4.05 specifies details.
- Use of existing data is by far the most efficient (i.e., cheapest) approach to data acquisition. Sources of existing data can be current data systems or administrative records.
- "Administrative records" are data that are created by government agencies to perform facilitative functions, but do not directly document the performance of mission functions (National Archives definition). In addition to providing a source for the data itself, administrative records may also provide information helpful in the design of the data collection system (e.g., sampling lists, stratification information).
For example, state driver's license records, social security records, IRS records, boat registration records, mariner license records.
- Another method, less costly than developing a new data collection system, is to use existing data collections tailored to your needs. The sponsor of such a system may be willing to add additional data collection or otherwise alter the collection process to gather data for your needs.
For example, the Bureau of Transportation Statistics Omnibus Survey is a monthly transportation survey that will add questions related to transportation for special collections of data from several thousand households. This method could be used if the process is accurate enough for the data system needs.
- The "target group" is the group of all people, businesses, objects, or events about which information is required.
For example, the following could be target groups: all active natural gas pipelines in the U.S. on a specific day, traffic accidents in FY2000 involving large trucks, empty seat-miles on the MARTA rail system in Atlanta on a given day, hazardous material incidents involving radioactive material in FY2001, mariners in distress on a given day, and all U.S. automobile drivers.
- One possible approach is to go directly to the "target group," either all of them (100%) or a sample of them. This would work with people or businesses.
- Another method frequently necessary with transportation data is the use of third party sources. Third party sources are people, businesses, or even government entities that have knowledge about the target group or collect information for other purposes, such as investigators, observers, or service providers (e.g., doctors).
Examples: traffic observers, police observers, investigators, bus drivers counting passengers, state data collectors.
Guidelines
- Research whether government and private data gathering systems already have data that meet the data requirements. Consider surveys, reporting systems, and administrative records.
- If existing data meet some but not all of the data requirements, determine whether the existing data collection system can be altered to meet the data needs.
For example, another agency may be willing to add to or alter their process in exchange for financial support.
- A primary consideration in whether to gather data from the target group or an indirect source is access to those sources, and to all of them. A 100% data collection obviously needs access to the entire target group. A sample will not include the entire target group, but every member should have a non-zero (and known) probability of selection; otherwise the sample will not necessarily be representative of the target group.
- Consider getting information directly from the target group (if they are people or businesses), having the target group observed (events as they occur), or getting information about the target group from another source (third party source discussed above).
- In some situations, the information desired is not directly available. In this case, consider collecting related information that can be used to derive or estimate the information required.
For example: Collecting the number of people on and off a bus at each stop combined with a separate estimate of trip length between stops to estimate passenger miles.
- The choices made for sources and their connection to the data requirements should be documented and clearly posted with the data, or with disseminated output from the data.
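The bus example above, in which passenger miles are derived from on/off counts and distances between stops, can be sketched as follows. The number of passengers on board over each segment is the running total of boardings minus alightings, and passenger miles are the sum of that load times the segment length. The function name and the counts and distances are hypothetical.

```python
def passenger_miles(boardings, alightings, segment_miles):
    """Estimate passenger miles for one bus trip.

    boardings[i], alightings[i]: people getting on and off at stop i.
    segment_miles[i]: estimated distance from stop i to stop i + 1.
    """
    if len(boardings) != len(alightings):
        raise ValueError("need one boarding and one alighting count per stop")
    if len(segment_miles) != len(boardings) - 1:
        raise ValueError("need one distance per pair of consecutive stops")

    load = 0      # passengers on board after leaving the current stop
    total = 0.0
    for i, miles in enumerate(segment_miles):
        load += boardings[i] - alightings[i]
        if load < 0:
            raise ValueError(f"negative load after stop {i}; check the counts")
        total += load * miles
    return total


# Hypothetical counts for a four-stop trip.
boardings = [10, 5, 0, 0]
alightings = [0, 3, 4, 8]
segment_miles = [1.0, 2.0, 0.5]

trip_passenger_miles = passenger_miles(boardings, alightings, segment_miles)
print(trip_passenger_miles)  # 10*1.0 + 12*2.0 + 8*0.5 = 38.0
```

The negative-load check is a simple edit on the raw counts. It does not replace the accuracy assessment of the count data themselves.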
References
- Electronic Records Work Group Report to the National Archives and Records Administration dated September 14, 1998.
2.5 Data Collection Design
Principles
- The design of data collection is one of the most critical phases in developing a data system. The accuracy of the data, and of estimates derived from the data, is heavily dependent upon the design of data collection.
For example, the accuracy is dependent upon proper sample design, making use of sampling complexity to minimize variance. The data collection process itself will also determine the accuracy and completeness of the raw data.
- For large target groups, data collection from 100% of the target group is usually the most accurate approach, but is not always feasible due to cost, time, and other resource restrictions. It also is often far more accurate than the data requirements demand and can be a waste of resources.
- A "probability sample" is an efficient way to automatically select a data source representative of the target group with the accuracy determined by the size of the sample.
- When sampling people, businesses, and/or things, sampling lists (also known as frames) of the target group are required to select the sample. Availability of such lists is often a restriction to the method used in data collection.
- For most statistical situations, it is usually important to be able to estimate the variance along with estimating the mean or total.
- Sample designs should be based on established sampling theory, making use of multi-staging, stratification, and clustering to enhance efficiency and accuracy.
- Sample sizes should be determined based on the data requirements for key data, taking into account the sample design and missing data.
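The last principle can be illustrated with a textbook sample size calculation for estimating a mean or total. The sketch below starts from the simple random sampling formula, applies a finite population correction, and then inflates the result for the sample design (a design effect) and for expected missing data (a response rate). This ordering of adjustments is one common convention, not the only one, and the function name, parameter names, and input values are all hypothetical.

```python
import math
from statistics import NormalDist


def required_sample_size(unit_cv, relative_margin, confidence,
                         population_size, design_effect=1.0,
                         response_rate=1.0):
    """Units to select so the estimate is within +/- relative_margin
    at the given confidence, assuming a normal approximation.

    unit_cv: coefficient of variation of the measured quantity among units.
    design_effect: variance of the planned design relative to simple
        random sampling (greater than 1 for most clustered designs).
    response_rate: expected fraction of selected units yielding usable data.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    n0 = (z * unit_cv / relative_margin) ** 2        # infinite population
    n_srs = n0 / (1.0 + n0 / population_size)        # finite population correction
    return math.ceil(n_srs * design_effect / response_rate)


# Hypothetical inputs: 5,000 units in the target group, unit CV of 1.0,
# within 10 percent at 80 percent confidence, clustered design, 80% response.
n = required_sample_size(unit_cv=1.0, relative_margin=0.10, confidence=0.80,
                         population_size=5000, design_effect=1.5,
                         response_rate=0.8)
print(n)
```

In practice the unit-level variability and the design effect are themselves estimates, usually taken from a pilot study or an earlier collection, so the result is a planning figure rather than a guarantee.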
Guidelines
- If the target group is large, the data collection designer should use a probability sample, unless a 100% collection is required by law, necessitated by accuracy requirements, or turns out to be inexpensive (e.g., data readily available).
For example, a system that collects data to estimate the total vehicle miles traveled (VMT) for a state of the U.S. cannot possibly collect 100 percent of all trips on every road, so a sampling approach is necessary. However, when it comes to collecting passenger miles for a large transit system, it may be possible with fare cards and computer systems to collect 100% of passenger miles.
- The sample design should give all members of the target group a non-zero (and known) probability of being represented in the sample.
DANGER => Samples of convenience, such as collecting transportation counts at an opportune location, will produce data, but the data will almost always be so biased as to be useless. By contrast, selecting locations from all possible locations using a sampling system will be statistically sound (with allowances for correlations between locations).
- The design of any samples should be based on established sampling theory.
- Determine the sample size using appropriate formulas to ensure that data requirements for accuracy are met, with adjustments for sample design and missing data. Use an appropriate random method to select the sample according to the design.
- If some form of sampling is used, design the data collection to collect sufficient information to estimate the variance of each estimate to be produced.
- The collection design and its connection to the data requirements should be documented and clearly posted with the data, or with disseminated output from the data. The documentation should include references for the sampling theory used.
- If the data collection process uses sampling, a statistician or other sampling expert should develop or review the design.
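The guidelines on known selection probabilities, random selection, and variance estimation can be illustrated with the simplest probability design, simple random sampling without replacement, in which every unit on the sampling list has selection probability n/N. The sketch below selects such a sample and estimates a total together with its standard error, using the standard formulas for this design. The function names, the frame of 200 count locations, and the generated counts are hypothetical; a real design with stratification or clustering needs the corresponding estimators.

```python
import math
import random
import statistics


def select_srs(frame, n, seed):
    """Simple random sample without replacement; each unit on the
    frame has the same known selection probability, n / len(frame)."""
    rng = random.Random(seed)
    return rng.sample(frame, n)


def estimate_total(sample_values, population_size):
    """Estimated population total and its standard error under
    simple random sampling without replacement."""
    n = len(sample_values)
    mean = statistics.fmean(sample_values)
    s2 = statistics.variance(sample_values)          # sample variance
    total = population_size * mean
    var_total = population_size ** 2 * (1 - n / population_size) * s2 / n
    return total, math.sqrt(var_total)


# Hypothetical frame: 200 candidate count locations with made-up daily counts.
rng = random.Random(1)
frame = list(range(200))
daily_count = {unit: rng.randint(100, 900) for unit in frame}

sample = select_srs(frame, 20, seed=2)
observed = [daily_count[unit] for unit in sample]
total, standard_error = estimate_total(observed, len(frame))
print(f"estimated total: {total:.0f}, standard error: {standard_error:.0f}")
```

Recording the seed and the frame used for selection is one way to make the random selection reproducible for the documentation called for above.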
References
- Cochran, William G., Sampling Techniques (3rd Ed.), New York: Wiley, 1977.