Statistical Policy and Research BTS Guide to Good Statistical Practice 3. Collection of Data
Given the collection design, the next phase in the data acquisition process is the collection process itself. This collection process can be a one-time execution of a survey, a monthly (or other periodic) data collection, a continuous reporting of incident data, or a compilation of data already collected by one or more third parties. The physical details of carrying out the collection are critical to making the collection design a reality.
Figure 4
3.1 Data Collection Operations
Principles
- Forms, questionnaires, automated collection screens, and file layouts are the medium through which data are collected. They consist of sets of questions or annotated blanks on paper or computer that request information from data suppliers. They need to be designed to maximize communication to the data supplier.
- Data collection includes all the processes involved in carrying out the data collection design to acquire data. Data collection operations can have a high impact on the ultimate data quality, especially when they deviate from the design.
- The data collection method should be appropriate to the data complexity, collection size, data requirements, and amount of time available.
For example, a reporting system will often primarily rely on the required reporting mechanism, with follow-up for missing data. Similarly, a large survey requiring a high response rate will often start off with a mail out, followed by telephone contact, and finally by a personal visit.
- Specific data collection environmental choices can significantly affect error introduced at the collection stage.
For example, if the data collector is collecting as a collateral duty or is working in an uncomfortable environment, the quality of the data collected may suffer. Likewise, data that are particularly difficult to collect are more prone to error.
- Conversion of data on paper to electronic form (e.g., key entry, scanning) introduces a certain amount of error which must be controlled.
- Third-party sources of data introduce error in their own collection processes.
- Computer-assisted information collection can result in more timely and accurate information. Initial development costs will be higher, and much more lead time will be required to develop, program, and test the data collection system. However, the data can be checked and corrected when originally entered, key-entry error is eliminated, and the lag between data collection and data availability is reduced.
- The use of sensors for data collection can significantly reduce error.
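As an illustration of checking data "when originally entered" in a computer-assisted collection, the following Python sketch validates a keyed value against a range rule before it is accepted. The field name `passenger_count`, the limits in `RULES`, and the function `check_entry` are hypothetical and chosen only for this example; an actual collection system would define its own edits from the collection design.

```python
# Hypothetical edit rules: each field has an allowed whole-number range.
RULES = {
    "passenger_count": {"min": 0, "max": 120},
}

def check_entry(field, raw_value, rules=RULES):
    """Check one keyed value at entry time.

    Returns (value, None) when the entry passes, or (None, message)
    so the collector can correct it before moving on.
    """
    rule = rules[field]
    try:
        value = int(raw_value)
    except ValueError:
        return None, f"{field}: '{raw_value}' is not a whole number"
    if not rule["min"] <= value <= rule["max"]:
        return None, f"{field}: {value} is outside {rule['min']}-{rule['max']}"
    return value, None

# Example: a mistyped value is rejected immediately rather than stored.
value, problem = check_entry("passenger_count", "4x")
```

Rejecting the entry at the point of collection lets the person who has the source information make the correction, rather than leaving it to later processing.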
Guidelines
- Forms, screens, or file layouts used for data collection are clearly defined for data suppliers, with entries in a logical sequence, reasonable visual cues, and limited skip patterns. Instructions should help minimize missing data and response error.
- Computer-assisted collection should be considered when the collection is repeated over a long period of time, so that the gains in quality and data processing time are worth the expense. Use of sensors (e.g., GPS, counters) should be considered to reduce error.
Examples: Central telephone interviewing with computer screens and data entry by the interviewer. Handheld devices for entering train inspection data on-scene. Traffic counters with automatic upload to a central location.
- A status tracking system should be used to ensure that data are not lost in mailings, file transfers, or collection handling.
- Data entry of paper forms should have a verification system based on data accuracy requirements.
For example, the verification samples of key entry forms can be based on an average outgoing quality limit for batches of forms. A somewhat more expensive approach would be 100 percent verification.
- Make the data collection as easy as possible for the collector.
- If interviewers or observers are used, a formal training process should be established to ensure proper procedures are followed.
- Data calculations and conversions at the collection level should be minimized.
For example, if a bus driver is counting passengers, they should not be doing calculations such as summations. The driver should record the raw counts and calculations should be performed where they are less likely to result in mistakes.
- The collection operation procedures should be documented and clearly posted with the data, or with disseminated output from the data.
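As an illustration of the key-entry verification guideline above, the following Python sketch shows how an average outgoing quality limit might be computed for one sampling plan. The plan itself is an assumption made for this example: verify a sample of n forms from a batch of N, accept the batch if at most c sampled forms contain errors, and otherwise re-verify the whole batch, correcting every error found. The binomial error model and the function names are likewise assumptions, not a procedure prescribed by this guide.

```python
from math import comb

def acceptance_probability(n, c, p):
    """Probability that a sample of n forms has at most c forms in error,
    assuming each form is in error independently with probability p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c + 1))

def average_outgoing_quality(batch_size, n, c, p):
    """Expected error rate after verification when the incoming rate is p.

    Assumes errors found in the sample are corrected and rejected
    batches are fully re-verified, so only accepted batches pass
    errors through, and only in the unsampled forms.
    """
    return p * acceptance_probability(n, c, p) * (batch_size - n) / batch_size

def aoql(batch_size, n, c, steps=1000):
    """Approximate the average outgoing quality limit: the worst-case
    outgoing error rate over a grid of incoming error rates."""
    return max(
        average_outgoing_quality(batch_size, n, c, i / steps)
        for i in range(steps + 1)
    )

# Example plan: batches of 1,000 forms, verify 50, accept only if none are in error.
limit = aoql(batch_size=1000, n=50, c=0)
```

Under these assumptions the example plan holds the long-run outgoing error rate to roughly 0.7 percent of forms, whatever the incoming error rate. Setting n equal to the batch size corresponds to 100 percent verification, for which the computed outgoing error rate is zero.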
References
- Federal Committee on Statistical Methodology. 1983. Approaches to Developing Questionnaires. Washington, DC: U.S. Office of Management and Budget (Statistical Policy Working Paper 10).
- Groves, R. 1989. Survey Errors and Survey Costs. New York, NY: Wiley, Chs. 10 & 11.
3.2 Missing Data Avoidance
Principles
- Some missing data occur in almost any data collection effort. Unit-level missing data occur when a report that should have been received is completely missing or is received and cannot be used (e.g., garbled data, missing key variables). Item-level missing data occur when data are missing for one or more items in an otherwise complete report.
For example, for an incident report for a hazardous material spill, unit-level missing data occur if the report was never sent in. They would also occur if the report was sent in, but all entries were obliterated. Item-level missing data would occur if the report was complete, except that it did not indicate the quantity spilled.
- The extent of unit-level missing data can sometimes be difficult to determine. If a report should be sent in whenever a certain kind of incident occurs, then non-reporters can only be identified if crosschecked with other data sources. On the other hand, if companies are required to send in periodic reports, the previous period may provide a list of the expected reporters for the current period.
Both can also be true for item-level missing data. For example, in a travel survey asking for trips made, forgotten trips would not necessarily be known.
- Some form of missing data follow-up will dramatically reduce the incidence of both unit-level and item-level missing data.
For example, a system to recontact the data source can be used, especially when critical data are left out. A series of recontacts may be used for unit nonresponse. Incident reporting systems can use some form of cross-check with other data sources to detect incidents that occur but are not reported.
- When data are supplied by a third-party data collector, some initial data check and follow-up for missing data will dramatically reduce the incidence of missing data.
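The distinction between unit-level and item-level missing data, and the use of the previous period's reporters as the expected list, can be sketched in Python as follows. The record layout (a dictionary per report, with `None` marking a missing entry), the item names, and the function names are hypothetical and used only for illustration.

```python
def classify_report(report, key_items, all_items):
    """Classify one expected report as 'unit', 'item', or 'complete'.

    'unit'     : report never received, or a key item is missing,
                 so the report cannot be used
    'item'     : report usable, but one or more other items are missing
    'complete' : nothing missing
    """
    if report is None:
        return "unit"
    if any(report.get(item) is None for item in key_items):
        return "unit"
    if any(report.get(item) is None for item in all_items):
        return "item"
    return "complete"

def missing_reporters(previous_period_ids, current_period_ids):
    """Reporters seen last period that have not reported this period."""
    return sorted(set(previous_period_ids) - set(current_period_ids))

# Example: a spill report that omits the quantity spilled.
spill = {"report_id": "R1", "incident_date": "2001-05-01", "quantity_spilled": None}
status = classify_report(
    spill,
    key_items=["report_id", "incident_date"],
    all_items=["report_id", "incident_date", "quantity_spilled"],
)
overdue = missing_reporters(["A", "B", "C"], ["A", "C"])
```

A comparison like `missing_reporters` only works for periodic reporting. For incident reports, where no expected list exists, the cross-check has to come from another data source.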
Guidelines
- Data collection programs should be conducted in a manner that is likely to produce high rates of response.
- All data collection programs require some follow-up of missing reports and data items, even if the data are provided by third-party sources.
For example, for surveys and periodic reports, it is easy to tell what is missing at any stage and institute some form of contact (e.g., mail out, telephone contact, or personal visit) to fill in the missing data. For incident reports, it is a little more difficult, as a missing report may not be obvious.
- For incident reporting systems where missing reports may not be easily tracked, some form of checking system should exist to reduce missing reports.
- When collecting data from units of varying sizes (e.g., companies), the follow-up scheme should be prioritized, re-contacting larger reporters first, possibly at the risk of missing smaller reporters.
- For missing data items, the data collection sponsor should distinguish critical items, such as legally required items or otherwise important items (e.g., items used to measure DOT or agency performance), from less critical items.
- The missing data avoidance procedures should be documented and clearly posted with the data, or with disseminated output from the data.
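The follow-up guidelines above (larger reporters first, critical items distinguished from the rest) can be sketched in Python as follows. The `size` measure, the `capacity` limit on the number of recontacts, the item names, and the function names are assumptions made for this example; a collection program would choose its own size measure and its own list of critical items.

```python
def prioritize_follow_up(nonrespondents, capacity):
    """Return the reporters to recontact, largest first, limited to the
    number of recontacts the collection staff can make."""
    ranked = sorted(nonrespondents, key=lambda r: r["size"], reverse=True)
    return ranked[:capacity]

def missing_critical_items(report, critical_items):
    """List the critical items that are missing (None) in a report."""
    return [item for item in critical_items if report.get(item) is None]

# Example: three companies have not reported; only two can be recontacted.
nonrespondents = [
    {"id": "A", "size": 10},
    {"id": "B", "size": 500},
    {"id": "C", "size": 50},
]
to_contact = prioritize_follow_up(nonrespondents, capacity=2)

# Example: only the missing critical item triggers item-level follow-up.
report = {"tons_shipped": None, "contact_fax": None}
needs_follow_up = missing_critical_items(report, critical_items=["tons_shipped"])
```

In this example company A is left uncontacted, which is the risk to smaller reporters that the guideline accepts in exchange for covering the larger ones.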
References
- Groves, R.M. and M.P. Couper. 1998. Nonresponse in Household Interview Surveys. New York, NY: Wiley.