Appendix C - Sample Design, Data Collection, and Estimation
OVERVIEW
The primary goal for the 2002 Commodity Flow Survey (CFS) is to estimate shipping volumes (value, tons, and ton-miles) by commodity and mode of transportation at varying levels of geographic detail. A secondary objective is to estimate the volume of shipments moving from one geographic area to another (i.e., flows of commodities between states, regions, etc.) by mode and commodity. A detailed description of the sample design for the 2002 CFS is provided below.
SAMPLE DESIGN
The sample for the 2002 Commodity Flow Survey (CFS) was selected using a stratified three-stage design in which the first-stage sampling units were establishments, the second-stage sampling units were groups of four 1-week periods (reporting weeks) within the survey year, and the third-stage sampling units were shipments.
First Stage
Sampling frame
To create the first-stage sampling frame, we extracted a subset of establishment records from the Business Register (formerly the Standard Statistical Establishment List) as of September 2001. The Business Register is a database of all known establishments located in the United States or its territories. (An establishment is a single physical location where business transactions take place or services are performed.) Establishments located in the United States, having nonzero payroll in 2000, and classified in mining (except oil and gas extraction), manufacturing, wholesale, or electronic shopping and mail order retail industries, as defined by the 1997 North American Industry Classification System (NAICS), were included on the sampling frame. Auxiliary establishments (e.g., warehouses and central administrative offices) with shipping activity were also included on the sampling frame. Auxiliary establishments are establishments that are primarily involved in rendering support services for other establishments within the same company, rather than for the public, government, or other business firms. All other establishments included on the sampling frame are referred to as nonauxiliary establishments.
Some portion of establishments classified in the Retail Trade sector in the 1997 Economic Census was expected to be classified in the Wholesale Trade sector in the 2002 Economic Census. Because we wanted complete coverage of the Wholesale Trade sector as defined for the 2002 Economic Census, the 2002 CFS sampling frame also included establishments that were classified in particular retail industries (automotive parts and accessories, tires, floor coverings, building materials, nursery and garden, and office supplies) in the 1997 Economic Census and had characteristics indicating that they were likely to be classified as wholesale in the 2002 Economic Census. Of the establishments selected for the 2002 CFS from this set of establishments, only those that were classified as wholesale in the 2002 Economic Census were used in the production of estimates for this report.
Establishments classified in forestry, fishing, utilities, construction, transportation, services, and all other retail industries were not included on the sampling frame. Farms and government-owned entities (except government-owned liquor stores) were also excluded from the sampling frame. The resulting frame comprised approximately 760,000 establishments.
For each establishment we extracted sales, payroll, number of employees, a six-digit NAICS code, name and address, and a primary identifier. We also computed a measure of size for each establishment. The measure of size was designed to approximate an establishment's annual total value of shipments for the year 2000.
All of the establishments included on the sampling frame had state, county, and place geographic codes. We used these codes to assign each establishment to one of the 273 metropolitan areas (MAs) defined as a combination of the metropolitan statistical areas (MSAs) and consolidated metropolitan statistical areas (CMSAs). Establishments not located in an MA were assigned to MA 9999.
Stratification
We stratified the sampling frame by geography and industry. Geographic strata were defined by a combination of the 50 states, the District of Columbia, and the top 50 metropolitan areas (MAs) based on their population in Census 2000. If a particular MA was not one of the 50 largest, then it was collapsed with the remaining MAs and non-MAs within the state in which the particular MA resided. We refer to these collapsed strata as Rest of State (ROS) strata. When an MA crossed state boundaries, we considered the size of each part of the MA relative to the MA's total measure of size when determining whether or not to create strata in each state in which the MA was defined.
The industry strata were determined as follows. Within each of the geographic strata, we started with a total of 45 industry groups based on 1997 NAICS: 3 mining (four-digit NAICS); 21 manufacturing (three-digit NAICS); 18 wholesale (four-digit NAICS); 1 retail (NAICS 4541); and 2 auxiliary (NAICS 4931 and 5511). We then applied the following rule: an industry stratum was defined within a geographic stratum if it contributed at least 2 percent to its corresponding state total measure of size or at least 2 percent to the national total measure of size for the industry. Industry groups not meeting these criteria were combined into at most 12 new collapsed industry strata using a clustering algorithm. Because of potential differences in shipping patterns between auxiliary and nonauxiliary establishments, we created two industry strata of auxiliary establishments in every geographic stratum. We refer to a particular geographic-by-industry combination as a primary stratum. Also note that a separate stratum was created at the national level for those Retail Trade sector establishments that we included in our sample.
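The 2 percent rule for industry strata can be illustrated with a short Python sketch. The function name and arguments are hypothetical, and the sketch assumes that "state total" means the total measure of size of all frame establishments in the state; the clustering step that collapses the remaining industry groups is not shown.

```python
def keeps_own_stratum(cell_mos, state_total_mos, national_industry_mos,
                      threshold=0.02):
    """Return True if an industry group forms its own stratum.

    cell_mos: measure of size of the industry group within the geographic stratum.
    state_total_mos: total measure of size for the corresponding state (assumed meaning).
    national_industry_mos: national total measure of size for the industry.
    """
    share_of_state = cell_mos / state_total_mos
    share_of_industry = cell_mos / national_industry_mos
    # Either condition alone is enough to define the stratum.
    return share_of_state >= threshold or share_of_industry >= threshold
```

Under these assumptions, an industry group holding 3 percent of its state's measure of size keeps its own stratum, while a group below 2 percent on both measures would be passed to the collapsing step.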
Sample size and allocation
To reduce the sampling variability of the estimates, we used a stratified design with a certainty component. Within each primary stratum, a boundary (or cutoff) that divides the certainty establishments from the noncertainty establishments was determined using the Lavallee-Hidiroglou algorithm. If an establishment's measure of size was greater than the cutoff, the establishment was selected with certainty. Establishments selected with certainty represent only themselves (i.e., they had a selection probability of one and a sampling weight of one).
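As a minimal sketch of how a cutoff divides a primary stratum, the following Python function takes the cutoff as given; it does not implement the Lavallee-Hidiroglou algorithm that produced the cutoff. The function name and the dictionary layout are hypothetical.

```python
def split_by_cutoff(measure_of_size, cutoff):
    """Split establishments into certainty and noncertainty groups.

    measure_of_size: dict mapping establishment id -> measure of size.
    cutoff: certainty boundary for the primary stratum (taken as given).
    """
    # Strictly greater than the cutoff means selected with certainty.
    certainty = {k: v for k, v in measure_of_size.items() if v > cutoff}
    noncertainty = {k: v for k, v in measure_of_size.items() if v <= cutoff}
    return certainty, noncertainty


# Certainty establishments have selection probability 1, hence weight 1.
CERTAINTY_SAMPLING_WEIGHT = 1.0
```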
Because the 2002 sample was about half the size of the 1997 CFS sample, we were concerned about the ability of the sample to capture less frequent types of shipments (e.g., air, water, rail, and hazardous materials). After considering several different alternatives, we felt the best approach was to identify those establishments which made the bulk of these types of shipments in 1997 and then select them with certainty. To identify these establishments, we proceeded as follows.
We identified all establishments in the 1997 CFS sample that reported shipments made by air, water, or rail. We also identified those establishments that reported shipments of hazardous materials. For each of these establishments, we computed the percentage of the establishment's total value and tonnage accounted for by each of these types of shipments. Next, we matched these establishments to the sampling frame for the 2002 CFS and identified each establishment with measure of size less than the certainty boundary. For both value and tons, we then looked to see what percent of the total volume of shipments for each type of shipment was captured by selecting with certainty the top 50, top 100, or all establishments. We considered the top 50 establishments as those establishments making the largest volume of each type of shipment (air, water, rail, hazardous). Once these establishments were identified, we grouped them into one file and unduplicated them. This procedure added a total of about 500 certainty establishments.
Establishments not selected with certainty made up the noncertainty frame. We further stratified the noncertainty establishments within each primary stratum using the measure of size previously described. We refer to these measure-of-size strata as substrata of the primary strata. The measure-of-size stratification increased the efficiency of the sample design. The Dalenius-Hodges cumulative square root of frequency (cumulative √f) rule was used to set the substratum boundaries. We then used optimum allocation to determine the sample size required within each substratum to meet a coefficient of variation constraint on an estimate of the total measure of size for the primary stratum. Within each substratum, a simple random sample of establishments was selected without replacement.
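The Dalenius-Hodges rule can be sketched in Python as follows. This is an illustration of the general technique, not the production procedure: the number of histogram bins, the equal-width binning, and the function name are all assumptions, and the optimum allocation step is not shown.

```python
import math


def cum_sqrt_f_boundaries(values, n_strata, n_bins=50):
    """Approximate substratum boundaries using the cumulative sqrt(f) rule.

    values: measures of size for the noncertainty establishments.
    n_strata: number of substrata wanted.
    n_bins: number of equal-width histogram bins (an assumed tuning choice).
    Boundaries are returned as upper bin edges; lumpy data can yield repeats.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; nothing to stratify")
    width = (hi - lo) / n_bins

    # Frequency of establishments in each bin.
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1

    # Cumulative sum of the square roots of the bin frequencies.
    cum, running = [], 0.0
    for c in counts:
        running += math.sqrt(c)
        cum.append(running)
    total = cum[-1]

    # Cut where the cumulative sum first reaches k/n_strata of the total.
    boundaries = []
    for k in range(1, n_strata):
        target = k * total / n_strata
        i = next(j for j, c in enumerate(cum) if c >= target)
        boundaries.append(lo + (i + 1) * width)
    return boundaries
```

The rule places boundaries so that each substratum covers roughly the same share of the cumulative √f scale, which tends to give narrower substrata where the size distribution is spread thin.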
To arrive at the final sample size, we allocated additional establishments to some of the strata so that the minimum substratum sample size was two and the probability of selecting any establishment was no less than 1 in 100. In total, the first-stage sample comprised 51,005 establishments.
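The two minimum constraints can be expressed as a small adjustment to each substratum's allocated sample size. The Python sketch below is illustrative only; the function name and the cap at the frame count are assumptions.

```python
def apply_minimums(n_allocated, n_frame, min_n=2, min_prob_denominator=100):
    """Raise a substratum sample size to satisfy both minimum constraints.

    n_allocated: sample size from the allocation step.
    n_frame: number of establishments on the frame in the substratum.
    min_n: minimum substratum sample size (two in the text).
    min_prob_denominator: selection probability must be at least
        1 / min_prob_denominator (1 in 100 in the text).
    """
    # Smallest n with n / n_frame >= 1/100, using integer ceiling division.
    n_for_probability = -(-n_frame // min_prob_denominator)
    n = max(n_allocated, min_n, n_for_probability)
    # Assumed cap: a sample cannot exceed the number of frame units.
    return min(n, n_frame)
```

For example, a substratum of 1,000 establishments allocated only 3 would be raised to 10, so that each establishment has at least a 1-in-100 chance of selection.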
Second Stage
The frame for the second stage of sampling consisted of the 52 weeks from January 6, 2002, to January 4, 2003. Each establishment selected into the 2002 CFS sample was systematically assigned to report for four reporting weeks, one in each quarter of the reference year. Each of the four weeks was in the same relative position within its quarter. For example, an establishment might have been requested to report data for the 5th, 18th, 31st, and 44th weeks of the reference year. In this instance, each reporting week corresponds to the 5th week of each quarter. Prior to assignment of weeks to establishments, we sorted the selected sample by primary stratum (state × metropolitan area × industry) and measure of size.
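The relationship between a within-quarter position and the four reporting weeks can be sketched in Python. The cyclic assignment in `assign_positions` is only one plausible reading of "systematically assigned"; the starting position and all function names are assumptions.

```python
from datetime import date, timedelta

FRAME_START = date(2002, 1, 6)   # first day of week 1 of the frame
WEEKS_PER_QUARTER = 13


def reporting_weeks(position):
    """Weeks of the reference year for a within-quarter position (1 to 13)."""
    if not 1 <= position <= WEEKS_PER_QUARTER:
        raise ValueError("position must be between 1 and 13")
    return [position + WEEKS_PER_QUARTER * q for q in range(4)]


def week_dates(week):
    """First and last calendar day of a given week of the frame."""
    start = FRAME_START + timedelta(weeks=week - 1)
    return start, start + timedelta(days=6)


def assign_positions(sorted_ids, start=1):
    """Cycle through positions 1..13 down the sorted sample (assumed scheme)."""
    return {est: (start - 1 + i) % WEEKS_PER_QUARTER + 1
            for i, est in enumerate(sorted_ids)}
```

Calling `reporting_weeks(5)` gives weeks 5, 18, 31, and 44, matching the example in the text. Cycling positions down a list sorted by primary stratum and measure of size would spread each stratum's establishments across all weeks of the quarter.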
Third Stage
For each of the four reporting weeks in which an establishment was asked to report, we requested the respondent to construct a sampling frame consisting of all shipments made by the establishment in the reporting week. Each respondent was asked to count or estimate the total number of shipments comprising the sampling frame and to record this number on the questionnaire. For each assigned reporting week, if an establishment made more than 40 shipments during that week, we asked the respondent to select a systematic sample of the establishment's shipments and to provide us with information only about the selected shipments. If an establishment made 40 or fewer shipments during that week, we asked the respondent to provide information about all of the establishment's shipments made during that week; i.e., no sampling was required.
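A respondent's selection step might look like the following Python sketch. The 40-shipment threshold comes from the text, but the sampling interval and random start are assumptions: the actual instructions given to respondents are on the questionnaire and are not reproduced here.

```python
import random


def select_shipments(shipments, take_all_limit=40, interval=None, rng=None):
    """Return all shipments, or a systematic sample if there are too many.

    shipments: the week's shipments in file order.
    take_all_limit: report everything at or below this count (40 in the text).
    interval: take every interval-th shipment; by default a hypothetical
        value chosen so that roughly take_all_limit shipments are selected.
    rng: optional random.Random instance used for the random start.
    """
    shipments = list(shipments)
    n = len(shipments)
    if n <= take_all_limit:
        return shipments          # no sampling required
    if interval is None:
        interval = -(-n // take_all_limit)   # ceiling division
    rng = rng or random.Random()
    start = rng.randrange(interval)          # random start within first interval
    return shipments[start::interval]
```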
DATA COLLECTION
Each establishment selected into the CFS sample was mailed a questionnaire for each of its four reporting weeks. We mailed each establishment a questionnaire once every quarter of 2002. For a given establishment, we requested that the respondent provide the following information about each of the establishment's reported shipments: shipment identification number, the date on which the shipment was made, value, weight, commodity, mode(s) of transportation, domestic destination or port of exit, an indication of whether the shipment was an export, and the United Nations or North America (UN/NA) number for hazardous material shipments. For a shipment that included more than one commodity, the respondent was instructed to report the commodity that made up the greatest percentage of the shipment's weight. For an export shipment, we also asked the respondent to provide the mode of export and the foreign destination city and country. See Appendix E for a copy of the questionnaire.
IMPUTATION OF SHIPMENT VALUE OR WEIGHT
To correct for nonresponse to either the value or weight item for a given shipment reported in the CFS, the missing value or value that failed edit is replaced by a predicted value obtained from an appropriate model. Such a shipment is considered a "recipient" if its commodity code is valid and the other item is reported greater than zero and passed edit. The recipient's item that is missing or failed edit is imputed as follows. First, a "donor" shipment is randomly selected from shipments that were reported in the CFS with:
- The same commodity code as the recipient.
- Both value and weight items reported greater than zero and passed edit.
- Origin and value for the item reported by the recipient similar to those of the recipient.
Then, the donor's value and weight data are used to calculate a ratio, which is applied to the recipient's reported item, to impute the item that is missing or failed edit. If no donor is found, the median ratio for all shipments reported in the survey with the same commodity code as the recipient and with both value and weight items reported greater than zero is applied to the recipient's reported item. For either the value or weight item, about 3 percent of the shipment records input to the calculation of estimates have imputed data for the item.
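The ratio step can be sketched in Python as below. The sketch assumes the ratio is value divided by weight and that the caller has already filtered the eligible donors (same commodity, similar origin, similar reported item); the similarity criteria themselves are not specified here. All names and the dictionary layout are hypothetical.

```python
import random
from statistics import median


def value_per_weight(shipment):
    """Ratio of value to weight for a fully reported shipment."""
    return shipment["value"] / shipment["weight"]


def impute_missing_item(recipient, eligible_donors, same_commodity_complete,
                        rng=None):
    """Fill in a recipient's missing value or weight.

    recipient: dict with "value" and "weight"; exactly one is None.
    eligible_donors: shipments meeting all donor criteria (may be empty).
    same_commodity_complete: all complete shipments with the same commodity
        code, used for the median-ratio fallback.
    """
    rng = rng or random.Random()
    if eligible_donors:
        ratio = value_per_weight(rng.choice(eligible_donors))
    else:
        ratio = median(value_per_weight(s) for s in same_commodity_complete)

    filled = dict(recipient)
    if filled.get("value") is None:
        filled["value"] = filled["weight"] * ratio
    else:
        filled["weight"] = filled["value"] / ratio
    return filled
```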
ESTIMATION
Estimated totals (e.g., value of shipments, tons, ton-miles) are produced as the sum of weighted shipment data (reported or imputed). Percent change and percent-of-total estimates are derived using the appropriate estimated totals. Estimates of average miles per shipment are computed by dividing an estimate of the total miles traveled by the estimated number of shipments. The annualized growth rate A for estimates from year y1 to y2 is computed as:
$$A = \left(\frac{E_{y_2}}{E_{y_1}}\right)^{1/(y_2 - y_1)} - 1$$
where $E_{y_1}$ and $E_{y_2}$ are estimates of the value of shipments, tons, ton-miles, or average miles per shipment for years y1 and y2, respectively. The annualized growth rate measures the annual rate of change between estimates from any 2 years by assuming a constant yearly rate of change.
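A constant yearly rate of change implies a compound growth calculation. The Python sketch below assumes that form and returns the rate as a fraction rather than a percent; the function and argument names are hypothetical.

```python
def annualized_growth_rate(estimate_y1, estimate_y2, y1, y2):
    """Constant yearly rate that carries estimate_y1 to estimate_y2.

    Assumes compound growth: estimate_y2 = estimate_y1 * (1 + A) ** (y2 - y1).
    Returns A as a fraction (0.10 means 10 percent per year).
    """
    years = y2 - y1
    if years == 0:
        raise ValueError("y1 and y2 must differ")
    return (estimate_y2 / estimate_y1) ** (1.0 / years) - 1.0
```

For instance, an estimate that rises from 100 to 121 over two years corresponds to an annualized rate of 0.10, since 100 × 1.10 × 1.10 = 121.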
Each shipment has associated with it a single tabulation weight, which was used in computing all estimates to which the shipment contributes. The tabulation weight is a product of seven different component weights. A description of each component weight follows.
CFS respondents provided data for a sample of shipments made by their respective establishments in the survey year. For each establishment, we produced an estimate of that establishment's total value of shipments for the entire survey year. To do this, we used four different weights: the shipment weight, the shipment nonresponse weight, the quarter weight, and the quarter nonresponse weight.
As with establishments, we identified shipments as either certainty or noncertainty. (See the Nonsampling Error section in Appendix B for a description of how certainty shipments were identified.) For noncertainty shipments, the shipment weight was defined as the ratio of the total number of shipments (as reported by the respondent) made by an establishment in a reporting week to the number of sampled shipments for the same week. This weight uses data from the sampled shipments to represent all the establishment's shipments made in the reporting week. However, a respondent may have failed to provide sufficient information about a particular sampled shipment. For example, a respondent may not have been able to provide value, weight, or a destination for one of the sampled shipments. If this data item could not be imputed, then this shipment did not contribute to tabulations and was deemed unusable. (A usable shipment is one that has valid entries for value, weight, and origin and destination ZIP Codes.) To account for these unusable shipments, we applied the shipment nonresponse weight. For noncertainty shipments from a particular establishment's reporting week, this weight is equal to the ratio of the number of sampled shipments for the reporting week to the number of usable shipments for the same week. The shipment weight for certainty shipments from a particular establishment's reporting week is equal to one.
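The two shipment-level weights are simple ratios, sketched below in Python with hypothetical function names. The text defines the nonresponse weight for noncertainty shipments only, so the sketch makes no assumption about certainty shipments for that weight.

```python
def shipment_weight(total_shipments, sampled_shipments, certainty=False):
    """Weight that expands sampled shipments to all shipments in the week."""
    if certainty:
        return 1.0   # certainty shipments represent only themselves
    return total_shipments / sampled_shipments


def shipment_nonresponse_weight(sampled_shipments, usable_shipments):
    """Weight that compensates for sampled shipments that were unusable."""
    return sampled_shipments / usable_shipments
```

As an example, an establishment reporting 120 shipments in a week, of which 40 were sampled and 32 were usable, has a shipment weight of 3.0 and a shipment nonresponse weight of 1.25. Their product, 3.75, equals 120 / 32, so each usable shipment stands in for 3.75 of the week's shipments.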
The quarter weight inflates an establishment's estimate for a particular reporting week to an estimate for the corresponding quarter. For noncertainty shipments, the quarter weight is equal to 13. The quarter weight for most certainty shipments is also equal to 13. However, if a respondent was able to provide information about all large (or certainty) shipments made in the quarter containing the reporting week, then the quarter weight for each of these shipments was one. For each establishment, the quarterly estimates were added to produce an estimate of the establishment's value of shipments for the entire survey year. Whenever an establishment did not provide the Census Bureau with a response for each of its four reporting weeks, we computed a quarter nonresponse weight. The quarter nonresponse weight for a particular establishment is defined as the ratio of the number of quarters for which the establishment was in business in the survey year to the total number of quarters (reporting weeks) for which we received usable shipment data from the establishment.
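The quarter-level weights can be sketched the same way. The function and parameter names are hypothetical; the value 13 reflects the 13 weeks in each quarter of the 52-week frame.

```python
WEEKS_PER_QUARTER = 13


def quarter_weight(all_certainty_shipments_for_quarter_reported=False):
    """Weight that expands one reporting week to its quarter.

    Set the flag only for certainty shipments whose respondent reported
    every large shipment made in the quarter; those need no expansion.
    """
    if all_certainty_shipments_for_quarter_reported:
        return 1.0
    return float(WEEKS_PER_QUARTER)


def quarter_nonresponse_weight(quarters_in_business, quarters_with_usable_data):
    """Weight that compensates for reporting weeks with no usable data."""
    return quarters_in_business / quarters_with_usable_data
```

An establishment in business all four quarters that supplied usable data for three of its reporting weeks would receive a quarter nonresponse weight of 4/3.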
Using these four component weights, we computed an estimate of each establishment's value of shipments for the entire survey year. We then multiplied this estimate by a factor that adjusts the estimate using value of shipments and sales data obtained from other surveys and censuses conducted by the Census Bureau. This weight, the establishment-level adjustment weight, attempts to correct for any sampling or nonsampling errors that occur during the sampling of shipments by the respondent.
The adjusted value of shipments estimate for an establishment was then weighted by the establishment weight. This weight is equal to the reciprocal of the establishment's probability of being selected into the sample.
A final adjustment weight, the industry-level adjustment weight, uses information from other surveys and censuses conducted by the Census Bureau to account for establishments from which we did not receive a response (including establishments from which we did not receive any usable shipment data) and for changes in the population of establishments between the time the first-stage sampling frame was constructed (2001) and the year in which the data were collected (2002). Separate industry-level adjustment weights were determined for nonauxiliary and auxiliary establishments.
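Putting the pieces together, the tabulation weight is the product of the seven component weights, and an estimated total is the weighted sum over shipments. The Python sketch below uses hypothetical field names and made-up numbers purely for illustration.

```python
from dataclasses import dataclass
from math import prod


@dataclass
class ComponentWeights:
    """The seven component weights attached to one shipment."""
    shipment: float
    shipment_nonresponse: float
    quarter: float
    quarter_nonresponse: float
    establishment_adjustment: float
    establishment: float          # reciprocal of the selection probability
    industry_adjustment: float

    def tabulation_weight(self):
        """Single weight used in every estimate the shipment contributes to."""
        return prod([
            self.shipment,
            self.shipment_nonresponse,
            self.quarter,
            self.quarter_nonresponse,
            self.establishment_adjustment,
            self.establishment,
            self.industry_adjustment,
        ])


def estimated_total(weighted_items):
    """Sum of tabulation weight times the shipment's value, tons, or ton-miles."""
    return sum(weight * amount for weight, amount in weighted_items)
```

With illustrative component weights of 3.0, 1.25, 13.0, 1.0, 1.1, 20.0, and 1.05, the tabulation weight is about 1,126, meaning that one usable shipment record would represent roughly that many shipments in the published totals.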