Estimation of Domestic CFS Shipments

Thursday, September 15, 2016

4.1 OVERVIEW

The 2012 CFS covers approximately 70% of the domestic freight volume, by dollar value, that FAF4 intends to capture; the remaining 30% is shipped by businesses outside the CFS scope. The CFS reports origin, destination, commodity, and mode (ODCM) activity of covered sectors, by tons and dollar values, but not all data cells are released, for two reasons. First, measured and expanded activity captured by the CFS may be suppressed from the published tables to (a) protect the confidentiality of identifiable shippers, or (b) avoid publishing statistically unreliable estimates (namely, those with a coefficient of variation (CV) above 50%). These are cells with no quantities reported, but where activity did occur. Second, because the CFS is a sample survey, some activity may not be captured because it occurred in an establishment, or on a day, that was not sampled; this is a sampling limitation.

The FAF’s intent with the CFS component is to reproduce those shipments actually captured by the CFS. It is not to estimate the quantity and location of missed shipments, nor to estimate the probability or potential of movements occurring regardless of whether the shipments were ever realized. In other words, the FAF process estimates what the CFS would show if there were no suppression. At the most detailed ODCM level, more cells are suppressed for confidentiality or reliability reasons, or because expanded movements are rounded to zero, than there are filled cells, although the preponderance of U.S. movements (in terms of volume) does occur in unsuppressed cells.

For the FAF process, Census provided a special tabulation of domestic-only movements (i.e., excluding exports) with a looser CV threshold of 100%. This special dataset also included a count of shipments in each ODCM cell so that “zero cells” which had positive activity could be distinguished from the true zeros. The main effort in this component of FAF is to estimate suppressed cells for a comprehensive ODCM matrix on domestic CFS shipments.

4.2 ESTIMATION PROCESS

4.2.1 A Log-linear Model of Effects

The FAF process assumes that any value in the ODCM matrix is the product of a set of unknown but estimable effects. In the simplest model of independence, it is assumed that any ODCM tonnage is the product of four separate effects due to origin, destination, commodity, and mode, which can be mathematically expressed as:

U(o,d,c,m) = e[O](o) * e[D](d) * e[C](c) * e[M](m) [1]

where U is the ODCM flow matrix (with measured or estimated values), capital letters (e.g., "O") are a particular dimension, and lower case letters are given categories in the dimension. For example, the tonnage of coal (SCTG 2-digit code of ‘15’) shipped between West Virginia (FAF zone 540) and Baltimore (FAF zone 241) by rail (mode ‘2’) would be

U(540,241,15,2) = e[O](540) * e[D](241) * e[C](15) * e[M](2).

Here, each effect, say e[O], is a vector with a cell for each of the possible 132 origin zones. Thus the term e[O](540) is the origin effect for West Virginia.
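
As a minimal sketch of equation [1], the Python snippet below multiplies four one-dimensional effects to produce a single cell value. The effect values are made-up numbers for illustration only, not FAF estimates, and the function name flow_independence is hypothetical.

```python
# Hypothetical effect vectors keyed by category code (illustrative values only).
e_O = {540: 2.0, 241: 0.5}   # origin effects by FAF zone
e_D = {540: 0.8, 241: 1.5}   # destination effects by FAF zone
e_C = {15: 4.0}              # commodity effects by SCTG 2-digit code
e_M = {2: 3.0}               # mode effects (2 = rail)


def flow_independence(o, d, c, m):
    """Equation [1]: the flow is the product of the four main effects."""
    return e_O[o] * e_D[d] * e_C[c] * e_M[m]


# Coal (15) from West Virginia (540) to Baltimore (241) by rail (2).
coal_wv_to_baltimore = flow_independence(540, 241, 15, 2)
print(coal_wv_to_baltimore)  # 2.0 * 1.5 * 4.0 * 3.0 = 36.0
```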

This first-level approximation is clearly inadequate because, by assuming independence, it fails to account for the interaction effects between categories. For instance, distant origins and destinations should typically have lower volumes than nearby ones. Thus, by considering a second-level interaction effect e[OD](o,d) that influences total flows, the flow model in [1] can now be expressed as

U(o,d,c,m) = e[O](o) * e[D](d) * e[C](c) * e[M](m) * e[OD](o,d) [2]

A spatial interaction model will estimate e[OD] with a specific functional form based on the cost of interaction between o and d, but the interest here is providing the best estimate of the e[OD] 2-dimensional matrix that will make the flow estimates U closest to measurements. Likewise, there may be an origin-commodity effect e[OC], or a commodity-mode interaction effect e[CM], etc., that also needs to be considered. Therefore, by including all possible 2-dimensional effects in the model, the equation becomes

U(o,d,c,m) = e[O](o) * e[D](d) * e[C](c) * e[M](m) *
e[OD](o,d) * e[OC](o,c) * e[OM](o,m) * e[DC](d,c) *
e[DM](d,m) * e[CM](c,m) [3]

Similarly, the third-order effects and a fourth-order interaction effect can also be considered in the model. In the FAF processing, a "grand mean" e0 is also introduced into the model, which serves as a scalar factor (e.g., the difference between measuring weight in tons or ounces). With this, a fully saturated model for U is shown as:

U(o,d,c,m) = e0 * e[O](o) * e[D](d) * e[C](c) * e[M](m) * e[OD](o,d) * e[OC](o,c) *
e[OM](o,m) * e[DC](d,c) * e[DM](d,m) * e[CM](c,m) * e[ODC](o,d,c) * e[ODM](o,d,m) * e[OCM](o,c,m) * e[DCM](d,c,m) * e[ODCM](o,d,c,m). [4]

To explain the internal pattern within the flow matrix U, the task is to disentangle individual interaction effects, to see which are strong and which irrelevant (near 1). Knowing the pattern, values for any missing cell can be estimated by multiplying through the individual effects that supposedly comprise it.
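
The idea of filling a missing cell by multiplying through its effects can be sketched as follows. This is an illustration under stated assumptions, not the FAF implementation: the effects dictionary, its category labels ("A", "B", "coal", "rail"), and the function estimate_cell are all hypothetical, and any effect that is not listed is assumed to equal 1 ("no effect").

```python
from itertools import combinations

DIMS = "ODCM"

# Hypothetical effects, keyed by dimension subset and then by category tuple.
# Any effect not listed here is treated as 1.0 ("no effect").
effects = {
    "": {(): 10.0},                   # grand mean e0
    "O": {("A",): 2.0},
    "D": {("B",): 0.5},
    "OD": {("A", "B"): 1.5},
    "CM": {("coal", "rail"): 4.0},
}


def estimate_cell(o, d, c, m):
    """Multiply every effect of every order that applies to cell (o, d, c, m)."""
    cell = dict(zip(DIMS, (o, d, c, m)))
    product = 1.0
    # Orders 0 through 4 give the 16 terms of the saturated model [4].
    for order in range(len(DIMS) + 1):
        for subset in combinations(DIMS, order):
            name = "".join(subset)
            key = tuple(cell[dim] for dim in subset)
            product *= effects.get(name, {}).get(key, 1.0)
    return product


estimate = estimate_cell("A", "B", "coal", "rail")
print(estimate)  # 10.0 * 2.0 * 0.5 * 1.5 * 4.0 = 60.0
```

Storing only the effects that differ from 1 keeps the sketch short; a real effects matrix would hold a value (or an "unknown") for every cell.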

4.2.2 Estimation of Effects

To determine whether some other set of effects (other than the trivial choice e[ODCM] = U) is better suited to FAF purposes, the process selects the set that minimizes the informational content of the model, ∑ (e * ln e), summed over every effect at every level (that is, every model parameter). Roughly speaking, the goal is to find a set of effects that are as close to 1 as possible and to minimize the number of effects that differ significantly from 1. This is done by concentrating the variation (deviation from 1 = "no effect") found in high-order effects into a low-order effects matrix, which reduces the deviation in a large number of cells in exchange for increasing the deviation in a small number.
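
To make the objective concrete, the sketch below evaluates ∑ (e * ln e) for two parameterizations of a toy 2x2 origin-destination table in which every cell equals 10. The table, the parameter lists, and the function name information_content are assumptions made for illustration; the source does not say how FAF stores or counts its parameters.

```python
import math


def information_content(effects):
    """Sum of e * ln(e) over every effect parameter; an effect of 1 adds 0."""
    return sum(e * math.log(e) for e in effects)


# A 2x2 table has 9 parameters: e0, 2 origin, 2 destination, and 4 OD effects.
# Trivial solution: the four OD effects hold the data, the other five are 1.
trivial = [10.0] * 4 + [1.0] * 5
# Alternative: the grand mean absorbs the common factor, the other eight are 1.
extracted = [1.0] * 8 + [10.0]

print(information_content(trivial) > information_content(extracted))  # True
```

Both parameterizations reproduce the same table, but moving the shared factor of 10 into the single grand mean leaves eight parameters at exactly 1, which is the kind of concentration the text describes.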

The FAF solution method starts with e[ODCM] = U and cyclically finds variation that can be removed from a high-order matrix and passed to a low-order matrix, repeating the process until no more variation can be extracted. Extraction proceeds from 3-dimensional effects into 2-dimensional effects, then into 1-dimensional effects, and finally into the 0-dimensional grand mean. The cycle then repeats, starting from the 4- to 3-dimensional effects matrices, until the effects parameters stop changing.

If a cell in e[ODCM] is unknown, or zero-valued, it will not participate in the extraction process, and geometric means will be taken only from those known cells. If all high-order cells are unknown, then an "unknown" will be passed down to the next level. In principle, cells could be called true zeros if there was no CFS activity there. In the FAF processing, they are referred to as "unknown" instead, in case someday it is desired to use the effects matrix to indicate probabilities of movement rather than measured CFS movements.
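
A single extraction step with unknown cells might look like the sketch below. It is a simplified two-dimensional illustration, not the full cyclic FAF procedure: it moves the common factor of each row of a hypothetical e[OD] matrix into an origin effect, uses None to stand for "unknown", and skips unknown or zero-valued cells when taking the geometric mean. The function names are hypothetical.

```python
import math


def geometric_mean(values):
    """Geometric mean of positive values, computed as exp(mean of logs)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def extract_row_effect(matrix):
    """Move the common factor of each row into a lower-order effect.

    Unknown (None) and zero-valued cells do not participate. A row with no
    participating cells passes an "unknown" (None) down to the lower level.
    """
    row_effects, residual = [], []
    for row in matrix:
        known = [v for v in row if v is not None and v > 0]
        if not known:
            row_effects.append(None)
            residual.append(list(row))
            continue
        g = geometric_mean(known)
        row_effects.append(g)
        # Dividing by g keeps (row effect * residual) equal to the original cell.
        residual.append([v / g if v is not None and v > 0 else v for v in row])
    return row_effects, residual


e_OD = [[2.0, 8.0, None],
        [None, None, None]]
e_O, e_OD_residual = extract_row_effect(e_OD)
print(e_O)            # first entry is about 4.0, second is None
print(e_OD_residual)  # first row is about [0.5, 2.0, None]
```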

4.2.3 A Priori Estimation of Low-Order Effects

The CFS also provided lower-dimensional marginal tables that have fewer suppressed cells. Estimates of lower-order effects can be made from these marginal tables and inserted into a lower-order effects matrix before the extraction process starts. For instance, a 2-dimensional origin-destination table exists in which every (o,d) cell has been summed over all commodities and modes. That table can be taken as an initial estimate of the OD effects matrix e[OD]. It is generally convenient to normalize these matrices by their geometric means. At every step, the equality between the product of effects and measured flows must be preserved. This means that, if an a priori low-order effect is inserted, the known values of the next-higher-order effects upstream must be divided by the same amount.

Real zeros in the 2-dimensional OM, DM, OC, and CM marginal CFS tables were accepted as true zeros. However, a zero in an OD cell was treated as a sampling zero, which does not preclude the possibility of such a movement in reality. Here, a sampling zero has the same practical effect as a missing value or suppression. In the final iterative proportional fitting (IPF) step, “impossible” cells are converted to absolute zeros, since the CFS controls are zeros.
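
The a priori insertion described in this section can be sketched as follows. The tables, zone labels, and commodity names are hypothetical, and the snippet shows only one OD insertion into a three-dimensional ODC effect; it assumes unknown cells are stored as None.

```python
import math

# Hypothetical 2-D origin-destination marginal table (summed over commodity and mode).
od_marginal = {("A", "B"): 40.0, ("A", "C"): 10.0}

# Normalize by the geometric mean of the cells to get an initial e[OD].
log_mean = sum(math.log(v) for v in od_marginal.values()) / len(od_marginal)
g = math.exp(log_mean)
e_OD = {od: v / g for od, v in od_marginal.items()}

# Hypothetical higher-order effects currently holding measured values; None = unknown.
e_ODC = {
    ("A", "B", "coal"): 30.0,
    ("A", "B", "grain"): None,
    ("A", "C", "coal"): 5.0,
}

# Divide each known upstream cell by the inserted e[OD] so that the product
# e[OD] * e[ODC] still equals the measured value.
e_ODC_adjusted = {
    (o, d, c): None if v is None else v / e_OD[(o, d)]
    for (o, d, c), v in e_ODC.items()
}
```

After the adjustment, multiplying e_OD by e_ODC_adjusted for any known cell returns the original measured value, while unknown cells stay unknown.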

4.2.4 Alternate Sources and Years

There are many cases where the 2012 CFS has sampling zeros or suppressions, but where an earlier CFS (2002 or 2007) had positive levels of movement. If any region is composed entirely of "unknowns" in the 2012 matrix, it will be impossible to extract a pattern from it. However, an earlier CFS may supply one, which can be passed down the extraction chain. Using earlier surveys also allows the detection of major pattern changes between successive CFSs. Because of differences in geographic zones (as discussed in Section 2 of this report), an equivalence table between different years' zones had to be established manually. Where no equivalence could be identified, the earlier year's zone had to be ignored. Note that this process (i.e., domestic CFS) ignores differences in mode and commodity definitions.

The 2012 rail Waybill Sample was also used as an alternate source, using an STCC-to-SCTG mapping and converting county origins and destinations into FAF4 zones, while leaving shipment values unknown. Container shipments were excluded, so the sole mode involved was rail. As always, known values in the 2012 CFS are preserved, but unknown values may be imputed by multiplying effects estimated from other sources.

So far, the discussion has focused on tonnage. FAF also estimates the dollar value of shipments. This was handled with a similar model formulation, with one added dimension for the activity type (V), which has two levels: tons and dollars. An interaction effect of V with each of the other dimensions was included in the model.

4.2.5 Computation

Although this is a multiplicative rather than an additive model, and the interest is in geometric means for minimizing variation, for practical purposes all values are converted into natural logs. This is because finding the arithmetic average of logs is much easier than calculating a geometric mean directly. This computational convenience is the sole reason for calling this process a "log-linear model." At the conclusion, logs are converted back to real numbers, and missing values in the final matrix for the 2012 CFS are replaced by a product of effects. That matrix then goes to the IPF stage for processing.

4.2.6 IPF for CFS Processing

The marginal totals of the CFS form a set of control totals to which the activity matrix U must conform. In addition, there are some state-level controls, which must be matched by sums over the contiguous zones that form the state. Note that many cells in the original CFS matrix either have absolute values in them, or else have absolute zeros due to a zero sample count, and those are controls as well. These marginal controls were provided by Census in a special CFS tabulation for domestic shipments only.

For every control value (every non-missing cell in a control matrix), the values in the cells of the disaggregate table that compose the aggregate cell are summed and compared to the control. If the sum differs from the control, all the component cells are adjusted up or down by a common factor to match it. Since CFS values are rounded to the nearest integer (in kilotons or millions of dollars), a total within half a unit of the control is considered a match and needs no further adjustment. For intermodal movements, several CFS modes must be summed to match the category. When some of the component modes have values and others are missing, the known values form a floor for the FAF values, and exceeding the floor does not require adjustment.

This IPF cycle through controls is repeated until there are no more significant changes in the U cell values between subsequent iterations.
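
The IPF cycle can be illustrated with a two-dimensional sketch. This is a simplification under stated assumptions, not the FAF code: it fits only row and column controls, treats a control of None as missing, applies the half-unit matching rule, and stops when a full pass makes no adjustment (a stand-in for "no more significant changes"). The function name ipf and the sample numbers are hypothetical, and the intermodal floor rule is not modeled.

```python
def ipf(table, row_controls, col_controls, tol=0.5, max_iter=100):
    """Scale a 2-D table until row and column sums match their controls.

    A control of None is missing and is skipped. A sum within `tol` of its
    control counts as a match, mirroring the half-unit rounding rule.
    """
    table = [row[:] for row in table]  # work on a copy
    for _ in range(max_iter):
        adjusted = False
        for i, control in enumerate(row_controls):
            total = sum(table[i])
            if control is None or total == 0 or abs(total - control) <= tol:
                continue
            factor = control / total  # common factor for every cell in the row
            table[i] = [v * factor for v in table[i]]
            adjusted = True
        for j, control in enumerate(col_controls):
            total = sum(row[j] for row in table)
            if control is None or total == 0 or abs(total - control) <= tol:
                continue
            factor = control / total  # common factor for every cell in the column
            for row in table:
                row[j] *= factor
            adjusted = True
        if not adjusted:
            break
    return table


seed = [[10.0, 20.0],
        [30.0, 40.0]]
rows = [40.0, 60.0]
cols = [35.0, 65.0]
fitted = ipf(seed, rows, cols)
```

A control of zero gives a factor of zero, which sets every contributing cell to zero; this is consistent with the conversion of "impossible" cells to absolute zeros when the CFS controls are zeros.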
