Survey Methodology Appendix B. Using SUDAAN and Other Software for the Analysis of the 2002 National
Transportation Availability and Use Survey
Variance estimation procedures have been developed to account for complex sample
designs. Using these procedures, factors such as the selection of the sample,
the use of differential sampling rates to subsample a subpopulation and
nonresponse adjustments can be appropriately reflected in estimates of sampling
error. The two main methods for estimating variances from a complex survey are
Taylor series variance estimation (linear approximation) and replication, which
includes the jackknife and balanced repeated replication (BRR) methods. Wolter
(1985) is a useful reference on the theory and applications of these methods;
Shao (1996) is a more recent review paper that compares them.
Standard statistical software packages that assume a simple random sampling
design do not properly compute variance estimates from weighted data collected
under a design other than simple random sampling. Using the variable RAKEDW00
as the final full sample weight in standard statistical programs yields
accurate point estimates; however, it does not yield accurate variance
estimates.
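As a rough illustration of the point-estimate half of that statement, the sketch below computes a weighted total and a weighted percentage from a handful of made-up records. The record values, the weights, and the function names are hypothetical; only the role of the full sample weight (RAKEDW00 on the survey file) comes from the text.

```python
# Illustrative only: toy records, not survey data.
# Each tuple is (full sample weight, category code); the weight plays the
# role that RAKEDW00 plays on the survey file.
records = [(1200.0, 1), (800.0, 2), (1500.0, 2), (500.0, 1)]


def weighted_total(rows, category):
    """Estimated population count: sum of the weights of matching records."""
    return sum(w for w, c in rows if c == category)


def weighted_percent(rows, category):
    """Estimated population percentage: weighted count over the sum of all weights."""
    return 100.0 * weighted_total(rows, category) / sum(w for w, _ in rows)


total_1 = weighted_total(records, 1)
pct_1 = weighted_percent(records, 1)
print(total_1, pct_1)
```

A standard error computed from these records as if they were a simple random sample would ignore the weights and any clustering, which is the limitation described above.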
To overcome this limitation, this document gives guidance for analyzing the
survey data using the software package SUDAAN® (Software for the
Statistical Analysis of Correlated Data) based on the Taylor series and replication
methods (Research Triangle Institute, 1997). SUDAAN is a statistical package
developed by Research Triangle Institute (RTI) to analyze data from complex
sample surveys. SUDAAN computes the standard errors of the estimates taking the
survey design into account. Although SUDAAN version 8 and later can use
replication methods, the package is most often used to compute variances based
on the first-order Taylor series approximation, also known as linearization.
Though this section only provides details on the use of SUDAAN, the software
packages STATA and WesVar can also be used, for linear approximation and
replication methods respectively.
Although SUDAAN's estimates of variance based on linearization take into
account the sample design of the survey, they do not properly reflect the
variance reduction due to raking and poststratification. The weights in this
survey were raked to control totals in the final step of the weighting process.
Replication methods are more appropriate for computing estimates of variance
under this condition. However, the magnitude of the reduction will depend on
the type of estimate (e.g., total or proportion) and the correlation between
the variable being analyzed and the dimensions used in raking.
Analysis of the Survey Data Using SUDAAN
This section describes how to use SUDAAN, with both Taylor series and
replication methods, to analyze the survey data and compute appropriate
standard errors, and it shows which options to use. The data file
contains 5,019 records, one for every completed extended interview.
I. Using Taylor Series Linear Approximation (SUDAAN and STATA)
Required Variables
The variables that provide
information about the sample design in SUDAAN are:
Variable TSVUNIT (Taylor series variance unit). The
variable TSVUNIT indicates the primary sampling unit (PSU) to be used for computing
the estimates of variance using the Taylor series method. In the survey, the
PSU corresponds to the household.
Variable RAKEDW00 (final full sample weight). The variable
RAKEDW00 contains the final weight for the full sample. This weight is positive
for all the records.
SUDAAN Keywords
The statements and keywords
needed to run SUDAAN to compute variance estimates based on the Taylor Series
approximation are:
DESIGN=WR (required). The sample was drawn without replacement;
however, the WR (with replacement) design option is used because the finite
population correction factor (fpc) is negligible. (Note: STRWR
is not used because this requires that each record be a PSU, which is not the
case because two persons could be sampled from the same household.)
NEST TSVUNIT /PSULEV = 1 (required). The keyword NEST
lists the variables whose values identify the sampling stages. The option /PSULEV
= 1 tells SUDAAN that the PSU-level variable, TSVUNIT, is in position
1 in the NEST statement.
WEIGHT RAKEDW00 (required). The keyword WEIGHT lists
the final weight to be used in the analysis. In this case, the variable for
the weight is the final full sample weight RAKEDW00.
The variable TSVSTR in
combination with the variable TSVUNIT can also be used to compute the standard
errors with the appropriate changes in the NEST statement. The variable TSVSTR
indicates the sampling stratum. In the survey, TSVSTR is set to 1 for all the
records. An example of the use of this variable is also included in the
following section.
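To make concrete what these keywords ask the software to do, the following is a minimal sketch of the textbook with-replacement linearization estimator for the standard error of a weighted mean in a single stratum (consistent with TSVSTR being 1 for every record). It is not SUDAAN's internal code, and the function name and toy inputs are hypothetical; the arguments y, w and psu stand in for an analysis variable, RAKEDW00 and TSVUNIT.

```python
from collections import defaultdict


def taylor_se_mean(y, w, psu):
    """Linearized (with-replacement) standard error of a weighted mean, one stratum.

    y, w and psu are equal-length sequences: analysis value, full sample
    weight, and PSU identifier for each record.
    """
    w_sum = sum(w)
    mean = sum(wi * yi for wi, yi in zip(w, y)) / w_sum

    # Linearized value of each record, accumulated to the PSU level so that
    # records from the same household contribute a single term.
    psu_totals = defaultdict(float)
    for yi, wi, pi in zip(y, w, psu):
        psu_totals[pi] += wi * (yi - mean) / w_sum

    n = len(psu_totals)
    z = list(psu_totals.values())
    z_bar = sum(z) / n  # zero up to rounding error for a mean
    variance = n / (n - 1) * sum((zi - z_bar) ** 2 for zi in z)
    return mean, variance ** 0.5


# Toy data: five persons in four households (PSU 1 holds two persons).
mean, se = taylor_se_mean(
    y=[30, 40, 50, 20, 60],
    w=[1.0, 1.0, 2.0, 1.0, 1.0],
    psu=[1, 1, 2, 3, 4],
)
print(mean, se)
```

With equal weights and one record per PSU, this estimator reduces to the familiar simple random sampling formula, the sample standard deviation divided by the square root of n.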
SUDAAN is not the only statistical software that can be used to generate
approximate standard errors using linear approximation. The statistical
software STATA can be used as well. The variables TSVUNIT and TSVSTR can be
used as the nesting variables and RAKEDW00 as the full sample weight in STATA
to correctly generate both point estimates and standard errors.
II. Using Jackknife Replication Methods (SUDAAN and WesVar)
The additional statements
and keywords needed to run SUDAAN to compute estimates of variance based on
replication methods are:
DESIGN=JACKKNIFE (required). The survey data file includes
replicate weights that can be used in SUDAAN. The replication method used to
create the weights is a form of the jackknife method. When estimates of variance
based on replication methods are computed, the option JACKKNIFE should
be used in the design statement.
JACKWGTS RAKEDW01-RAKEDW80 / ADJJACK=1 (required).
The keyword JACKWGTS is followed by the list of variable names for the
80 replicate weights created for the survey (RAKEDW01-RAKEDW80). When computing
variances, replicate-based estimates need to be adjusted by a constant value
c that depends on the replication method used. For the replicates in
this survey, the value of c is 1, and SUDAAN applies this adjustment
through the option ADJJACK=1.
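The role of the constant c can be sketched in a few lines. The example below assumes the usual replicate-weight form of the variance, c times the sum of squared differences between each replicate estimate and the full-sample estimate. The function names and the toy weights are hypothetical: two hand-built replicates in a paired-jackknife style, not the 80 replicate weights supplied on the survey file.

```python
def weighted_mean(y, w):
    """Weighted mean of y using weights w."""
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)


def replicate_se(y, full_w, rep_ws, c=1.0):
    """Replicate-weight standard error of a weighted mean.

    full_w is the full sample weight; rep_ws is a list of replicate weight
    vectors; c is the adjustment constant (1 for this survey's replicates).
    """
    theta = weighted_mean(y, full_w)
    rep_estimates = [weighted_mean(y, rw) for rw in rep_ws]
    variance = c * sum((t - theta) ** 2 for t in rep_estimates)
    return theta, variance ** 0.5


# Toy data: four records in two pairs. Each replicate drops one member of a
# pair (weight 0) and doubles the weight of the other member.
y = [10.0, 20.0, 30.0, 40.0]
full_w = [1.0, 1.0, 1.0, 1.0]
rep_ws = [
    [2.0, 0.0, 1.0, 1.0],  # replicate 1: perturbs the first pair
    [1.0, 1.0, 2.0, 0.0],  # replicate 2: perturbs the second pair
]
theta, se = replicate_se(y, full_w, rep_ws, c=1.0)
print(theta, se)
```

Because every replicate estimate is recomputed from a full set of replicate weights, any adjustment applied when those weights were built, such as raking, is carried through to the variance estimate.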
WesVar can also be used to generate point estimates and appropriate standard
errors using replication methods. This dataset contains 80 replicates
(RAKEDW01-RAKEDW80) for the full sample weight RAKEDW00. These replicates
should be included in the file when creating the WesVar dataset, and JK2
should be selected as the jackknife method.
The ID variable on this file is PERSID.
Estimates Using SUDAAN based on the Taylor Series approximation
Listing 1 shows an example of running SUDAAN's PROC CROSSTAB to
compute totals, percentages and standard errors for the variable GENDER
based on the Taylor Series approximation. The procedure CROSSTAB produces
weighted frequencies and percentage distributions for categorical variables.
The following statements were used to produce the output in Listing 1.
proc crosstab data = btsall design=WR ;
weight RAKEDW00 ;
NEST TSVUNIT /PSULEV=1 ;
subgroup gender ;
levels 2;
setenv colwidth = 17 decwidth= 3 ;
run ;
The following statements also produce the same output as Listing 1. The
difference is the use of the variable TSVSTR in the NEST statement.
proc crosstab data = btsall design=WR ;
weight RAKEDW00 ;
NEST TSVSTR TSVUNIT;
subgroup gender ;
levels 2;
setenv colwidth = 17 decwidth= 3 ;
run ;
Listing 1.
Sample PROC CROSSTAB Output of Marginal Totals, Percentages, and Standard
Errors*
Date: 12-12-2002          Research Triangle Institute          Page  : 1
Time: 11:31:59            The CROSSTAB Procedure               Table : 1
Variance Estimation Method: Taylor Series (WR)
by: WHAT IS YOUR/SUBJECT'S GENDER.
| Statistic      |         Total |       Level 1 |       Level 2 |
| Sample Size    |      5011.000 |      2322.000 |      2689.000 |
| Weighted Size  | 273335024.970 | 133394837.990 | 139940186.980 |
| SE Weighted    |   3826319.579 |   3328823.884 |   3319195.188 |
| Row Percent    |       100.000 |        48.803 |        51.197 |
| Col Percent    |       100.000 |        48.803 |        51.197 |
| Tot Percent    |       100.000 |        48.803 |        51.197 |
| SE Row Percent |         0.000 |         0.995 |         0.995 |
| SE Col Percent |         0.000 |         0.995 |         0.995 |
| SE Tot Percent |         0.000 |         0.995 |         0.995 |
*The standard errors of both
the estimated totals and percentages in Listing 1 are much larger than standard
errors that take raking into account. This is because the effect of raking
cannot be accounted for in PROC CROSSTAB when using Taylor series linearization.
Listing 2 shows an example of running SUDAAN's PROC DESCRIPT to compute means
and standard errors for the variable AGE
based on the Taylor Series approximation. The procedure DESCRIPT produces
weighted totals and means and their standard errors for continuous variables.
The following statements were used to produce the output in Listing 2.
PROC DESCRIPT DATA = btsall design = WR ;
WEIGHT RAKEDW00 ;
NEST TSVUNIT /PSULEV=1 ;
VAR AGE ;
setenv colwidth = 17 decwidth= 3 ;
print / style = nchs ;
run ;
Listing 2.
Sample PROC DESCRIPT Output of Means and Standard Errors
S U D A A N
Software for the Statistical Analysis of Correlated Data
Copyright Research Triangle Institute July 2001
Release 8.0.0
Date: 12-12-2002          Research Triangle Institute          Page  : 1
Time: 11:32:24            The DESCRIPT Procedure               Table : 1
Variance Estimation Method: Taylor Series (WR)
by: Variable, One.
| Statistic     | AGE AT SCREENER 1 |
| Sample Size   |          4952.000 |
| Weighted Size |     269936641.060 |
| Total         |    9544546622.010 |
| Mean          |            35.358 |
| SE Mean       |             0.423 |
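As a quick consistency check on Listing 2, the reported mean can be reproduced from two of the other figures. The sketch below assumes that 269936641.060 is the weighted size (the sum of the weights over records with a reported age) and that 9544546622.010 is the weighted total of AGE; that reading is an interpretation of the listing, and the variable names are illustrative.

```python
# Values copied from Listing 2 (Taylor series output for AGE).
weighted_size = 269936641.060    # assumed: sum of weights, records with AGE reported
weighted_total = 9544546622.010  # assumed: weighted sum of AGE
reported_mean = 35.358

# A weighted mean is the weighted total divided by the sum of the weights.
mean = weighted_total / weighted_size
print(round(mean, 3))
```

The same relationship holds under replication, since the point estimate depends only on the full sample weight and not on the variance estimation method.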
Estimates Using SUDAAN based on replication
Listing 3 shows an example of running SUDAAN's PROC CROSSTAB to compute totals,
percentages and standard errors for the variable GENDER
based on replication. The standard errors are smaller than those in Listing 1
because replication methods can reflect the reduction in variance caused by
raking. The survey weights were raked to five dimensions in the last step of
weighting. For GENDER, the standard errors are much smaller (in particular for
totals) because GENDER was used to create one of the raking dimensions. The
following statements were used to produce the output in Listing 3.
proc crosstab data = btsall design=JACKKNIFE;
weight RAKEDW00 ;
JACKWGTS RAKEDW01-RAKEDW80 /ADJJACK=1;
subgroup gender ;
levels 2;
setenv colwidth = 17 decwidth= 3 ;
run ;
Listing 3.
Sample PROC CROSSTAB Output of Marginal Totals, Percentages, and Standard
Errors
Number of observations read: 5019     Weighted count: 273643273
Denominator degrees of freedom: 80
Date: 01-08-2003          Research Triangle Institute
Time: 13:00:12            The CROSSTAB Procedure
Variance Estimation Method: Replicate Weight Jackknife
by: WHAT IS YOUR/SUBJECT'S GENDER.
| Statistic      |         Total |       Level 1 |       Level 2 |
| Sample Size    |      5011.000 |      2322.000 |      2689.000 |
| Weighted Size  | 273335024.970 | 133394837.990 | 139940186.980 |
| SE Weighted    |    129773.082 |     83463.088 |     95960.303 |
| Row Percent    |       100.000 |        48.803 |        51.197 |
| Col Percent    |       100.000 |        48.803 |        51.197 |
| Tot Percent    |       100.000 |        48.803 |        51.197 |
| SE Row Percent |         0.000 |         0.023 |         0.023 |
| SE Col Percent |         0.000 |         0.023 |         0.023 |
| SE Tot Percent |         0.000 |         0.023 |         0.023 |
Listing 4 shows an example of running SUDAAN's PROC DESCRIPT to compute means
and standard errors for the variable AGE
based on replication. The following statements were used to produce the output
in Listing 4.
PROC DESCRIPT DATA = btsall design = JACKKNIFE ;
WEIGHT RAKEDW00 ;
JACKWGTS RAKEDW01-RAKEDW80 /ADJJACK=1;
VAR AGE ;
setenv colwidth = 17 decwidth= 3 ;
print / style = nchs ;
run ;
Listing 4.
Sample PROC DESCRIPT Output of Means and Standard Errors
Date: 01-08-2003          Research Triangle Institute          Page  : 1
Time: 13:26:21            The DESCRIPT Procedure               Table : 1
Variance Estimation Method: Replicate Weight Jackknife
by: Variable, One.
| Statistic     | AGE AT SCREENER 1 |
| Sample Size   |          4952.000 |
| Weighted Size |     269936641.060 |
| Total         |    9544546622.010 |
| Mean          |            35.358 |
| SE Mean       |             0.081 |
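The size of the difference between the two methods can be read directly off the four listings. The short sketch below divides each Taylor series standard error by the corresponding jackknife standard error; the numbers are copied from Listings 1 through 4, while the dictionary keys are labels chosen here for illustration.

```python
# Standard errors copied from Listings 1 and 2 (Taylor series) and
# Listings 3 and 4 (jackknife replication).
taylor = {
    "weighted size, all persons": 3826319.579,
    "gender percent": 0.995,
    "mean age": 0.423,
}
jackknife = {
    "weighted size, all persons": 129773.082,
    "gender percent": 0.023,
    "mean age": 0.081,
}

ratios = {name: taylor[name] / jackknife[name] for name in taylor}
for name, ratio in ratios.items():
    print(f"{name}: Taylor SE is {ratio:.1f} times the jackknife SE")
```

The ratios are largest for the GENDER estimates, in line with GENDER having been used to form one of the raking dimensions, and smaller but still substantial for mean AGE.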
References
Shao, J. (1996). Resampling Methods in Sample Surveys (with Discussion). Statistics, 27, 203-254.
Wolter, K. (1985). Introduction to Variance Estimation. New York: Springer-Verlag.
Research Triangle Institute. (1997). SUDAAN® user's manual (Release 7.5). Research Triangle Park: Author.