28 September 2000
SAS Macro CVLR (Cross-Validation for Logistic Regression)
Written by Clint Moore, University of Georgia, Athens
Download Instructions
1. Click on the "Obtain Code" link below.
2. Click File, Save As... to start the download.
3. Save the file as CVLR.SAS
Obtain Code
Macro use (values for required parameters displayed in CAPITAL letters)
%cvlr(
data = _NAME_OF_INPUT_DATASET_ ,
xvars = _LIST_OF_INDEPENDENT_VARIABLES_ ,
outcome = _NAME_OF_RESPONSE_VARIABLE_ ,
reps = _number_of_resampling_iterations_ ,
pvalid = _proportion_of_data_in_validation_set_ ,
rndseed = _seed_for_random_number_generator_ ,
out = _output_file_of_results_ ,
weight = _name_of_weighting_variable_ ,
print = _toggle_for_printed_output_ ,
title = _title_for_output_ ,
itermax = _maximum_number_of_solution_iterations_per_model_
)
Introduction
CVLR is a SAS macro for conducting Monte Carlo cross-validation for a given logistic regression model. The macro is written primarily in the SAS IML language, so the SAS/IML product must be installed for this macro to work.
The user provides CVLR a set of data and a model that relates a binary outcome variable to a group of numeric explanatory variables. CVLR then applies resampling methodology to estimate the model's rate of classification accuracy. CVLR is an iterative procedure. In each iteration, the data are randomly split into two subsets. The model is fit to the first subset and tested on the second. The macro calculates the proportion of observations in the test subset for which the model prediction agrees with the binary outcome, and a new iteration is started. At the end of the iterations, the proportions of observations correctly predicted are averaged, and the result is reported as the estimated classification accuracy of the model.
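The averaging step described above can be written compactly. The notation below is introduced here for illustration only and does not appear in the macro: R is the number of iterations contributing results, V_r is the validation subset drawn in iteration r, n_v is its size, y_i is the observed 0-1 outcome, and \hat{y}_i^{(r)} is the outcome predicted by the model fit to the remaining data in iteration r.

```latex
a_r = \frac{1}{n_v} \sum_{i \in V_r} \mathbf{1}\!\left[\hat{y}_i^{(r)} = y_i\right],
\qquad
\hat{A} = \frac{1}{R} \sum_{r=1}^{R} a_r
```

Here a_r is the proportion correctly predicted in iteration r, and \hat{A} is the reported estimate of classification accuracy.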
Macro parameters
CVLR has 11 invocation parameters, but the user needs to specify only 3 for the macro to run:
%cvlr(data=trials, xvars=p treatment z2, outcome=survived)
In this example, CVLR will read data found in the SAS dataset trials. The variables p, treatment, and z2 represent continuous or discrete numeric explanatory variables of a logistic regression model. The variable survived represents a 0-1 binary response variable. CVLR then conducts a cross-validation analysis for these data and this model.
The macro will print a single report in the output window. The report summarizes the total number of iterations, the number of iterations for which the model was successfully fit, and the average rate of classification accuracy (and its standard error).
Values for eight additional parameters may be specified by the user. Three parameters control the design of the resampling process. The first is reps, which specifies the number of cross-validation iterations. Its default value is 20. Ideally, a cross-validation analysis should be based on thousands of iterations, but with a large dataset, it may be best to start experimentally with a few iterations. The second design parameter is pvalid, which specifies the proportion of the dataset to be used as the validation (testing) subset in each iteration. The default value of pvalid is 0.5, which means that half of the data will be used to fit the model and the remaining half will be used to test the model. The third parameter, rndseed, provides a starting seed to the random number generator. The default value is 0, which means that the computer's clock will be used to provide a seed. In this case, a different outcome may be observed over multiple runs for the same problem. Use a non-zero, positive integer seed if it is important that the cross-validation run produces a result that can be repeated later.
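As an illustration of the three design parameters, a call along the following lines would request 1000 iterations, hold out one quarter of the records for validation in each iteration, and fix the seed so that the run can be repeated. The dataset and variable names are those of the earlier example; the particular values of reps, pvalid, and rndseed are arbitrary choices for this sketch.

```sas
/* 1000 resampling iterations; 25% of records used for validation, */
/* 75% for model fitting; fixed positive seed for repeatability.   */
%cvlr(data=trials, xvars=p treatment z2, outcome=survived,
      reps=1000, pvalid=0.25, rndseed=20000928)
```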
By default, CVLR produces only a printed summary and no output dataset, but the output can be saved to a dataset of the user's choice through the out parameter. For example, to write the output to the temporary dataset "cvout", specify out=cvout.
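A minimal sketch of saving and then inspecting the results follows. The layout of the output dataset is not described here, so the sketch simply lists its variables and prints the first few records rather than assuming any particular variable names.

```sas
/* Save the cross-validation results to the temporary dataset cvout. */
%cvlr(data=trials, xvars=p treatment z2, outcome=survived, out=cvout)

/* Inspect what was written: variable list first, then a few records. */
proc contents data=cvout;
run;

proc print data=cvout (obs=10);
run;
```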
CVLR presumes that 1 record in the input dataset represents 1 real observation. If, however, each record represents several replicated observations, all with the same values of the explanatory variables and outcome variable, and if the number of replicates is provided in an additional variable, say count, then this information can be provided to the CVLR macro through the weight parameter. In this case, specifying weight=count appropriately weights each record by the number of replicate samples. Note that the splitting of the dataset into model-fitting and validation subsets is done strictly on a per-record basis. If each record represents multiple observations, all observations corresponding to a single record will be placed into either the model-fitting or validation subset, depending on the subset assignment of the record. In this case, the subsets may vary in size in terms of real observations between iterations, even though the number of records remains constant. If it is important that data splitting instead be conducted on a per-observation basis, then the user should create a new input dataset in which records represent single observations and run an unweighted analysis in CVLR. Either exclude the weight parameter, or set weight=0 (the default) for an unweighted analysis.
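The two approaches can be sketched as follows. The dataset name trials_grouped and the loop counter i are hypothetical, and the sketch assumes that count holds a positive whole number on every record.

```sas
/* Option 1: weighted analysis; data splitting is done per record. */
%cvlr(data=trials_grouped, xvars=p treatment z2, outcome=survived,
      weight=count)

/* Option 2: expand each record into count single-observation      */
/* records so that data splitting is done per observation.         */
data trials_expanded;
    set trials_grouped;
    do i = 1 to count;
        output;            /* write one copy per replicate */
    end;
    drop i count;
run;

/* Unweighted analysis on the expanded data (weight=0 is the default). */
%cvlr(data=trials_expanded, xvars=p treatment z2, outcome=survived)
```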
Two parameters control aspects of the printed output. The printed report may be suppressed by specifying print=0 (print=1 by default). The default title for the printed output is "CVLR Results", but the user may change this through the title parameter. For example, CVLR will print "My Output" on the report if the user specifies title=My Output (note: do not use quotes).
The last optional parameter is the itermax parameter. CVLR estimates the parameters of the logistic regression model through an iteratively reweighted least squares algorithm. The itermax parameter controls the number of iterations of this algorithm. The default value of itermax is 20, which should be sufficient for most applications, but may be too small for some large or unstable models.
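For example, a call of the following form would raise the iteration limit and set a custom report title. The value 50 and the title text are arbitrary illustrations. Because SAS macro parameters are separated by commas, the unquoted title text should not itself contain a comma.

```sas
/* Allow up to 50 fitting iterations per model and relabel the report. */
%cvlr(data=trials, xvars=p treatment z2, outcome=survived,
      title=Survival model CV, itermax=50)
```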
Limitations
CVLR performs only the kind of resampling cross-validation described above. CVLR is currently unable to conduct jackknife or bootstrap cross-validation. Though it is intuitively reasonable to randomly assign observations into model-fitting and testing subsets, some authors justify alternative allocation schemes on theoretical grounds (Picard and Cook 1984, Davison and Hinkley 1997:294-295).
CVLR accommodates only numeric variables as explanatory variables, and these are assumed to represent continuous covariates; the macro is currently unable to treat variables as categorical predictors. However, for some simple cases, the user may work around this limitation by using "dummy-variable" coding of the categorical variables. Rawlings (1988) and other similar works are good reference sources for dummy-variable coding of categorical effects.
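As a minimal sketch of such a workaround, suppose the input data contain a hypothetical character variable habitat with the three levels 'forest', 'marsh', and 'upland'. Two 0-1 indicator variables can be created in a DATA step, with 'upland' serving as the reference level. All names here are invented for illustration.

```sas
data trials_coded;
    set trials;
    /* A SAS comparison evaluates to 1 when true and 0 when false.  */
    /* Note: records with any other value of habitat (including     */
    /* blank) receive 0 for both indicators and so are treated as   */
    /* the reference level; screen such records beforehand.         */
    hab_forest = (habitat = 'forest');
    hab_marsh  = (habitat = 'marsh');
run;

%cvlr(data=trials_coded, xvars=p hab_forest hab_marsh, outcome=survived)
```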
The value 0.5 is used as the cutpoint for assessing classification accuracy of the model. In the future, this value will be controllable through a user-set parameter but currently is not. However, two lines of programming in the classify module of CVLR control these calculations:
pred0 = (expy<0.5);
pred1 = (expy>=0.5);
By changing the 0.5 value in both lines, the user may control the threshold above which the model's prediction is interpreted as a "success" (i.e., the binary outcome "1").
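For instance, to classify a prediction as a "success" whenever the fitted probability is at least 0.3 (an arbitrary value chosen for illustration), the two lines would be edited to read:

```sas
pred0 = (expy<0.3);    /* predicted outcome 0: fitted probability below 0.3 */
pred1 = (expy>=0.3);   /* predicted outcome 1: fitted probability 0.3 or more */
```

The same value must appear in both lines so that every observation is assigned to exactly one predicted class.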
Lastly, the macro accommodates only binary response values, always coded as "0" or "1". Also, an intercept term is forced into the model.
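If the response is stored in some other form, it can be recoded before CVLR is called. The sketch below assumes a hypothetical dataset trials_raw with a character variable fate taking the values 'alive' and 'dead'. Records with any other value are dropped, which is a choice made for this illustration and not a behavior of the macro.

```sas
data trials01;
    set trials_raw;
    if fate = 'alive' then survived = 1;
    else if fate = 'dead' then survived = 0;
    else delete;           /* drop records with unrecognized outcomes */
run;

%cvlr(data=trials01, xvars=p treatment z2, outcome=survived)
```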
Setting up and invoking the macro
The macro code can be directly pasted into any SAS program, but a more efficient alternative is to save the macro in a directory and invoke it through a SAS %include statement at the top of the application program, e.g.,
%include 'mydirectory_and_filename';
... other SAS programming statements
%cvlr(data=indata,xvars=a b c,outcome=y)
where mydirectory_and_filename is the full path and file name of the CVLR program. But perhaps the best way to set up the macro is through the macro autocall library. The user places the CVLR macro in a directory, perhaps along with other macros. Then the user issues the statement
options mautosource sasautos='autocall_location';
where autocall_location is the full path name of the directory where CVLR is stored. This statement can be placed in the AUTOEXEC.SAS file so that it is executed every time SAS starts.
References
Davison, A. C., and D. V. Hinkley. 1997. Bootstrap methods and their application. Cambridge Univ. Press, Cambridge, UK.
Picard, R. R., and R. D. Cook. 1984. Cross-validation of regression models. J. Am. Stat. Assoc. 79:575-583.
Rawlings, J. O. 1988. Applied regression analysis. Wadsworth and Brooks/Cole, Pacific Grove, Cal.
Clint Moore
Georgia Cooperative Fish and Wildlife Research Unit
Warnell School of Forest Resources
University of Georgia
Athens, GA 30602 USA
(706) 542-3900
cmoore@smokey.forestry.uga.edu