Monday, 16 March 2015

Basics of SAS Programming


  • Basic Rules:
    • All SAS programs consist of a sequence of statements organized into "steps". There are only two kinds of steps:
      • DATA step: reads or computes data and creates a dataset.
      • PROC step: runs a predefined procedure on a dataset.
        Note: Once a dataset has been created, it can be processed by any subsequent "DATA" or "PROC" step.
    • A SAS program can contain any number of "DATA" and "PROC" steps.
    • Most SAS statements start with a keyword (DATA, INPUT, PROC, etc.); assignment statements such as Z=X+Y; are the main exception.
    • All SAS statements end with a semicolon ";".
    • SAS statements can be entered in free format: they can begin in any column, several statements may share one line, and a statement may be split over several lines (as long as no word is split).
    • Uppercase and lowercase are equivalent, except inside quote marks (lang = 'c'; is not the same as lang = 'C';).
    • Naming Convention:
      1. at most 32 characters long (Note for v6.12 users: at most 8 characters)
      2. must begin with a letter (A-Z) or _ (i.e. underscore)
      3. cannot contain blanks or special symbols (e.g., &, %, $, #, etc.)
    • SAS Types:
      1. Character variables (the name is followed by $ in the INPUT statement)
      2. Numeric variables

Note: Missing data is represented by '.' for numeric variables and by ' ' (i.e. a space) for character variables.
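The rules above can be seen together in a short program. This is only a sketch: the dataset name DRIVES and the two data lines are made up for illustration.

```sas
* One DATA step builds a dataset and one PROC step prints it;
data drives;                    /* keywords may be typed in lowercase */
  input mfg $ seek;             /* $ marks MFG as a character variable */
  datalines;
NEC 7.3
SONY .
;
run;

PROC PRINT DATA=drives; RUN;    /* two statements sharing one line */
```

The second data line uses '.' for SEEK, so that observation is stored with a missing numeric value.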
  • General statements:
These are statements that do not belong to a particular DATA or PROC step. They have a global effect.
    • footnote statement
    • title statement
    • options statement
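As a rough illustration of these global statements, the sketch below sets a few listing options and a title and footnote before printing. The title and footnote text are invented, and the dataset CDROM is assumed to exist already.

```sas
options nodate pageno=1 linesize=80;   /* system options for the listing output */
title 'CD-ROM drive benchmarks';       /* printed at the top of each page */
footnote 'Example data only';          /* printed at the bottom of each page */

proc print data=cdrom;                 /* CDROM is assumed to exist */
run;

title;                                 /* a bare TITLE statement clears the title */
footnote;                              /* likewise for the footnote */
```

Because these statements are global, the title and footnote stay in effect for every later step until they are changed or cleared.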
  • DATA step statements:
SAS carries out all statements in the DATA step in order for each input observation.
    1. The "DATA" statement identifies the start of a DATA step and names the data structure to be created.
   DATA <dataset_name>;

    2. The "INFILE" statement identifies an external file from which data will be read.

   INFILE <file_name>;

    3. The "DATALINES" statement is used if you choose to embed data in your program instead of reading it from an external file. The data lines follow the statement and end at the first line that contains a semicolon.

   DATALINES;

    4. The "INPUT" statement defines variables, their type, and specifies how data is to be read:
    INPUT <variable_name [type] [position]>;


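         Examples:

    i)   INPUT MFG @@;
         (numeric MFG; the trailing @@ holds the data line so that several observations can be read from it)

    ii)  INPUT MFG $ TYPE $ SEEK TRANSFER;
         (list input: values are separated by blanks; $ marks character variables)

    iii) INPUT MFG $ 1-8 TYPE $ 11-12 SEEK 13-16 TRANSFER 17-19;
         (column input: each variable is read from the stated columns)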
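The trailing @@ is the least obvious of these forms, so here is a small sketch of it. The dataset name SEEKS and the values are assumptions chosen for illustration.

```sas
data seeks;
  input seek @@;     /* @@ holds the current line across iterations of the step */
  datalines;
7.3 23.1 40.1 13.5 5.5
;
run;

proc print data=seeks;
run;
```

Each value on the data line becomes its own observation, so SEEKS ends up with five observations of one variable. Without @@, only the first value on the line would be read.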

    5. Variables may be given descriptive labels:

   LABEL <variable_name='label'>...;

         Examples:

   i)  LABEL MFG='Manufacturer';

   ii) LABEL MFG='Manufacturer'

             SEEK='Seek Time';
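A sketch of LABEL in context follows. The label text for TRANSFER is an assumption, and PROC PRINT only shows labels when its LABEL option is given.

```sas
data cdrom;
  input mfg $ type $ seek transfer;
  label mfg      = 'Manufacturer'
        seek     = 'Seek Time'
        transfer = 'Transfer Rate';   /* label text here is illustrative */
  datalines;
NEC 12X 7.3 105
SONY 6X 23.1 830
;
run;

proc print data=cdrom label;   /* LABEL option: use labels as column headings */
run;
```

Labels are stored with the dataset, so later procedures can use them without repeating the LABEL statement.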

    6. Assignment statements create new variables. The usual arithmetic operations are available:

   Symbol Operation             Example

   **     Exponentiation        Z=X**2;

   *      Multiplication        Z=X*Y;

   /      Division              Z=X/Y;

   +      Addition              Z=X+Y;

   -      Subtraction           Z=X-Y;
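For instance, the operators can be combined in a DATA step that reads no input at all. The dataset name CALC and the values are made up for this sketch.

```sas
data calc;
  x = 3;
  y = 4;
  z = x**2 + y**2;    /* exponentiation is evaluated before addition, so z = 25 */
  ratio = x / y;      /* 0.75 */
  diff = x - y;       /* -1 */
run;

proc print data=calc;
run;
```

A DATA step without INPUT or other input statements executes once, so CALC holds a single observation.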


    7. The "DROP" and "KEEP" statements control which variables are written to the output dataset:

   DROP <variable_name>...;

removes named variables from the dataset and keeps unnamed variables.

   KEEP <variable_name>...;

keeps named variables and drops unnamed variables from the dataset.
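The two steps below sketch how DROP and KEEP can give the same result. They assume a dataset CDROM with the variables MFG, TYPE, SEEK and TRANSFER already exists, and they use the SET statement (not covered above) to read it. The output dataset names are hypothetical.

```sas
data timing1;
  set cdrom;                /* read the existing CDROM dataset */
  seekmin = seek / 60000;
  drop mfg type;            /* everything except MFG and TYPE is written */
run;

data timing2;
  set cdrom;
  seekmin = seek / 60000;
  keep seek transfer seekmin;   /* only the listed variables are written */
run;
```

Both datasets contain SEEK, TRANSFER and SEEKMIN. Dropped variables remain available for calculations inside the step; they are only left out of the output.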

    8. The "IF" statement is used for conditional processing:

   IF <expression> THEN <statement>;

   ELSE <statement>;

Note: The ELSE statement is optional. IF ... THEN forms a single statement, and ELSE is a separate statement that must immediately follow it. The two examples below give the same result:

   i)  IF SEEK < 15 THEN CLASS = 'FAST';

       ELSE CLASS = 'SLOW';

   ii) CLASS='SLOW';

       IF SEEK < 15 THEN CLASS = 'FAST';

SAS comparison operators are shown below. You can use either the symbol or the two-letter abbreviation.

   Symbol     Abbrev

   <, <=      LT, LE 

   >, >=      GT, GE 

   =, ^=      EQ, NE 

A special form of the "IF" statement is used for subsetting a dataset, that is, selecting or excluding particular observations.

   DATA CDROM;

   INPUT MFG $ TYPE $ SEEK TRANSFER;

   IF SEEK < 15;

Placed at the end of the step as it is here, the statement IF SEEK < 15; selects the same observations as either of:

i)  IF SEEK < 15 THEN OUTPUT;

ii) IF SEEK >= 15 THEN DELETE;
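A complete version of the subsetting step might look like the sketch below. The dataset name FAST is an assumption; the data lines are the CDROM values used in this post.

```sas
data fast;
  input mfg $ type $ seek transfer;
  if seek < 15;       /* observations that fail the test are not written */
  datalines;
NEC 12X 7.3 105
SONY 6X 23.1 830
SONY 4X 40.1 330
CANON 6X 13.5 530
SONY 12X 5.5 1000
;
run;
```

Only the NEC drive, the CANON drive and the last SONY drive have a seek time below 15, so FAST contains three observations.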

    9. Comments:
      There are two types of comments:

   i)      * ... ;

   ii)     /* ... */

  

   DATA CDROM;

   * Read in variables;

   INPUT MFG $ TYPE $ TRANSFER SEEK;

   /*  ignore next statement

   SEEKMIN = SEEK/60000;

   */


The example below reads CDROM data and creates additional variables:

  DATA CDROM;

  INPUT MFG $ TYPE $ SEEK TRANSFER;

  IF SEEK < 15 THEN CLASS='FAST';

  ELSE CLASS='SLOW';

  DROP MFG TYPE;

  DATALINES;

  NEC 12X 7.3 105

  SONY 6X 23.1 830

  SONY 4X 40.1 330

  CANON 6X 13.5 530

  SONY 12X 5.5 1000

  ;

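The IF and ELSE statements classify observations rather than remove them, so the resulting dataset contains all five observations, with MFG and TYPE dropped and CLASS added:

   7.3  105 FAST

  23.1  830 SLOW

  40.1  330 SLOW

  13.5  530 FAST

   5.5 1000 FAST

Replacing the IF and ELSE statements with the subsetting statement IF SEEK < 15; would instead keep only observations 1, 4 and 5.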

  • PROC step statements:

PROC steps run predefined procedures, which may be either statistical or utility procedures. The dataset processed is the most recently created one unless another is named in a "DATA=" option.

  PROC <procedure_name>;

  [procedure_statement];


"procedure_statement" typically depends on the procedure, but some statements may be used with many procedures:

   VAR <variable_name>;


Indicates which variables are to be analyzed. If this statement is omitted, the default is to include all variables of the appropriate type.

   BY <variable_name>;


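Required by the SORT procedure, where it names the variables to sort by. In most other procedures it is optional and requests a separate analysis for each BY group, which generally requires the data to be sorted by those variables first.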
You should be familiar with the following procedures:

    1. Correlations among a set of variables:

   PROC CORR [options];

     [VAR <variable_name>...;]

    2. Means, standard deviations, and other univariate statistics (N; MEAN; STD; MIN; MAX; SUM; VAR):

   PROC MEANS [options];

    3. Univariate statistics such as means, standard deviations and medians. Options are also available to produce a p-value for a normality test and to draw box, stem-and-leaf and normal plots:

   PROC UNIVARIATE [options];

     [VAR <variable_name>...;]

    4. Print a SAS data set:

   PROC PRINT [options];

     [VAR <variable_name>...;]

    5. Sort a SAS data set according to one or more variables:

   PROC SORT [options];

     BY <variable_name>...;

    6. Plot y against x. May be used to create a scatter plot or a residual plot:

   PROC PLOT [options];

     PLOT <dep_var_name>*<indep_var_name>='*' [options];
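The procedures listed above could be run together along the following lines. This is a sketch only: it assumes a dataset CDROM that still contains MFG, SEEK and TRANSFER, and the choice of options is illustrative rather than required.

```sas
proc sort data=cdrom;            /* sort first: BY groups need sorted data */
  by mfg;
run;

proc print data=cdrom;
  var mfg seek transfer;
run;

proc means data=cdrom n mean std min max;
  var seek transfer;
  by mfg;                        /* one set of statistics per manufacturer */
run;

proc corr data=cdrom;
  var seek transfer;
run;

proc univariate data=cdrom normal plot;   /* NORMAL: normality tests; PLOT: plots */
  var seek;
run;

proc plot data=cdrom;
  plot transfer*seek='*';        /* TRANSFER on the vertical axis, SEEK on the horizontal */
run;
```

Naming the dataset with DATA= in every step avoids relying on the most recently created dataset, which is easy to lose track of in longer programs.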
