Is there an easier way to get the number of variables in an SPSS file than OMS or Python scripting?-CodePudding

I have been trying my brain and googling all of the place to figure out if there is an easy way to get the number of variables in a datafile purely using syntax. The reason for this is that I work with a lot of small files to which I need to merge new data. However, if you do this programatically, there is a chance due to wrongly types variable names, a new variable will be added to the dataset if it does not already exist.

Hence I want to know the number of variables before a merge and after a merge. For this I have tried to write a macro, but SPSS macro's are really terrible at arimetric and you cannot seem to feed these macro's with the keyword "ALL" (IE: all variables).

Additionally I have looked for syntax which exports info about the dataset to the output where the number of variables will be directly shown. As far as I could find, this did not exist.

So I have in fact managed to make it work via OMS (output management system), but the way is super convoluted and is a lot of lines of text per datafile. See below:

* Create datafile to check if merge went okay.
DATA LIST LIST /FileName(A50) N_PRE(F8) N_POST(F8).
DATASET NAME CheckList.

* Open basefile.
GET FILE='DATAFILE_BASE.sav'.
DATASET NAME Data WINDOW=FRONT.

* Set settings to show variable names instead of variable labels in output.
SET Small=0.0001 THREADS=AUTO TVars=Names OVars=Labels TNumbers=Labels ONumbers=Labels DIGITGROUPING=No LEADZERO=No ODISPLAY=tables.

* Use OMS to determine number of variables.
DATASET DECLARE COUNT.
OMS
  /SELECT TABLES
  /IF COMMANDS=['Frequencies'] SUBTYPES=['Frequencies']
  /DESTINATION FORMAT=SAV NUMBERED=TableNumber_
   OUTFILE='COUNT' VIEWER=YES
  /TAG='Frequencies'.
FREQUENCIES ALL.
OMSEND TAG = ['Frequencies'].

* Retain only variable names and remove duplicates.
DATASET ACTIVATE COUNT.
ADD FILES FILE *
    /KEEP Label_,.
EXECUTE.
SORT CASES BY Label_(A).
MATCH FILES
  /FILE=*
  /BY Label_
  /FIRST=PrimaryFirst
  /LAST=PrimaryLast.
DO IF (PrimaryFirst).
COMPUTE  MatchSequence=1-PrimaryLast.
ELSE.
COMPUTE  MatchSequence=MatchSequence 1.
END IF.
LEAVE  MatchSequence.
FORMATS  MatchSequence (f7).
COMPUTE  InDupGrp=MatchSequence>0.
SORT CASES InDupGrp(D).
MATCH FILES
  /FILE=*
  /DROP=PrimaryFirst InDupGrp MatchSequence.
SELECT IF (PrimaryLast=1).
EXECUTE.

* Count variables and use OMS again to determine max.
COMPUTE VarCount = $CASENUM.
DATASET DECLARE PRECOUNT.
OMS
  /SELECT TABLES
  /IF COMMANDS=['Descriptives'] SUBTYPES=['Descriptive Statistics']
  /DESTINATION FORMAT=SAV NUMBERED=TableNumber_
   OUTFILE='PRECOUNT' VIEWER=YES
  /TAG='Descriptives'.
DESCRIPTIVES VARIABLES=VarCount
  /STATISTICS=MAX.
OMSEND TAG = ['Descriptives'].
DATASET ACTIVATE PRECOUNT.

* Reduce to one line, cleanup and add identifiers.
SELECT IF ~SYSMIS(Maximum).
STRING FileName (A50).
COMPUTE FileName = 'FILENAME.SAV'.
RENAME VARIABLES (N = N_PRE).
ADD FILES FILE *
    /KEEP FileName N_PRE.
EXECUTE.
DATASET CLOSE COUNT.

* Merge data.
GET FILE='DATAFILE_NEW.sav'.
DATASET NAME MergeData WINDOW=FRONT.
DATASET ACTIVATE Data.
ADD FILES /FILE=*
  /FILE='MergeData'.
EXECUTE.
DATASET CLOSE MergeData.

* Do another OMS run to check post_N.
DATASET DECLARE COUNT.
OMS
  /SELECT TABLES
  /IF COMMANDS=['Frequencies'] SUBTYPES=['Frequencies']
  /DESTINATION FORMAT=SAV NUMBERED=TableNumber_
   OUTFILE='COUNT' VIEWER=YES
  /TAG='Frequencies'.
FREQUENCIES ALL.
OMSEND TAG = ['Frequencies'].

* Retain only variable names and remove duplicates.
DATASET ACTIVATE COUNT.
ADD FILES FILE *
    /KEEP Label_,.
EXECUTE.
SORT CASES BY Label_(A).
MATCH FILES
  /FILE=*
  /BY Label_
  /FIRST=PrimaryFirst
  /LAST=PrimaryLast.
DO IF (PrimaryFirst).
COMPUTE  MatchSequence=1-PrimaryLast.
ELSE.
COMPUTE  MatchSequence=MatchSequence 1.
END IF.
LEAVE  MatchSequence.
FORMATS  MatchSequence (f7).
COMPUTE  InDupGrp=MatchSequence>0.
SORT CASES InDupGrp(D).
MATCH FILES
  /FILE=*
  /DROP=PrimaryFirst InDupGrp MatchSequence.
SELECT IF (PrimaryLast=1).
EXECUTE.

* Count variables and use OMS again to determine max.
COMPUTE VarCount = $CASENUM.
DATASET DECLARE POSTCOUNT.
OMS
  /SELECT TABLES
  /IF COMMANDS=['Descriptives'] SUBTYPES=['Descriptive Statistics']
  /DESTINATION FORMAT=SAV NUMBERED=TableNumber_
   OUTFILE='POSTCOUNT' VIEWER=YES
  /TAG='Descriptives'.
DESCRIPTIVES VARIABLES=VarCount
  /STATISTICS=MAX.
OMSEND TAG = ['Descriptives'].
DATASET ACTIVATE POSTCOUNT.

* Reduce to one line, cleanup and add identifiers.
SELECT IF ~SYSMIS(Maximum).
STRING FileName (A50).
COMPUTE FileName = 'FILENAME.SAV'.
RENAME VARIABLES (N = N_POST).
ADD FILES FILE *
    /KEEP FileName N_POST.
EXECUTE.
DATASET CLOSE COUNT.
        
* Merge the post and precount and add to checklist.
DATASET ACTIVATE PRECOUNT.
MATCH FILES /FILE=*
  /TABLE='POSTCOUNT'
  /BY FileName.
EXECUTE.
DATASET ACTIVATE CheckList.
ADD FILES /FILE=*
  /FILE='PRECOUNT'.
EXECUTE.
DATASET CLOSE PRECOUNT.
DATASET CLOSE POSTCOUNT.

This does what I want. Get a pre-merge measurement, a post-merge measurement, links them together and adds them to a predifined checklist, which is processed at the end (simple post_n minus pre_n caluclation to show which files are off). But we're talking about 50-100 small datasets to be done this way, which is 150 lines-ish per dataset.

The syntax is generated through matlab with some specific inputs from a variable database, so writing it is not the problem. It's just a convoluted mess. Any ideas to streamline this code?

Regards.

EDIT: Thanks to @eli-k for supplying a much more elegant solution (the whole syntax previously took 4-5 minutes to run, this is much quicker, since it doesn't have to do a FREQUENCIES command on a big dataset every iteration).

I updated the macro a little bit to allow for some extra customization (and to allow a pre and post merge execution).

DEFINE !countVars (outputvar = !TOKENS(1) 
/datasetname = !TOKENS(1))

* Figure out number of variables from Dictionary.
DATASET DECLARE  tmp.
OMS /SELECT TABLES /IF COMMANDS=['File Information'] SUBTYPES=['Variable Information'] 
    /DESTINATION FORMAT=SAV OUTFILE='tmp' VIEWER=NO.
DISPLAY DICTIONARY.
OMSEND.
DATASET DECLARE !datasetname.
DATASET ACTIVATE tmp.
OMS
  /SELECT TABLES /IF COMMANDS=['Frequencies'] SUBTYPES=['Frequencies']
  /DESTINATION FORMAT=SAV OUTFILE=!datasetname VIEWER=NO
  /TAG='Frequencies'.
FREQ COMMAND_.
OMSEND TAG = ['Frequencies'].
DATASET ACTIVATE !datasetname.
DATASET CLOSE tmp.
RENAME VARIABLES (Frequency = !outputvar).
ADD FILES FILE * /KEEP !outputvar.
EXECUTE.
!ENDDEFINE.

!countVars outputvar=N_PRE datasetname=PRECOUNT.
STRING FileName (A50).
COMPUTE FileName = 'FILENAME'.
EXECUTE.

CodePudding user response：

Here's a way, using both OMS and a macro. The macro will use OMS in order to give you a count of variables in the active dataset.

define !countVars ()
dataset name orig.
DATASET DECLARE  tmp.
OMS /SELECT TABLES /IF COMMANDS=['File Information'] SUBTYPES=['Variable Information'] 
    /DESTINATION FORMAT=SAV OUTFILE='tmp' VIEWER=no.
display dictionary.
omsend.
dataset activate tmp.
freq Command_.
dataset activate orig.
dataset close tmp.
!enddefine.

Now that the macro is defined you can call it at any point in the syntax like this:

!countVars .

and see the number of variables in the active dataset under "frequency" in the output table.

CodePudding user response：

Just for the sake of completeness: Getting the variable count using SPSS's builtin python methods is pretty straightforward.

begin program.
import spss
print(spss.GetVariableCount())
end program.

This snippet simply prints the number of variables in the active dataset to the output window. Assigning the value to a variable periodically throughout your script would make it pretty easy to track the number of variables across multiple merges.