I want to recode my variable Ucod in Stata, which has more than 100,000 observations, into 3-4 categories stored in a new variable.
The problem is that I don't want to list every value of Ucod in the recode. Instead, I want to use an if condition: for example, if a value of Ucod starts with I (e.g., I234, I345, I587), recode it to CVD.
I have tried the strpos() function with different conditions, but without success.
Attaching picture of my data and variable Ucod
CodePudding user response:
You could just use gen and a series of replace commands:
gen ucod_category = 0 if ucod >= "I00" & ucod <= "I519"
replace ucod_category = 1 if ucod >= "I60" & ucod <= "I698"
Then label these categories as CVD, Stroke, etc. Because these are string comparisons, this should sort in the expected way for your ICD-10 codes with missing decimal points (e.g. "I519" < "I60").
However, it might be more convenient to convert ucod into a number (with the leading letter mapped to 0 for A, 1 for B, etc.) so that you can recode it with labels in a single command:
gen ucod_numeric = (ascii(substr(ucod, 1, 1)) - 65) * 100 + real(substr(ucod, 2, .)) / cond(strlen(ucod) == 4, 10, 1)
recode ucod_numeric (800/851.9=0 "CVD") (860/869.8=1 "Stroke"), generate(ucod_category)
Again, this should sort in the expected order: I519 (which becomes 851.9) < I60 (860).
EDIT: since ascii isn't working (possibly a Stata version issue), you can try something like this to convert the letter to a number:
gen ucod_letter_code = -1
forvalues i = 0/25 {
    replace ucod_letter_code = `i' if substr(ucod, 1, 1) == char(`i' + 65)
}
gen ucod_numeric = ucod_letter_code * 100 + real(substr(ucod, 2, .)) / cond(strlen(ucod) == 4, 10, 1)
recode ucod_numeric (800/851.9=0 "CVD") (860/869.8=1 "Stroke"), generate(ucod_category)
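If all you need is the simpler rule from the question (every code starting with I goes into one category), a test on the first character may be enough. The following is only a minimal sketch: the toy data, the variable name ucod_group and the "Other" fallback are assumptions for illustration, not taken from your dataset.

```stata
* Toy data standing in for the real ucod variable (values are illustrative)
clear
input str4 ucod
"I234"
"I345"
"I587"
"C509"
"J189"
end

* Start everything in a fallback category, then overwrite codes beginning with "I"
gen str5 ucod_group = "Other"
replace ucod_group = "CVD" if substr(ucod, 1, 1) == "I"

tabulate ucod_group
```

The same condition can be written as strpos(ucod, "I") == 1, since strpos() returns the position of the first match. Further categories would follow the same pattern with additional replace lines.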