Group a string variable/create categorical variable

I am struggling to find a smooth way in Stata to create a categorical variable from my string variable.

My string variable is called BRANSCH and contains values for different sectors (see picture)

I would like to group these such that for example:

BRANSCH_NEW = 1 if BRANSCH==any of the following string values in the list ("10.112 10.120 10.130 10.390 10.520 10.710 10.721 10.822 10.840 10.850 10.890 11.020 11.030 12.000")

I have tried to encode and use the inrange() function without success.

encode BRANSCH1 branschcode

replace bransch=1 if inrange(branschcode, 10,12)
replace bransch=2 if inrange(branschcode, 16,17.3)
replace bransch=3 if inrange(branschcode, 19,20.6)

I tried to Google whether it is possible to loop over a list of string values, but I haven't succeeded.

local livsmedel "10.112 10.120 10.130 10.390 10.520 10.710 10.721 10.822 10.840 10.850 10.890 11.020 11.030 12.000"
local livsmedel "`livsmedel'"

foreach c of local livsmedel {
        replace bransch=1 if BRANSCH1=="`c'"
    }

I would appreciate any help on how to best tackle this problem.

CodePudding user response：

encode replaces each distinct value in a string variable with an integer code and uses the string value as the label. So encode bransch turns every 10.112 into 1 with label "10.112", every 10.120 into 2 with label "10.120", and so forth. That means 10.112 is no longer within the range tested by inrange(branschcode, 10, 12), because its stored value is now 1 rather than 10.112.
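A minimal sketch of this behaviour on toy data may help. The variable name BRANSCH and the first two codes come from the question; "16.100" is a made-up code used only for illustration.

```stata
* Toy data: "16.100" is a hypothetical code, not taken from the question
clear
input str6 BRANSCH
"10.112"
"10.120"
"16.100"
end

* encode assigns integer codes to the distinct strings in sorted order
encode BRANSCH, generate(branschcode)

* The default listing shows the value labels (the original strings)
list BRANSCH branschcode

* nolabel shows the integers that are actually stored
list BRANSCH branschcode, nolabel
```

The second listing shows the stored integers, which is why a test such as inrange(branschcode, 10, 12) does not select the codes that look like 10 to 12.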

The approach you tried here would have worked better with destring than with encode; I suspect you expected encode to do what destring does. destring takes the string "10.112" and turns it into the number 10.112. However, "10.120" would be turned into 10.12, because as numbers 10.120 and 10.12 are identical. Since these are sector codes and not numbers, that is not a great solution either.
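The loss of the trailing zero can be seen in a small sketch. It assumes, purely for illustration, that both "10.120" and a hypothetical code "10.12" appear in the data.

```stata
* Toy data: "10.12" is a hypothetical code added to show the collision
clear
input str6 BRANSCH
"10.120"
"10.12"
end

* destring converts the strings to numbers in a new variable
destring BRANSCH, generate(branschnum)

* Both rows now hold the same numeric value
list BRANSCH branschnum
```

If the real codes never differ only by trailing zeros, destring followed by inrange() could still work, but that is a property of the data that would need checking first.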

Are all the codes in alphanumeric order? That is, are all codes between 10 and 12 one sector, all codes between 16 and 17.3 another sector, and so on, with no overlap between the high-level categories?

If there is no overlap, you can do this:

gen branschcode = .
replace branschcode = 1 if bransch >= "10" & bransch < "12"
replace branschcode = 2 if bransch >= "16" & bransch < "17.3"
...
...

Note that in the code above, code 11.999 ends up in branschcode = 1 but 12.000 does not, because this is an alphanumeric comparison and not a numeric one: the string "12.000" sorts after "12". You should be able to adjust the cutoffs for your codes; for example, an upper bound of < "13" would include 12.000.
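The string comparisons can be checked directly with display before choosing cutoffs. This is only a sketch; the code "9.100" is hypothetical and is included to show what happens when codes have different numbers of leading digits.

```stata
* 1: the second character "1" sorts before "2"
display "11.999" < "12"

* 0: "12" is a prefix of "12.000", so the longer string sorts after it
display "12.000" < "12"

* 1: raising the upper bound to "13" brings 12.000 into the range
display "12.000" < "13"

* 0: "9" sorts after "1", so a one-digit prefix would break the ordering
display "9.100" < "10"
```

The last line suggests the approach is only safe when every code has the same number of digits before the decimal point.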

CodePudding user response：

gen wanted = inlist(trim(BRANSCH), "10.112", "10.120", "10.130", "10.390", "10.520", "10.710", "10.721") | inlist(trim(BRANSCH), "10.822", "10.840", "10.850", "10.890", "11.020", "11.030", "12.000")

is one alternative. See https://www.stata.com/manuals/fnprogrammingfunctions.pdf for the limit on the number of string arguments that inlist() accepts, which is why the list is split across two calls.

And here is another:


gen wanted = 0 

foreach s in 10.112 10.120 10.130 10.390 10.520 10.710 10.721 10.822 10.840 10.850 10.890 11.020 11.030 12.000 { 
    replace wanted = 1 if trim(BRANSCH) == "`s'" 
}

Neither is especially direct, but nor is playing around with encode or destring.