Substring extraction with multiple options-CodePudding

I have a variable in Stata in my dataset that looks like this:

city
Washington city
Boston city
El Paso city
Nashville-Davidson metropolitan government (balance)
Lexington-Fayette urban county

And I want it to look like:

city
Washington
Boston
El Paso
Nashville-Davidson
Lexington-Fayette

"city," "county," and "urban county" are the only three words that follow after a city name. In other words, I want to extract the substring from left to the space before either city, county, or urban.

The only way I can think of approaching this using subinstring:

replace city = subinstr(city, " city", "", .)

I don't think, however, that I can add multiple options here.

CodePudding user response：

I used subinstr to replace the desired words with empty strings, and trim to remove additional spaces.

input str60(city)

"Washington city"
"Boston city"
"El Paso city"
"Lexington-Fayette urban county"
"Audacity"

end

gen     wanted = subinstr(city,"urban county","",1)
replace wanted = subinstr(wanted," county","",1)
replace wanted = subinstr(wanted," city","",1)

replace wanted = trim(wanted)

list

      ---------------------------------------------------- 
     |                           city              wanted |
     |----------------------------------------------------|
  1. |                Washington city          Washington |
  2. |                    Boston city              Boston |
  3. |                   El Paso city             El Paso |
  4. | Lexington-Fayette urban county   Lexington-Fayette |
  5. |                       Audacity            Audacity |
      ----------------------------------------------------

Edit: As suggested, I have incorporated a leading space so that places with "city" in their name (e.g. Audacity) are not inadvertently replaced. The same for "county" (although this seems less likely).

CodePudding user response：

split could be a way.

split city, parse(" city" " urban" " county") limit(1)