How to extract strings between a set of symbols in R and store them in a vector

mystring <- "\n\n-Acanthosis nigricans\n-Hyperpigmentation\n-Hyperkeratosis\n-Skin fold regions\n-Neck\n-Groin\n-Axillae\n-Obesity \n-Drug-induced AN\n-Malignant AN"

I would like to extract the terms between \n- and \n and store them in a vector:

> mystring_extracted

 [1] "Acanthosis nigricans" "Hyperpigmentation"    "Hyperkeratosis"       "Skin fold regions"   
 [5] "Neck"                 "Groin"                "Axillae"              "Obesity"             
 [9] "Drug-induced AN"      "Malignant AN"

I tried the following, but it didn't do what I wanted:

> gsub("\n-", "", mystring)
[1] "\nAcanthosis nigricansHyperpigmentationHyperkeratosisSkin fold regionsNeckGroinAxillaeObesity Drug-induced ANMalignant AN"

CodePudding user response:

Use strsplit. It returns a list, which in this case has a single component that is almost the desired character vector, so use [[1]] to extract it and then drop the junk first element with [-1]. No packages are needed.

strsplit(mystring, "\n-")[[1]][-1]

giving:

 [1] "Acanthosis nigricans" "Hyperpigmentation"    "Hyperkeratosis"      
 [4] "Skin fold regions"    "Neck"                 "Groin"               
 [7] "Axillae"              "Obesity "             "Drug-induced AN"     
[10] "Malignant AN"
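
Note that the result above still contains "Obesity " with a trailing space, because that space is in the input, whereas the desired output in the question shows "Obesity". A minimal sketch of one way to tidy this up, assuming surrounding whitespace in every term is unwanted; fixed = TRUE is an optional addition here that treats the separator as a literal string rather than a regular expression:

```r
mystring <- "\n\n-Acanthosis nigricans\n-Hyperpigmentation\n-Hyperkeratosis\n-Skin fold regions\n-Neck\n-Groin\n-Axillae\n-Obesity \n-Drug-induced AN\n-Malignant AN"

# Split on the literal separator "\n-" and drop the junk first element
parts <- strsplit(mystring, "\n-", fixed = TRUE)[[1]][-1]

# Strip leading and trailing whitespace from each term ("Obesity " -> "Obesity")
mystring_extracted <- trimws(parts)

mystring_extracted
```

The hyphen inside "Drug-induced AN" is left alone because it is not preceded by a newline, so it never matches the separator.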

A variation is the following, which first strips the junk at the beginning with trimws, then performs the split, and finally calls unlist to get the character vector. The native |> pipe requires R 4.1 or later.

mystring |>
  trimws(whitespace = "[\n-]") |>
  strsplit("\n-") |>
  unlist()
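
For comparison, here is a sketch of the same steps written as nested calls, which avoids the native pipe, followed by a check that both approaches agree on this input. The variable names via_trim and via_drop are only illustrative, and fixed = TRUE is an optional addition:

```r
mystring <- "\n\n-Acanthosis nigricans\n-Hyperpigmentation\n-Hyperkeratosis\n-Skin fold regions\n-Neck\n-Groin\n-Axillae\n-Obesity \n-Drug-induced AN\n-Malignant AN"

# Variation: strip leading/trailing newlines and hyphens, then split and unlist
trimmed  <- trimws(mystring, whitespace = "[\n-]")
via_trim <- unlist(strsplit(trimmed, "\n-", fixed = TRUE))

# First approach: split, take the single list component, drop the junk first element
via_drop <- strsplit(mystring, "\n-", fixed = TRUE)[[1]][-1]

# Both give the same 10-element character vector for this input
identical(via_trim, via_drop)
```

One difference to keep in mind: trimws with this whitespace pattern would also strip newlines or hyphens from the end of the string, whereas dropping the first element only deals with the junk at the start.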