Creating new variables from word list and assigning 1 or 0 if the word appears in the string of a se

Time:10-20

I have a variable in which each observation holds free-text notes entered by a person. Some of the words in any given observation may be key words that need to be tracked.

Given a list of key words, is there a streamlined way to create one variable per key word and then search the existing observations to flag whether each word appears? Because the notes are typed by hand, the words can't be counted on to appear in a particular order, delimiters such as spaces may be missing, and letters may be upper or lower case. A word like "flights" might also be missing its final "s". Since the key words may change, is there also a way to store them as a value that can be updated, so the code can simply be rerun to refresh the variables?

In the df below, the key words I'm looking for are abc, xyz, and flights.

df <- read.table(text =
                   "ID Notes
ID-0001   'ABC project xyz'
ID-0002   'XYZ'
ID-0003   'ABCschedule flightsok test'
ID-0004   'flight, abc' 
ID-0005   'normal notes no key'", header = TRUE)

The desired output would look like this:

desired.output <- read.table(text =
                               "ID Notes abc xyz flights
ID-0001   'ABC project xyz'  1  1  0  
ID-0002   'XYZ' 0  1  0
ID-0003   'ABCschedule flightsok test'  1  0  1
ID-0004   'flight, abc' 1  0  1
ID-0005   'normal notes no key'  0  0  0", header = TRUE)

I found a similar question, "R: Splitting a string to different variables and assign 1 if string contains this word", but it wasn't quite what I was looking for, because there the variable names are created from every word in an observation.

Thank you for the help!

Answer:

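We may use grepl for this, with ignore.case = TRUE to handle upper/lower case and an optional "s" (flights?) to catch the singular form: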

# grepl() returns TRUE/FALSE; the unary + converts the result to 1/0
transform(df, abc =  +(grepl('\\babc', Notes, ignore.case = TRUE)), 
     xyz =  +(grepl('\\bxyz\\b', Notes, ignore.case = TRUE)), 
     flights =  +(grepl('\\bflights?', Notes, ignore.case = TRUE)))
       ID                      Notes abc xyz flights    
1 ID-0001            ABC project xyz   1   1       0
2 ID-0002                        XYZ   0   1       0
3 ID-0003 ABCschedule flightsok test   1   0       1
4 ID-0004                flight, abc   1   0       1
5 ID-0005        normal notes no key   0   0       0

Or just loop over the words of interest and use grepl:

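# unary + turns TRUE/FALSE into 1/0; the literal pattern 'flights' will not match 'flight' (use 'flights?' for that)
df[c('abc', 'xyz', 'flights')] <- +(sapply(c('abc', 'xyz', 'flights'), function(x) grepl(x, df$Notes, ignore.case = TRUE)))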
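
To keep the key words in a single value that can be edited and rerun, as the question asks, one possible sketch is to wrap the loop in a small helper. The names flag_keywords and text_col are hypothetical, not part of the original answer or of any package. The sketch assumes the key words are plain letters and digits (no regex metacharacters), treats a trailing "s" as optional, and matches substrings anywhere in the note, so a key word embedded in a longer word is also flagged.

```r
# Example data from the question.
df <- data.frame(
  ID    = c("ID-0001", "ID-0002", "ID-0003", "ID-0004", "ID-0005"),
  Notes = c("ABC project xyz", "XYZ", "ABCschedule flightsok test",
            "flight, abc", "normal notes no key")
)

# The key word list lives in one place; edit it and rerun to refresh the flags.
keywords <- c("abc", "xyz", "flights")

# Hypothetical helper: adds one 0/1 column per key word.
# Assumes key words contain no regex metacharacters.
flag_keywords <- function(data, keywords, text_col = "Notes") {
  for (kw in keywords) {
    # Make a trailing "s" optional, e.g. "flights" becomes "flights?".
    pattern <- sub("s$", "s?", kw)
    # Case-insensitive substring match, converted from TRUE/FALSE to 1/0.
    data[[kw]] <- as.integer(grepl(pattern, data[[text_col]], ignore.case = TRUE))
  }
  data
}

out <- flag_keywords(df, keywords)
print(out)
```

If substring matching turns out to be too permissive for a given key word, the pattern could be tightened with word boundaries (\\b) as in the transform() version above, at the cost of missing words that were typed without a delimiter before them.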