I have a monthly excel spreadsheet with the following:
Category | Description |
---|---|
A | free text in paragraph form |
B | free text in paragraph form |
C | free text in paragraph form |
B | free text in paragraph form |
B | free text in paragraph form |
A | free text in paragraph form |
I would like to add a third column that adds tags or keywords from a predetermined list that searches the free text and then pre-populates it based on whether one or more of the terms is found there or not.
So for example a list of tags could be price, distance, availability, location, and so on with the Keywords or Tags column populated based on the free text in the second column as below
Category | Description | Keywords or Tags |
---|---|---|
A | Really doesn't like the price and location is too far | price, location |
B | The distance is an issue and not too much availability | Distance, availability |
C | Location is close so I like the convenience | location, convenience |
B | The distance is near and there is a lot of availability | availability, distance |
As shown above, the tags would be separated by commas.
The issue is that the list of predetermined keywords is large (around 20 to 30 tags).
My Questions:
What would be the most efficient way to create this list without removing any tags?
Also, is there a way to do this in RStudio?
CodePudding user response:
A simple solution, which uses Excel formulas and avoids any external dependencies:
- use the
=SEARCH()
function in excel to find tags, populating a column for each keyword or tag - Use
=TEXTJOIN()
to aggregate all tags
Example:
A | B | C | D | E | F | G | |
---|---|---|---|---|---|---|---|
1 | Category | Description | Tags (all) | price | location | Distance | availability |
2 | A | Really doesn't like the price and location is too far | '=TEXTJOIN(", ", TRUE, D2:XX2) | '=IF(ISERROR(SEARCH(D$1,$B2)), "", D$1) | ------ | ------ | ----> |
Output:
A | B | C | D | E | F | G | |
---|---|---|---|---|---|---|---|
1 | Category | Description | Tags (all) | price | location | Distance | availability |
2 | A | Really doesn't like the price and location is too far | price, location | price | location |
CodePudding user response:
We can use regular expressions here to extract the keywords from the strings.
If we put the keywords in a vector keywords
, we can use the str_extract_all
from the stringr
package to extract all matching words in the string. I've made it into a simple function which we apply to the Description
column of your data.frame, inserting the results into a new variable Keys
library(stringr)
get_tags <- function(str, tags) {
res = str_extract_all(str,
regex(tags, ignore_case = T), # Search case insensitive
simplify = T)[,1] # Get result as vector, not matrix
return(res[nchar(res) > 0]) # Drop empty strings from non-matched keywords
}
df$Keys <- sapply(df$Description,
function(x) paste0(get_tags(x, keywords),
collapse=', ')) # Collapse matches w/ commas
df
Category Description Keys
1 A Really doesn't like the price and location is too far price, location
2 B The distance is an issue and not too much availability distance, availability
3 C Location is close so I like the convenience Location, convenience
4 D The distance is near and there is a lot of availability distance, availability
Since you want the matches to be case insensitive, putting the regex pattern (tags
) in the regex
function allows us to specify that it should ignore case.