I have a large dataset with a column called "TYPE_DESCRIPTION" that describes the type of activity for each observation.
However, in the raw dataset the same activity may be repeated several times within a single "TYPE_DESCRIPTION" value. For example, one observation might contain "Walking, Walking, Walking, Walking" instead of just "Walking". How do I remove the repeats so that each activity appears only once in that column?
I have tried the distinct() function, but it treats "Walking, Walking, Walking, Walking" as a single unique value, whereas what I want is just "Walking".
This becomes a problem later, when I use mutate() to add a new column that groups the activities into higher-order categories. My code only matches "Walking", so it does not recognize the variants with repeated "Walking" and puts them in a different category from the one I need.
Thanks.
CodePudding user response:
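library(tidyverse)

df <- data.frame(TYPE_DESCRIPTION = c("Walking,Walking, Walking, Walking",
                                      "Running, Walking"))

df %>%
  mutate(row = row_number()) %>%                             # remember which observation each piece came from
  separate_rows(TYPE_DESCRIPTION, sep = ",") %>%             # one row per comma-separated activity
  mutate(TYPE_DESCRIPTION = str_trim(TYPE_DESCRIPTION)) %>%  # drop stray spaces around each activity
  distinct(row, TYPE_DESCRIPTION)                            # keep each activity once per observation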
Result
# A tibble: 3 × 2
  TYPE_DESCRIPTION   row
  <chr>            <int>
1 Walking              1
2 Running              2
3 Walking              2
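The pipeline above returns one row per distinct activity, so an observation with two activities becomes two rows. If one row per observation is preferred, a minimal sketch along the following lines might work. It assumes the activities are always separated by commas; the name `dedup` is only illustrative.

```r
library(dplyr)
library(stringr)
library(purrr)

df <- data.frame(TYPE_DESCRIPTION = c("Walking,Walking, Walking, Walking",
                                      "Running, Walking"))

# Assumption: activities are comma-separated, with optional surrounding spaces.
dedup <- df %>%
  mutate(TYPE_DESCRIPTION = str_split(TYPE_DESCRIPTION, ",") %>%   # list of pieces per row
           map_chr(~ toString(unique(str_trim(.x)))))              # trim, dedupe, re-join with ", "

dedup
```

Because the column is overwritten in place, a later mutate() that compares against "Walking" should see the cleaned value.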
CodePudding user response:
In base R (the \(x) lambda shorthand needs R 4.1 or later):
transform(df, uniq = sapply(strsplit(TYPE_DESCRIPTION, ', ?'), \(x) toString(unique(x))))
                   TYPE_DESCRIPTION             uniq
1 Walking,Walking, Walking, Walking          Walking
2                  Running, Walking Running, Walking
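Once the repeats are collapsed, the grouping step from the question can match on the cleaned value. The sketch below is only an illustration: the column name `ACTIVITY_GROUP` and the category labels are hypothetical, since the question does not list the actual categories.

```r
library(dplyr)

df <- data.frame(TYPE_DESCRIPTION = c("Walking,Walking, Walking, Walking",
                                      "Running, Walking"))

df_grouped <- df %>%
  mutate(
    # Split on a comma plus any following whitespace, then keep each activity once.
    uniq = sapply(strsplit(TYPE_DESCRIPTION, ",\\s*"),
                  function(x) toString(unique(x))),
    # Hypothetical higher-order categories; replace with the real mapping.
    ACTIVITY_GROUP = case_when(
      uniq == "Walking" ~ "Low intensity",
      TRUE              ~ "Other"
    )
  )

df_grouped
```

With this approach, "Walking, Walking, Walking, Walking" and "Walking" should fall into the same category, because both reduce to "Walking" before the comparison.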