Is there an R function to remove repetition **within** an observation?

I have a large dataset that contains one column called "TYPE_DESCRIPTION" that describes the type of activity of each observation.

However, the raw dataset may repeat the same activity within a single "TYPE_DESCRIPTION" value. For example, one observation can contain "Walking, Walking, Walking, Walking" instead of just "Walking". How do I remove the repeated "Walking" so that the value appears only once?

I have tried the distinct() function, but it treats "Walking, Walking, Walking, Walking" as a single unique value, whereas what I want is just "Walking".

This becomes a problem later, when I add a new column with mutate() that groups the activities into higher-level categories and I write "Walking" in the code. Because the code only matches "Walking", it does not recognize the variants with repetitions, so those rows do not end up in the category I need.

Thanks.

CodePudding user response:

df <- data.frame(TYPE_DESCRIPTION = c("Walking,Walking, Walking, Walking", 
                                      "Running, Walking"))


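library(tidyverse)
df %>%
  mutate(row = row_number()) %>%                             # remember which observation each value came from
  separate_rows(TYPE_DESCRIPTION, sep = ",") %>%             # one row per comma-separated item
  mutate(TYPE_DESCRIPTION = str_trim(TYPE_DESCRIPTION)) %>%  # strip stray whitespace around each item
  distinct(row, TYPE_DESCRIPTION)                            # keep each activity once per observation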

Result

# A tibble: 3 × 2
  TYPE_DESCRIPTION   row
  <chr>            <int>
1 Walking              1
2 Running              2
3 Walking              2
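
The result above is in long format, with one row per activity. If one row per observation is needed instead (so that a later mutate() can match on "Walking"), the distinct values could be pasted back together. This is only a sketch building on the answer's pipeline: the group_by()/summarise() step and the name df_clean are additions for illustration, not part of the original answer.

```r
library(dplyr)
library(tidyr)
library(stringr)

df <- data.frame(TYPE_DESCRIPTION = c("Walking,Walking, Walking, Walking",
                                      "Running, Walking"))

# df_clean is a hypothetical name used only for this illustration
df_clean <- df %>%
  mutate(row = row_number()) %>%
  separate_rows(TYPE_DESCRIPTION, sep = ",") %>%
  mutate(TYPE_DESCRIPTION = str_trim(TYPE_DESCRIPTION)) %>%
  distinct(row, TYPE_DESCRIPTION) %>%
  # collapse the remaining distinct items back into one string per observation
  group_by(row) %>%
  summarise(TYPE_DESCRIPTION = str_c(TYPE_DESCRIPTION, collapse = ", "),
            .groups = "drop")

df_clean
```

With the sample data, the first observation should collapse to "Walking" and the second should stay "Running, Walking".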

CodePudding user response:

In base R:

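# \(x) is the anonymous-function shorthand available from R 4.1; use function(x) on older versions
transform(df, uniq = sapply(strsplit(TYPE_DESCRIPTION, ', ?'), \(x) toString(unique(x))))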

                   TYPE_DESCRIPTION             uniq
1 Walking,Walking, Walking, Walking          Walking
2                  Running, Walking Running, Walking
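
If the cleaning has to be applied in several places, the same base R idea could be wrapped in a small helper. The function name dedupe_items and its sep argument are hypothetical, introduced only for this sketch; it assumes items are separated by commas and that surrounding whitespace is not meaningful.

```r
# Hypothetical helper (not from the original answers): collapse repeated
# comma-separated items within each string, keeping first-seen order.
dedupe_items <- function(x, sep = ",") {
  x <- as.character(x)                       # also accept factor columns
  parts <- strsplit(x, sep, fixed = TRUE)    # split each string into its items
  out <- vapply(parts,
                function(p) toString(unique(trimws(p))),
                character(1),
                USE.NAMES = FALSE)
  out[is.na(x)] <- NA_character_             # keep missing values missing
  out
}

df <- data.frame(TYPE_DESCRIPTION = c("Walking,Walking, Walking, Walking",
                                      "Running, Walking"))
df$TYPE_DESCRIPTION <- dedupe_items(df$TYPE_DESCRIPTION)
df
```

Because the helper is vectorised over the column, it can also be called inside mutate(), for example mutate(TYPE_DESCRIPTION = dedupe_items(TYPE_DESCRIPTION)), before the grouping step described in the question.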