Keep unique elements of a list

Time:11-16

I have a data frame with 1.6 million rows, and one of the variables is a list.

Each entry of this variable looks as follows: c("A61K", "A61K", "A61K", "A61K", "A61K", "A61K", "A61K", "A61K", "A61K", "A61K", "A61Q", "B05B").

I would like for it to be c("A61K","A61Q","B05B").

In other words, I just want to keep the unique values, and this should be done for each row.

I have tried this:

sapply(strsplit(try, "|", fixed = TRUE), function(x) paste0(unique(x), collapse = ","))

I have also tried solutions using for loops, but they take very long and R stops responding.

CodePudding user response:

Use unique():

> string <- c("A61K", "A61K", "A61K", "A61K", "A61K", "A61K", "A61K", "A61K", "A61K", "A61K", "A61Q", "B05B")
> unique(string)
[1] "A61K" "A61Q" "B05B"

CodePudding user response:

I interpreted your question as saying you have a list column of character vectors, and want each vector in the column to have unique values within that vector. If that’s correct, you can handle it like this:

# example df with list column
dat <- data.frame(id = 1:2)
dat$x <- list(
  c("A61K", "A61K", "A61K", "A61K", "A61K", "A61K", "A61K", "A61K", "A61K", "A61K", "A61Q", "B05B"),
  c("A62K", "A61K", "A61K", "A58J", "A61K", "A61K", "A61K", "A61K", "A61K", "A61K", "A61Q", "C97B")
)

dat
  id                                                                      x
1  1 A61K, A61K, A61K, A61K, A61K, A61K, A61K, A61K, A61K, A61K, A61Q, B05B
2  2 A62K, A61K, A61K, A58J, A61K, A61K, A61K, A61K, A61K, A61K, A61Q, C97B

# remove duplicates within list column by row
dat$x <- lapply(dat$x, unique)

dat
  id                            x
1  1             A61K, A61Q, B05B
2  2 A62K, A61K, A58J, A61Q, C97B
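
If you also want the comma-separated string that the attempt in the question was building, the following is a minimal sketch of one way to get it. It assumes the same list-column layout as above (with shortened vectors), and the column name x_str is hypothetical, chosen only for illustration.

```r
# Example data frame with a list column (shortened version of the data above)
dat <- data.frame(id = 1:2)
dat$x <- list(
  c("A61K", "A61K", "A61Q", "B05B"),
  c("A62K", "A61K", "A61K", "A58J")
)

# Deduplicate each vector; unique() keeps the first occurrence of each value
dat$x <- lapply(dat$x, unique)

# Optionally collapse each deduplicated vector into a single string.
# x_str is a hypothetical column name; vapply() guarantees one string per row.
dat$x_str <- vapply(dat$x, paste, character(1), collapse = ",")

print(dat$x_str)
```

vapply() is used instead of sapply() here so that the result is always a character vector with one element per row, even if some vectors are empty.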

CodePudding user response:

To filter the rows of a data frame, use duplicated().

If this is your data:

df
    str data
1  A61K    1
2  A61K   23
3  A61K    4
4  A61K    3
5  A61K    1
6  A61K   23
7  A61K    4
8  A61K    3
9  A61K    1
10 A61K   23
11 A61Q    4
12 B05B    3

apply the filter using the desired column:

df[!duplicated(df$str), ]
    str data
1  A61K    1
11 A61Q    4
12 B05B    3
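
For a self-contained version of this example, here is a sketch that rebuilds the data frame shown above and applies the same filter. The names keep and result are only illustrative. Note that duplicated() marks every repeat after the first occurrence, so this drops whole rows rather than deduplicating values inside a list column.

```r
# Rebuild the example data frame shown above
df <- data.frame(
  str  = c(rep("A61K", 10), "A61Q", "B05B"),
  data = c(1, 23, 4, 3, 1, 23, 4, 3, 1, 23, 4, 3),
  stringsAsFactors = FALSE
)

# duplicated() is TRUE for every repeat of an earlier value,
# so negating it keeps only the first row for each distinct str
keep <- !duplicated(df$str)

# Subset the rows; the original row names (1, 11, 12) are preserved
result <- df[keep, ]
print(result)
```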