Removing nested duplicates

Time:10-07

After many trials, I managed to transform my nested list of results into a dataframe. The problem is that this dataframe contains nested duplicates, and no matter which code I try, I can't get rid of them.

Here is the head of the dataframe:

> df[1:12]
    TuteeID Tutee_Type Tutee_Syll_Cons
 1:    G313          A       0.7020889
 2:    G313          A       0.7573333
 3:    G313          A       0.7731556
 4:    G313          C       0.7020889
 5:    G313          C       0.7573333
 6:    G313          C       0.7731556
 7:    G313          D       0.7020889
 8:    G313          D       0.7573333
 9:    G313          D       0.7731556
10:    G315          B       0.7762000
11:    G315          B       0.8324222
12:    G315          B       0.8560222

To explain with an example: the individual G313 has the types A, C and D, with one consistency value for each type. But in my dataframe, every consistency value is paired with every type. I need something like this:

> df2
  TuteeID Tutee_Type Tutee_Syll_cons
1    G313          A       0.7020889
2    G313          C       0.7573333
3    G313          D       0.7731556
4    G315          B       0.7762000

Because of this nesting (I guess), nothing has worked so far. I tried unique, distinct, duplicated, subset, group_by and slice ... I also tried building this dataframe by joining 2 smaller dataframes, one with TuteeID and type and one with TuteeID and consistency value. Neither of these smaller dataframes had duplicates, but the joined dataframe has the same problem.

Do you have a solution?

CodePudding user response:

I agree with @Robin's suggestion that it is better to solve this upstream rather than fixing it later.

However, if you received the data in this format, or you cannot change the earlier steps, here is a way to keep only the required rows.

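library(dplyr)

df %>%
  group_by(TuteeID) %>%
  # position of each type among the individual's types (1 = first type, 2 = second, ...)
  mutate(index = match(Tutee_Type, unique(Tutee_Type))) %>%
  group_by(index, .add = TRUE) %>%
  # within each (TuteeID, type) group, keep the row whose position equals the type's index
  slice(first(index)) %>%
  # ungroup before dropping index, because select() always keeps grouping columns
  ungroup() %>%
  select(-index)

#  TuteeID Tutee_Type Tutee_Syll_Cons
#  <chr>   <chr>                <dbl>
#1 G313    A                    0.702
#2 G313    C                    0.757
#3 G313    D                    0.773
#4 G315    B                    0.776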
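
For anyone who wants to check the idea without dplyr, here is a minimal base R sketch of the same "k-th value goes with the k-th type" pairing. The data frame is retyped from the 12 rows shown in the question, and the helper names type_pos and row_pos are only illustrative. It assumes, as the dplyr version does, that the values are repeated in the same order for every type.

```r
# Sample data retyped from the first 12 rows shown in the question
df <- data.frame(
  TuteeID = c(rep("G313", 9), rep("G315", 3)),
  Tutee_Type = c(rep(c("A", "C", "D"), each = 3), rep("B", 3)),
  Tutee_Syll_Cons = c(rep(c(0.7020889, 0.7573333, 0.7731556), times = 3),
                      0.7762000, 0.8324222, 0.8560222)
)

# Position of each row's type among that individual's types
# (for G313: A = 1, C = 2, D = 3)
type_pos <- ave(seq_len(nrow(df)), df$TuteeID,
                FUN = function(i) match(df$Tutee_Type[i], unique(df$Tutee_Type[i])))

# Running row counter within each (TuteeID, Tutee_Type) block
row_pos <- ave(seq_len(nrow(df)), df$TuteeID, df$Tutee_Type, FUN = seq_along)

# Keep the k-th row of the k-th type
df2 <- df[type_pos == row_pos, ]
rownames(df2) <- NULL
df2
```

On these 12 rows this keeps rows 1, 5, 9 and 10, which is the df2 asked for in the question.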

CodePudding user response:

Ok, so here is how I got this dataframe:

I had matrices of consistency scores, one matrix for each type of each individual. With nested loops, I calculated the mean consistency of each matrix, which gives me the consistency values. The loops saved the individual ID in a first list, the type in a second list, and the consistency value in a third list. The elements of the second and third lists do not all have the same length, because the individuals have different numbers of types.

I started with this. The list containing the 3 lists of data is called 'myresults'.

syll_cons <- do.call(cbind, myresults)
syll_cons2 <- as.data.frame(syll_cons)

> syll_cons2
  TuteeID Tutee_Type                        Tutee_Syll_cons
1    G313        ACD                 0.7020, 0.7573, 0.7731
2    G315        BCD                 0.7762, 0.8324, 0.8560
3    G322      ABCDE 0.7151, 0.8044, 0.6102, 0.7546, 0.7893
4    G323          C                                 0.5845
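
A possible shortcut at this stage, as a sketch only: if the three lists are parallel (one element per individual, with types and values in matching order), the long dataframe can be built directly and the separate / pivot / join steps are not needed. The 'myresults' below is a hypothetical stand-in with made-up element names, since the real structure is not shown; adjust the names to the actual list.

```r
# Hypothetical stand-in for 'myresults': three parallel lists,
# one element per individual (element names are assumptions)
myresults <- list(
  TuteeID         = list("G313", "G323"),
  Tutee_Type      = list(c("A", "C", "D"), "C"),
  Tutee_Syll_cons = list(c(0.7020, 0.7573, 0.7731), 0.5845)
)

# Number of types per individual; must match the number of values
n_types <- lengths(myresults$Tutee_Type)
stopifnot(n_types == lengths(myresults$Tutee_Syll_cons))

# Repeat each ID once per type, and flatten types and values in parallel
syll_long <- data.frame(
  TuteeID         = rep(unlist(myresults$TuteeID), times = n_types),
  Tutee_Type      = unlist(myresults$Tutee_Type),
  Tutee_Syll_cons = unlist(myresults$Tutee_Syll_cons)
)
syll_long
```

Because types and values are flattened together, each value stays attached to its own type and no duplicates are created.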

Then I used tidyr::separate to split the types and the consistency values into multiple columns. It gave me something like this (with more Cons columns):

> head(syll_cons3)
    ID T1   T2   T3   T4   T5              Cons1              Cons2  
1 G313  A    C    D <NA> <NA>  0.702088888888889  0.757333333333333  
2 G315  B    C    D <NA> <NA>             0.7762  0.832422222222222  
3 G322  A    B    C    D    E  0.715155555555556  0.804466666666667  
4 G323  C <NA> <NA> <NA> <NA>  0.584555555555556               <NA>               
5 G325  A    B    C    D    E  0.829177777777778  0.921266666666667  
6 G326  C    D <NA> <NA> <NA>  0.621666666666667  0.709533333333333               

Then I used pivot_longer to turn those multiple columns into rows. I created one dataframe for the types and one for the consistency values:

syllable_cons <- pivot_longer(syll_cons3, starts_with("Cons"), values_to = "Syll_cons")
syllable_cons <- syllable_cons[complete.cases(syllable_cons$Syll_cons), ]
syllable_cons2 <- pivot_longer(syll_cons3, starts_with("T"), values_to = "Tutee_Type")
syllable_cons2 <- syllable_cons2[complete.cases(syllable_cons2$Tutee_Type), ]

syllable_cons <- syllable_cons[,c(1,10)]
syllable_cons2 <- syllable_cons2[,c(1,10)]

> head(syllable_cons)
  ID    Syll_cons
1 G313      0.702
2 G313      0.757
3 G313      0.773
4 G315      0.776
5 G315      0.832
6 G315      0.856

> head(syllable_cons2)
  ID    Tutee_Type  
1 G313  A         
2 G313  C         
3 G313  D         
4 G315  B         
5 G315  C         
6 G315  D 

And finally I used full_join to merge the dataframes and get the dataframe I showed in my question. I hope it is clear enough.
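
This last step is probably where the duplicates come from: joining on ID alone pairs every type of an individual with every value of that individual. A minimal sketch of one way around it, assuming both dataframes list their rows in the same order within each ID, is to add a position column to each and join on ID plus position. The small dataframes are retyped from the heads shown above; the names types, cons and pos are only illustrative.

```r
library(dplyr)

# Small stand-ins retyped from head(syllable_cons2) and head(syllable_cons)
types <- data.frame(
  ID = c("G313", "G313", "G313", "G315", "G315", "G315"),
  Tutee_Type = c("A", "C", "D", "B", "C", "D")
)
cons <- data.frame(
  ID = c("G313", "G313", "G313", "G315", "G315", "G315"),
  Syll_cons = c(0.702, 0.757, 0.773, 0.776, 0.832, 0.856)
)

# Joining on ID alone crosses every type with every value: 3 x 3 rows per ID
# (recent dplyr versions warn about a many-to-many relationship here)
crossed <- full_join(types, cons, by = "ID")
nrow(crossed)

# Number the rows within each ID, then join on ID and position
types_pos <- types %>% group_by(ID) %>% mutate(pos = row_number()) %>% ungroup()
cons_pos  <- cons  %>% group_by(ID) %>% mutate(pos = row_number()) %>% ungroup()

paired <- full_join(types_pos, cons_pos, by = c("ID", "pos")) %>%
  select(-pos)
paired
```

With the position key, each type is matched to exactly one value, so the result has one row per type instead of one row per combination.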
