Home > Software design >  R finding duplicate values
R finding duplicate values

Time:06-27

enter image description hereenter image description hereI have a table called Data with 10 columns and among them there are columns named: suitcases, columns, rows, objects.

In the suitcases column, the values go from 1 to 10.

In the columns column, the values go from 1 to 20

In the rows column, the values go from 1 to 20

The numbering of values in the objects column starts from 1 for each suitcas.

I tried the following method (a similar one appeared on the forum with a different question):

duplicates <- function(data, var)
{
  library(tidyverse)
  data |> 
    add_count(!sym(var)) |> 
    filter(n == 2) |> 
    select(-n)
}


for (x in suitcases) {
  duplicates(Data, objects)  
}

I want to get a new table in which there are only such rows in which the values for the objects column occur exactly twice and not more, taking into account the resetting of the numbering in the suitcases column and the values in the columns: columns and rows. Due to the re-numbering, repetitions may appear in subsequent suitcases (despite the same values in columns: columns and rows)

Unfortunately, I have no idea how to take into account the resetting numbering. Therefore, I am asking the forum for help and indulgence, if the question is not well-formed, I am new here.

Example_data

Example_output

structure(list(rows = c(6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 5L, 
    5L, 5L, 5L, 6L, 6L), columns = c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 
    3L, 6L, 6L, 6L, 6L, 6L, 3L, 3L), time.min = c(5L, 0L, 5L, 0L, 
    0L, 5L, 5L, 5L, 0L, 2L, 5L, 0L, 2L, 10L, 10L), status = c(38L, 
    66L, 57L, 38L, 57L, 20L, 20L, 3L, 58L, 58L, 14L, 14L, 5L, 5L, 
    27L), postion = c(38L, 17L, 6L, 7L, 31L, 31L, 32L, 21L, 2L, 67L, 
    1L, 31L, 6L, 35L, 37L), x = c(58L, 14L, 14L, 14L, 68L, 12L, 27L, 
    448L, 981L, 860L, 147L, 417L, 884L, 417L, 884L), y = c(216L, 
    212L, 483L, 520L, 234L, 515L, 521L, 795L, 93L, 668L, 75L, 787L, 
    310L, 827L, 144L), z = c(38L, 66L, 57L, 38L, 57L, 20L, 20L, 1L, 
    7L, 6L, 981L, 147L, 781L, 417L, 884L), suitcases = c(3L, 3L, 
    3L, 2L, 7L, 7L, 7L, 7L, 5L, 1L, 4L, 3L, 3L, 10L, 10L), objects = c(6L, 
    1L, 6L, 22L, 5L, 14L, 27L, 14L, 1L, 14L, 1L, 26L, 5L, 4L, 4L)), class = "data.frame", row.names = c(NA, 
    -15L))

CodePudding user response:

You can approach this though grouping and filtering, just note that your expected output is bit unclear about group order (or resetting the numbering, as you put it), meaning that different order can provide the same result on provided sample, but on your real dataset you might expect something else :

library(dplyr)
Data %>% 
  group_by(rows,columns,suitcases,objects) %>%
  filter (n() == 2) %>%
  ungroup()

Result:

#> # A tibble: 4 × 10
#>    rows columns time.min status postion     x     y     z suitcases objects
#>   <int>   <int>    <int>  <int>   <int> <int> <int> <int>     <int>   <int>
#> 1     6       3        5     38      38    58   216    38         3       6
#> 2     6       3        5     57       6    14   483    57         3       6
#> 3     6       3       10      5      35   417   827   417        10       4
#> 4     6       3       10     27      37   884   144   884        10       4
  • Related