Filter rows that have 2 or more instances of certain values in R-CodePudding

I am having trouble making a frequency table where the data is in multiple columns.My dataset is structured like this:


ID  shape        shape2      shape3    shape4    shape5  
12  triangle     rhombus     circle     square   rectangle
68  hexagon      NA          NA         NA       NA
32  rectangle    octagon     square     NA       NA
18  square       triangle    NA         NA       NA      
27. circle.      NA.         NA.        NA.      NA
77. square.      trapezoid.  triangle.  NA.      NA

I want to get a frequency table of all of the different shapes, like this:


shape.     freq
triangle.   3
square.     4
circle.     2
hexagon.    1
octagon.    1
trapezoid.  1
rhombus.    1
rectangle   2

Then I really am only interested in 5 shapes, so I would like it to look like this:

shape.     freq
triangle.   3
square.     4
circle.     2
rectangle   2

Thank you!

CodePudding user response：

Say you have your data in dat, then:

library(data.table)
data.table(table(gsub("\\.", "", unlist(dat[,-1]))))[N>=2]

Note that I removed the . from some of the names. data.table isn't really necessary, I just find the syntax [N>=2] to be cleaner and more convenient than the base-r alternatives.

          V1 N
1:    circle 2
2:        NA 4
3: rectangle 2
4:    square 4
5:  triangle 3

CodePudding user response：

Does this work:

library(dplyr)
library(tidyr)

df %>% pivot_longer(-ID, values_to = 'shape') %>% count(shape, name = 'freq') %>% filter(shape %in% c('triangle','square','circle','rectangle'))

# A tibble: 4 x 2
  shape      freq
  <chr>     <int>
1 circle        2
2 rectangle     2
3 square        4
4 triangle      3

CodePudding user response：

Another solution, based also on data.table:

library(data.table)

dt <- data.table::data.table(
  ID = c(12L, 68L, 32L, 18L, 27L, 77L),
  shape = c("triangle","hexagon",
            "rectangle","square","circle","square"),
  shape2 = c("rhombus", NA, "octagon", "triangle", NA, "trapezoid"),
  shape3 = c("circle", NA, "square", NA, NA, "triangle"),
  shape4 = c("square", NA, NA, NA, NA, NA),
  shape5 = c("rectangle", NA, NA, NA, NA, NA)
)

melt(dt, id="ID", 2:6, na.rm = T, value.name = "shape"
     )[,.(freq = .N), by = "shape"][freq > 1]

#>        shape freq
#> 1:  triangle    3
#> 2: rectangle    2
#> 3:    square    4
#> 4:    circle    2