Home > OS >  Filter dataframe in R
Filter dataframe in R

Time:05-20

I'm trying to see the change of occupations in the industry over time, so I have the occupations IDs like an engineer, teacher, and lawyer, and several IDs that correspond to the sector like construction, education, mineral extraction, fishing... I would like to extract, from each occupation -and the largest and smallest change. A sample of the data follows. In this example, I would like to extract the top 3 largest positive changes, and the top 3 largest negative changes Could you guys help me?

x <- data.frame("occ_id" = c(1010, 1010, 1010, 1010, 1010, 1010, 1010,1234,1234,1234,1234, 4321, 4321,4321,4321,4321),
                "Ind_id" = c(52418,52417,28339,27138,31224,33103,1112,27138,31224,1112,52418,33103,31224,1112,52417,26301),
                "Change_occ_2000_2022" = c(1, -5 , 8 ,9 , - 11 ,15 ,16 ,-50,10,30,-5,20,10,50,30,-50))

then I tried this

x %>% 
  count(Change_occ_2000_2022) %>%
  arrange(Change_occ_2000_2022) %>%
  slice(c(head(row_number(), 3), tail(row_number(), 3)))

But doing this I could not capture the pair occ-ind that this change belongs. I would like to appear the pair to these changes like this:

x <- data.frame("occ_id" = c(4321, 4321, 1234, 1234, 4321, 1010),
                    "Ind_id" = c(1112,52417,1112,27138,26301, 31224 ),
                    "Change_occ_2000_2022" = c(50,30,30, -50, -50, -11))

CodePudding user response:

 library(dplyr) 
 x %>% 
    group_by(occ_id) %>% 
    arrange(-Change_occ_2000_2022) %>% 
    ungroup() %>% 
    slice(c(head(row_number(), 3), tail(row_number(), 3)))

Output:

occ_id Ind_id Change_occ_2000_2022
   <dbl>  <dbl>                <dbl>
1   4321   1112                   50
2   1234   1112                   30
3   4321  52417                   30
4   1010  31224                  -11
5   1234  27138                  -50
6   4321  26301                  -50

CodePudding user response:

library(dplyr)

x %>% 
  arrange(desc(Change_occ_2000_2022)) %>% 
  slice(c(1:3, (nrow(.) - 2):nrow(.)))

Output

  occ_id Ind_id Change_occ_2000_2022
1   4321   1112                   50
2   1234   1112                   30
3   4321  52417                   30
4   1010  31224                  -11
5   1234  27138                  -50
6   4321  26301                  -50

CodePudding user response:

based on the solution LMc

df <- data.frame("occ_id" = c(1010, 1010, 1010, 1010, 1010, 1010, 1010,1234,1234,1234,1234, 4321, 4321,4321,4321,4321),
                "Ind_id" = c(52418,52417,28339,27138,31224,33103,1112,27138,31224,1112,52418,33103,31224,1112,52417,26301),
                "Change_occ_2000_2022" = c(1, -5 , 8 ,9 , - 11 ,15 ,16 ,-50,10,30,-5,20,10,50,30,-50))

library(data.table)
setDT(df)[order(Change_occ_2000_2022), .SD[c(1:3, (.N-2):.N)]]
#>    occ_id Ind_id Change_occ_2000_2022
#> 1:   1234  27138                  -50
#> 2:   4321  26301                  -50
#> 3:   1010  31224                  -11
#> 4:   1234   1112                   30
#> 5:   4321  52417                   30
#> 6:   4321   1112                   50

Created on 2022-05-19 by the reprex package (v2.0.1)

or

setDT(df)[frankv(Change_occ_2000_2022, ties.method = "dense") <= 2 |
            frankv(-Change_occ_2000_2022, ties.method = "dense") <= 2][order(Change_occ_2000_2022)]

if you need to take into account duplicate values

CodePudding user response:

x<-x %>%
  arrange(Change_occ_2000_2022)
x<-rbind(head(x, 3), tail(x, 3))
  • Related