Home > Enterprise >  clean the dataframe
clean the dataframe

Time:10-18

I have a dataset looking like this

temp <- structure(list(Date = c("23/06/2002", "24/06/2002", "25/06/2002", 
"25/06/2002", "26/06/2002", "02/07/2002", "03/07/2002", "24/07/2002", 
"15/07/2002", "17/07/2002", "22/07/2002"), payment = c(-1000, 
1000, -1000, -1000, 1000, -1000, -1000, -1000, 1200, 1200, 200
), Code = c("M567", "M567", "M567", "M567", "XYZ", "M567", 
"ABX", "M567", "M567", "M567", "M300"), ID = c("187", "98", 
"187", "187", "12ee", NA, NA, NA, "111", "111", "11")), class = c("data.table", 
"data.frame"), row.names = c(NA, -11L), groups = structure(list(
    assignment = c("ABX", "M300", "M567", "XYZ"), .rows = structure(list(
        7L, 11L, c(1L, 2L, 3L, 4L, 6L, 8L, 9L, 10L), 5L), ptype = integer(0), class = 
   c("vctrs_list_of", 
    "vctrs_vctr", "list"))), row.names = c(NA, -4L), class = c("tbl_df", 
"tbl", "data.frame"), .drop = TRUE), .internal.selfref = <pointer: 0x0000020274961ef0>)

Input:

          Date payment      Code ID
 1: 23/06/2002  -1000       M567  187
 2: 24/06/2002   1000       M567   98
 3: 25/06/2002  -1000       M567  187
 4: 25/06/2002  -1000       M567  187
 5: 26/06/2002   1000        XYZ 12ee
 6: 02/07/2002  -1000       M567 <NA>
 7: 03/07/2002  -1000        ABX <NA>
 8: 24/07/2002  -1000       M567 <NA>
 9: 15/07/2002   1200       M567  111
10: 17/07/2002   1200       M567  111
11: 22/07/2002   200       M300   11

I want to delete row that would give sum = 0 for a same payment and Code Column (No need to consider ID Column)

For Example: 1000 and M567 should cancel out with any -1000 and M567 remaining rows should be the same

  • Its just that I have to cancel the pair of ve and -ve from the dataset

Expected Output:

          Date payment      Code ID
 1: 25/06/2002  -1000       M567  187
 2: 25/06/2002  -1000       M567  187
 3: 26/06/2002   1000        XYZ 12ee
 4: 02/07/2002  -1000       M567 <NA>
 5: 03/07/2002  -1000        ABX <NA>
 6: 24/07/2002  -1000       M567 <NA>
 7: 15/07/2002   1200       M567  111
 8: 17/07/2002   1200       M567  111
 9: 22/07/2002   200       M300   11

CodePudding user response:

Here is one possible way:

library(dplyr)
temp %>% 
  group_by(group =  as.integer(gl(n(), 2,n()))) %>% 
  mutate(x = sum(payment)) %>% 
  filter(!(x == 0 & first(Code) == last(Code))) %>% 
  ungroup() %>% 
  select(-x, -group)
  Date       payment Code  ID   
  <chr>        <dbl> <chr> <chr>
1 25/06/2002   -1000 M567  187  
2 25/06/2002   -1000 M567  187  
3 26/06/2002    1000 XYZ   12ee 
4 02/07/2002   -1000 M567  NA   
5 03/07/2002   -1000 ABX   NA   
6 24/07/2002   -1000 M567  NA   
7 15/07/2002    1200 M567  111  
8 17/07/2002    1200 M567  111  
9 22/07/2002     200 M300  11   

CodePudding user response:

You can figure out the codes for which the sum of payments is zero, put them in a vector and remove all rows from the original data.frame that contain this code:

totalpay = aggregate(temp$payment ~ temp$Code, FUN=sum)
zeropay = totalpay[totalpay[,2]==0,1]
temp = temp[!temp$Code %in% zeropay,]

As a bonus, you can retain the vector zeropay as a documentation of removed codes.

  • Related