I am trying to clean a dataframe by deleting the wrongly added rows.
This is the dummy data :
temp <- structure(list(Date = c("24/06/2002", "24/06/2002", "25/06/2002","25/06/2002", "26/06/2002",
"02/07/2002","03/07/2002","24/07/2002", "08/07/2002",
"08/07/2002", "15/07/2002", "17/07/2002",
"22/07/2002", "22/07/2002", "28/07/2002", "29/07/2002"),
payment = c(200, 1000,-1000, -1000, 1000,
-1000,-1000,-1000, 1200,
-1200, 1200, 1200,
200, 56700, -56700, -200),
Code = c("ABC", "M567", "M567","M567", "XYZ", "M567", "ABX" ,
"M567","M567", "M567",
"M567", "M567", "M300",
"M678", "M678", "ABC"),
ID = c(NA, "98","187","187","12ee","M11","M13",
NA,"K999",
"K999", "111", "111", "11",
"12345", NA, NA)), row.names = c(NA, -16L), class = "data.frame")
The dataframe looks this this
Date payment Code ID
1 24/06/2002 200 ABC <NA>
2 24/06/2002 1000 M567 98
3 25/06/2002 -1000 M567 187
4 25/06/2002 -1000 M567 187
5 26/06/2002 1000 XYZ 12ee
6 02/07/2002 -1000 M567 M11
7 03/07/2002 -1000 ABX M13
8 24/07/2002 -1000 M567 <NA>
9 08/07/2002 1200 M567 K999
10 08/07/2002 -1200 M567 K999
11 15/07/2002 1200 M567 111
12 17/07/2002 1200 M567 111
13 22/07/2002 200 M300 11
14 22/07/2002 56700 M678 12345
15 28/07/2002 -56700 M678 <NA>
16 29/07/2002 -200 ABC <NA>
As you can see there are some positive and negative payments in the data. The negative payments are the wrongly added transactions or refunds.
For example 1200 will cancel out with -1200 based on Code and ID whereas row 14 and 15 are similar but the ID is NA - So I have to fill that with ID of its positive payment row and viseversa. So that I can delete those both rows.
Code I tried with the help of a programmer on StackOverflow (previously asked):
library(dplyr)
library(data.table)
library(tidyr)
Final_df <- df1 %>%
group_by(Code) %>%
mutate(ind = rowid(payment)) %>%
group_by(ind, .add = TRUE) %>%
fill(ID, .direction = 'downup') #%>%
ungroup %>%
mutate(absPayment = abs(payment)) %>%
arrange(ID, Code, absPayment) %>%
group_by(Code, ID, absPayment) %>%
mutate(grp = rowid(sign(payment))) %>%
group_by(grp, .add = TRUE) %>%
filter(n() == 1) %>%
ungroup %>%
select(names(df1))
But the problem here is row 8 - 24/07/2002 -1000 M567 should be filled by row 2 as the code and positive payment is matched - so that later I can cancel these both rows. Since the row is far away from row 8 .direction = 'downup' is not working.
And I think there is a better way to fill NA's other than using direction( As it is not getting applied with similar rows far away)
The Expected output is:
Date payment Code ID
1 25/06/2002 -1000 M567 187
2 25/06/2002 -1000 M567 187
3 26/06/2002 1000 XYZ 12ee
4 02/07/2002 -1000 M567 M11
5 03/07/2002 -1000 ABX M13
6 15/07/2002 1200 M567 111
7 17/07/2002 1200 M567 111
8 22/07/2002 200 M300 11
I am struck at this since 5 days. Any solutions would be really helpful.
Thanks in advance
Another possible Dummy data:
temp_2 <- structure(list(Date = c("22/06/2002", "23/06/2002","24/06/2002", "25/06/2002","25/06/2002", "26/06/2002",
"02/07/2002","03/07/2002","24/07/2002", "08/07/2002",
"08/07/2002", "15/07/2002", "17/07/2002",
"22/07/2002", "22/07/2002", "28/07/2002", "29/07/2002"),
payment = c(200,-1000, 1000,-1000, -1000, 1000,
-1000,-1000,-1000, 1200,
-1200, 1200, 1200,
200, 56700, -56700, -200),
Code = c("ABC", "M567","M567", "M567","M567", "XYZ", "M567", "ABX" ,
"M567","M567", "M567",
"M567", "M567", "M300",
"M678", "M678", "ABC"),
ID = c(NA,"187", "98","187","187","12ee",NA,NA,
NA,"K999",
"K999", "111", "111", "11",
"12345", NA, NA)), row.names = c(NA, -17L), class = "data.frame")
CodePudding user response:
Here is my attempt at solving it, the trick is to replace the NAs correctly.
# fill NAs according to their values
temp <- temp %>%
mutate(abs_payment = abs(payment)) %>%
group_by(abs_payment, ID, Code) %>%
# should consider replacement only if ID has only one row or if it is NA
mutate(is_candidate = (n() == 1) | is.na(ID)) %>%
group_by(abs_payment, Code) %>%
# we do not want to replace IDs for non-na IDs
mutate(new_ID = case_when(is_candidate & is.na(ID) ~ na.omit(ID)[1],
TRUE ~ ID))
# remove if sum equal to 0
temp <- temp %>%
group_by(Code, new_ID, abs_payment) %>%
mutate(total = sum(payment)) %>%
filter(total != 0 )
CodePudding user response:
I'm not sure if this is correct, but here is my attempt. I don't see exactly how you are deriving the expected output. Are their other filtering criteria? Your original data has more than 8 rows.
library(tidyverse)
temp |>
mutate(Date = lubridate::dmy(Date)) |>
arrange(Code, abs(payment)) |>
group_by(Code, abs(payment), ID) |>
mutate(n = n()) |>
ungroup()|>
group_by(Code, abs(payment), n) |>
fill(ID, .direction = "updown") |>
ungroup()|>
select(names(temp)) |>
arrange(Date, abs(payment))
#> # A tibble: 16 x 4
#> Date payment Code ID
#> <date> <dbl> <chr> <chr>
#> 1 2002-06-24 200 ABC <NA>
#> 2 2002-06-24 1000 M567 98
#> 3 2002-06-25 -1000 M567 187
#> 4 2002-06-25 -1000 M567 187
#> 5 2002-06-26 1000 XYZ 12ee
#> 6 2002-07-02 -1000 M567 M11
#> 7 2002-07-03 -1000 ABX M13
#> 8 2002-07-08 1200 M567 K999
#> 9 2002-07-08 -1200 M567 K999
#> 10 2002-07-15 1200 M567 111
#> 11 2002-07-17 1200 M567 111
#> 12 2002-07-22 200 M300 11
#> 13 2002-07-22 56700 M678 12345
#> 14 2002-07-24 -1000 M567 M11
#> 15 2002-07-28 -56700 M678 12345
#> 16 2002-07-29 -200 ABC <NA>