Home > Software engineering >  If dataframe contents are not unique; subset, combine & rename
If dataframe contents are not unique; subset, combine & rename

Time:10-27

I've a dynamic dataframe of the following pattern:

structure(list(Date = structure(c(19304, 19305, 19311, 
19311, 19312), class = "Date"), Category = c("4", 
"6", "1", "0", "3"), Units_Sold = c(NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_), Raised = c(NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_), Method = c("Trad", 
"Trad", "Unknown", "Trad", "Unknown"), Day = c(8, 9, 15, 15, 16)), row.names = c(NA, 
-5L), class = c("tbl_df", "tbl", "data.frame"))

As you can probably see, there's two categories that have the same date. What I'd like to do is create a condition: if there are two rows with the same date, the df will be subsetted (say call it df_copy), and in that new df, one of the rows will be dropped and the contents of the "Category" column will be changed to say "Check Dataframe", and the "Method" column will be changed to say "Attention". Any advice most appreciated.

In answer to the question, I'd the dataframe to look something like this:

tibble [5 x 6] (S3: tbl_df/tbl/data.frame)
 $ Date : Date[1:5], format: "2022-11-08" "2022-11-09" "2022-11-15" "2022-11-16"
 $ Category: chr [1:5] "4" "6" "Check Dataframe" "3"
 $ Units_Sold: num [1:5] NA NA NA NA
 $ Raised: num [1:5] NA NA NA NA
 $ Method : chr [1:5] "Trade" "Trad" "Attention" "Unknown"
 $ Day: num [1:5] 8 9 15 15 16

CodePudding user response:

Using dplyr, group by Date, add your warning flags to rows where n() > 1, then use distinct() to drop duplicate rows:

library(dplyr)

df_copy <- df_orig %>%
  group_by(Date) %>%
  mutate(
    Category = ifelse(n() > 1, "Check Dataframe", Category),
    Method = ifelse(n() > 1, "Attention", Method)
  ) %>%
  ungroup() %>%
  distinct(Date, .keep_all = TRUE)

df_copy

Output:

# A tibble: 4 × 6
  Date       Category        Units_Sold Raised Method      Day
  <date>     <chr>                <dbl>  <dbl> <chr>     <dbl>
1 2022-11-08 4                       NA     NA Trad          8
2 2022-11-09 6                       NA     NA Trad          9
3 2022-11-15 Check Dataframe         NA     NA Attention    15
4 2022-11-16 3                       NA     NA Unknown      16

CodePudding user response:

We may group by 'Date', use an if/else condtion to change the values in 'Category', 'Method' columns, slice the first row and ungroup

library(dplyr)
df2 <-  df1 %>%
    group_by(Date) %>% 
    mutate(Category = if(n() > 1) "Check Dataframe" else Category,
           Method = if(n() > 1) "Attention" else Method) %>%
    slice(n=1) %>%
    ungroup

-output

 df2
# A tibble: 4 × 6
  Date       Category        Units_Sold Raised Method      Day
  <date>     <chr>                <dbl>  <dbl> <chr>     <dbl>
1 2022-11-08 4                       NA     NA Trad          8
2 2022-11-09 6                       NA     NA Trad          9
3 2022-11-15 Check Dataframe         NA     NA Attention    15
4 2022-11-16 3                       NA     NA Unknown      16

Or with data.table

library(data.table)
setDT(df1)[, c("Category", "Method") := if(.N > 1)
    .("Check Dataframe", "Attention") else .(Category, Method), Date]
df2 <- unique(df1, by  = 'Date')

-output

> df2
         Date        Category Units_Sold Raised    Method   Day
       <Date>          <char>      <num>  <num>    <char> <num>
1: 2022-11-08               4         NA     NA      Trad     8
2: 2022-11-09               6         NA     NA      Trad     9
3: 2022-11-15 Check Dataframe         NA     NA Attention    15
4: 2022-11-16               3         NA     NA   Unknown    16
  • Related