Home > OS >  How to Filter out specific rows in a data frame using dplyr?
How to Filter out specific rows in a data frame using dplyr?

Time:12-29

I have several sheets that I import from excel. While these sheets are similar there are some differences due to manual entry. I am trying to filter out the rows that has "Total" and anything beyond that row. The logic I have works for df1 and df3 but I am not sure how to get it to work for df2. Could someone please help?

df1<-structure(list(...1 = structure(c(1630022400, 1630108800, 1630195200, 
1630281600, 1630368000, NA), tzone = "UTC", class = c("POSIXct", 
"POSIXt")), `Vinayak Trading` = c(1984.31, NA, NA, NA, NA, 2916.17
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))

df2<-structure(list(...1 = c("44526", "44527", "44528", "44529", "44530", 
"Total"), `Vinayak Trading` = c(NA, NA, NA, NA, NA, 0)), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))

df3<-structure(list(...1 = c("44680", "44681", NA, "Total", NA, NA
), `Vinayak Trading` = c(NA, NA, NA, 2736.42, NA, NA)), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))

transform <- function(df) {
  names(df)[1] <- "Date"
  df <- df %>% 
  filter(row_number() < which(is.na(Date))) %>%  #To tackle sheets where Total is not present
  filter(row_number() < which(Date=="Total")) %>% #To remove Total in sheets where it is present
  select(-Total) #To remove the Total column 
}

  df1 <- transform(df1)# Desired reults
  df2 <- transform(df2)# Error due to no NAs - don't know how to handle
  df3 <- transform(df3)#Desired result with warning

CodePudding user response:

We can use dplyr::cumany to remove "Total" (or NA) and anything beyond.

transform2 <- function(df) {
  df %>%
    rename(Date = 1) %>%
    filter(!cumany(Date %in% c("Total", NA))) %>%
    select(-any_of("Total"))
}
 
transform2(df1)
# # A tibble: 5 × 2
#   Date                `Vinayak Trading`
#   <dttm>                          <dbl>
# 1 2021-08-27 00:00:00             1984.
# 2 2021-08-28 00:00:00               NA 
# 3 2021-08-29 00:00:00               NA 
# 4 2021-08-30 00:00:00               NA 
# 5 2021-08-31 00:00:00               NA 
transform2(df2)
# # A tibble: 5 × 2
#   Date  `Vinayak Trading`
#   <chr>             <dbl>
# 1 44526                NA
# 2 44527                NA
# 3 44528                NA
# 4 44529                NA
# 5 44530                NA
transform2(df3)
# # A tibble: 2 × 2
#   Date  `Vinayak Trading`
#   <chr>             <dbl>
# 1 44680                NA
# 2 44681                NA
  • We can use rename(Date = 1) as an inline replacement for names(df)[1] <- "Date", it seems a bit more pipe-esque;
  • == NA doesn't return true/false, but %in% NA does; we can use is.na(Date) | Date == "Total", or we can use Date %in% c("Total", NA) with the results one would expect;
  • cumany is a "cumulative any", meaning that when a value returns true, then all subsequent values will be true as well, see cumany(c(F,T,F)); for its opposite, see cumall(c(T,F,T)); for a base-R equivalent, use cumsum(cond) > 0 for cumany and cumsum(!cond) == 0 for cumall; and
  • I use select(-any_of("Total")) since it will remove the column if it exists and do nothing otherwise (none of your sample data included it, so I thought it better to be safe).
  • Related