Deleting rows with specified condition-CodePudding

After calculating the average cost per city from 1900-1910, I need to delete all cities which are missing data between 1860 and 1863. Here's the data for that timespan...

                    city    cost1860  cost1861  cost1862     cost1863   cost1864
1                 Boston          NA        NA        NA           NA         NA
2            Los Angeles  1.77643659  3.516253  1.683492  3.573637296  4.4076780
3                Detroit          NA        NA        NA           NA         NA
4          New York City          NA        NA        NA           NA         NA
5                Chicago 32.87500913 39.785973 35.471498 24.683812800 19.5488509
6                Memphis          NA        NA        NA           NA         NA
7                Seattle          NA        NA        NA           NA         NA
8              St. Louis -0.01007441  4.659959        NA  0.005722915         NA
9                Boulder          NA        NA        NA           NA         NA
10                 Boise          NA        NA        NA           NA         NA

Now, there's also data in separate columns for the following years, but I need to figure out a way to remove the cities which have any NA values between 1860 and 1863, without also removing all the data for the following years. So, once this is done, I should have only cities with data between 1860 and 1863 (as well as data for the following years, which may have NA values).

I've been able to remove cities with missing data between 1860 and 1863, but I can't figure out how to do so without also removing all data from following years. This is my code for that...

na.exclude(mydata[, 2:5])

mydata_1860_1863 <- na.exclude(mydata[, 2:5])

Does anyone know how I could remove cities missing data between 1860 and 1863 while also retaining data for the following years?

CodePudding user response：

It's difficult to remove rows but retain columns. Rather than removal, why not flag those rows where 1860-1863 data is missing, so you can filter on them later?

For example:

library(dplyr)
mydata <- mydata %>% 
  mutate(is_missing = ifelse(is.na(rowSums(.[, 2:5])), 1, 0))

Result:

            city    cost1860  cost1861  cost1862     cost1863  cost1864 is_missing
1         Boston          NA        NA        NA           NA        NA          1
2    Los Angeles  1.77643659  3.516253  1.683492  3.573637296  4.407678          0
3        Detroit          NA        NA        NA           NA        NA          1
4  New York City          NA        NA        NA           NA        NA          1
5        Chicago 32.87500913 39.785973 35.471498 24.683812800 19.548851          0
6        Memphis          NA        NA        NA           NA        NA          1
7        Seattle          NA        NA        NA           NA        NA          1
8      St. Louis -0.01007441  4.659959        NA  0.005722915        NA          1
9        Boulder          NA        NA        NA           NA        NA          1
10         Boise          NA        NA        NA           NA        NA          1

CodePudding user response：

Here a solution based on data.table:

library(data.table)

dt <- data.table::data.table(city = c("Boston","Los Angeles",    "Detroit","New York City","Chicago","Memphis","Seattle",    "St. Louis","Boulder","Boise"), cost1860 = c(NA,1.77643659,NA,NA,  32.87500913,NA,NA,-0.01007441,NA,NA), cost1861 = c(NA,3.516253,NA,NA,39.785973,  NA,NA,4.659959,NA,NA), cost1862 = c(NA, 1.683492, NA, NA, 35.471498, NA, NA, NA, NA, NA), cost1863 = c(NA,3.573637296,NA,NA,  24.6838128,NA,NA,0.005722915,NA,NA), cost1864 = c(NA, 4.407678, NA, NA, 19.5488509, NA, NA, NA, NA, NA) ) 

dt[dt[,!is.na(rowSums(.SD)),.SDcols=-c(1,6)]]

#>           city  cost1860  cost1861  cost1862  cost1863  cost1864
#> 1: Los Angeles  1.776437  3.516253  1.683492  3.573637  4.407678
#> 2:     Chicago 32.875009 39.785973 35.471498 24.683813 19.548851

Now, a tidyverse approach:

library(tidyverse)

df <- data.frame(stringsAsFactors = FALSE, city = c("Boston", "Los Angeles","Detroit","New York City","Chicago", "Memphis","Seattle","St. Louis","Boulder","Boise"), cost1860 = c(NA,1.77643659,NA, NA,32.87500913,NA,NA,-0.01007441,NA,NA), cost1861 = c(NA,3.516253,NA, NA,39.785973,NA,NA,4.659959,NA,NA), cost1862 = c(NA,1.683492,NA, NA,35.471498,NA,NA,NA,NA,NA), cost1863 = c(NA,3.573637296,NA, NA,24.6838128,NA,NA,0.005722915,NA,NA), cost1864 = c(NA,4.407678,NA, NA,19.5488509,NA,NA,NA,NA,NA))

df %>% 
  filter(across(2:5, ~ !is.na(.x)))

#>          city  cost1860  cost1861  cost1862  cost1863  cost1864
#> 1 Los Angeles  1.776437  3.516253  1.683492  3.573637  4.407678
#> 2     Chicago 32.875009 39.785973 35.471498 24.683813 19.548851