After calculating the average cost per city from 1900-1910, I need to delete all cities which are missing data between 1860 and 1863. Here's the data for that timespan...
city cost1860 cost1861 cost1862 cost1863 cost1864
1 Boston NA NA NA NA NA
2 Los Angeles 1.77643659 3.516253 1.683492 3.573637296 4.4076780
3 Detroit NA NA NA NA NA
4 New York City NA NA NA NA NA
5 Chicago 32.87500913 39.785973 35.471498 24.683812800 19.5488509
6 Memphis NA NA NA NA NA
7 Seattle NA NA NA NA NA
8 St. Louis -0.01007441 4.659959 NA 0.005722915 NA
9 Boulder NA NA NA NA NA
10 Boise NA NA NA NA NA
There's also data in separate columns for the following years. I need a way to remove the cities that have any NA values between 1860 and 1863 without also dropping the columns for the later years. Once this is done, I should have only cities with complete data between 1860 and 1863, along with their data for the following years (which may still contain NA values).
I've been able to remove cities with missing data between 1860 and 1863, but I can't figure out how to do so without also removing all data from following years. This is my code for that...
mydata_1860_1863 <- na.exclude(mydata[, 2:5])
Does anyone know how I could remove cities missing data between 1860 and 1863 while also retaining data for the following years?
CodePudding user response:
The problem with na.exclude(mydata[, 2:5]) is that it subsets the data to columns 2:5 before dropping rows, so the later years are lost. Rather than removing rows, why not flag those where 1860-1863 data is missing, so you can filter on them later?
For example:
library(dplyr)

# rowSums() over columns 2:5 (cost1860:cost1863) is NA whenever any of those years is NA
mydata <- mydata %>%
  mutate(is_missing = ifelse(is.na(rowSums(.[, 2:5])), 1, 0))
Result:
city cost1860 cost1861 cost1862 cost1863 cost1864 is_missing
1 Boston NA NA NA NA NA 1
2 Los Angeles 1.77643659 3.516253 1.683492 3.573637296 4.407678 0
3 Detroit NA NA NA NA NA 1
4 New York City NA NA NA NA NA 1
5 Chicago 32.87500913 39.785973 35.471498 24.683812800 19.548851 0
6 Memphis NA NA NA NA NA 1
7 Seattle NA NA NA NA NA 1
8 St. Louis -0.01007441 4.659959 NA 0.005722915 NA 1
9 Boulder NA NA NA NA NA 1
10 Boise NA NA NA NA NA 1
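With the flag in place, keeping only the complete cities is a one-line filter. Here's a self-contained sketch using a cut-down version of the data above (four cities instead of ten, purely for brevity):

```r
library(dplyr)

# Small reconstruction of mydata for a runnable demo
mydata <- data.frame(
  city = c("Boston", "Los Angeles", "Chicago", "St. Louis"),
  cost1860 = c(NA, 1.77643659, 32.87500913, -0.01007441),
  cost1861 = c(NA, 3.516253, 39.785973, 4.659959),
  cost1862 = c(NA, 1.683492, 35.471498, NA),
  cost1863 = c(NA, 3.573637296, 24.6838128, 0.005722915),
  cost1864 = c(NA, 4.407678, 19.5488509, NA)
)

complete_cities <- mydata %>%
  mutate(is_missing = ifelse(is.na(rowSums(.[, 2:5])), 1, 0)) %>%
  filter(is_missing == 0) %>%   # keep rows with complete 1860-1863 data
  select(-is_missing)           # drop the helper column
```

St. Louis is dropped (NA in cost1862) even though it has some 1860-1863 data, while Chicago's cost1864 survives intact.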
CodePudding user response:
Here's a solution based on data.table:
library(data.table)
dt <- data.table(
  city = c("Boston", "Los Angeles", "Detroit", "New York City", "Chicago",
           "Memphis", "Seattle", "St. Louis", "Boulder", "Boise"),
  cost1860 = c(NA, 1.77643659, NA, NA, 32.87500913, NA, NA, -0.01007441, NA, NA),
  cost1861 = c(NA, 3.516253, NA, NA, 39.785973, NA, NA, 4.659959, NA, NA),
  cost1862 = c(NA, 1.683492, NA, NA, 35.471498, NA, NA, NA, NA, NA),
  cost1863 = c(NA, 3.573637296, NA, NA, 24.6838128, NA, NA, 0.005722915, NA, NA),
  cost1864 = c(NA, 4.407678, NA, NA, 19.5488509, NA, NA, NA, NA, NA)
)
# check every column except city (1) and cost1864 (6)
dt[dt[, !is.na(rowSums(.SD)), .SDcols = -c(1, 6)]]
#> city cost1860 cost1861 cost1862 cost1863 cost1864
#> 1: Los Angeles 1.776437 3.516253 1.683492 3.573637 4.407678
#> 2: Chicago 32.875009 39.785973 35.471498 24.683813 19.548851
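The same row filter can also be written with data.table's own na.omit() method, whose cols argument restricts the NA check to the chosen columns while the subset keeps every column. A minimal sketch with a cut-down table:

```r
library(data.table)

dt <- data.table(
  city = c("Boston", "Los Angeles", "Chicago", "St. Louis"),
  cost1860 = c(NA, 1.77643659, 32.87500913, -0.01007441),
  cost1861 = c(NA, 3.516253, 39.785973, 4.659959),
  cost1862 = c(NA, 1.683492, 35.471498, NA),
  cost1863 = c(NA, 3.573637296, 24.6838128, 0.005722915),
  cost1864 = c(NA, 4.407678, 19.5488509, NA)
)

# Drop rows with NA in columns 2:5 (1860-1863) only; cost1864 may still be NA
complete_dt <- na.omit(dt, cols = 2:5)
```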
Now, a tidyverse approach:
library(tidyverse)
df <- data.frame(
  stringsAsFactors = FALSE,
  city = c("Boston", "Los Angeles", "Detroit", "New York City", "Chicago",
           "Memphis", "Seattle", "St. Louis", "Boulder", "Boise"),
  cost1860 = c(NA, 1.77643659, NA, NA, 32.87500913, NA, NA, -0.01007441, NA, NA),
  cost1861 = c(NA, 3.516253, NA, NA, 39.785973, NA, NA, 4.659959, NA, NA),
  cost1862 = c(NA, 1.683492, NA, NA, 35.471498, NA, NA, NA, NA, NA),
  cost1863 = c(NA, 3.573637296, NA, NA, 24.6838128, NA, NA, 0.005722915, NA, NA),
  cost1864 = c(NA, 4.407678, NA, NA, 19.5488509, NA, NA, NA, NA, NA)
)
df %>%
  filter(if_all(2:5, ~ !is.na(.x)))  # if_all() replaces the deprecated filter(across()) pattern
#> city cost1860 cost1861 cost1862 cost1863 cost1864
#> 1 Los Angeles 1.776437 3.516253 1.683492 3.573637 4.407678
#> 2 Chicago 32.875009 39.785973 35.471498 24.683813 19.548851
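For completeness, base R can do this without any packages: complete.cases() returns TRUE for rows with no NA in the chosen columns, and using it to row-subset the full data frame keeps every year's column. A minimal sketch with a cut-down version of the data:

```r
df <- data.frame(
  city = c("Boston", "Los Angeles", "Chicago", "St. Louis"),
  cost1860 = c(NA, 1.77643659, 32.87500913, -0.01007441),
  cost1861 = c(NA, 3.516253, 39.785973, 4.659959),
  cost1862 = c(NA, 1.683492, 35.471498, NA),
  cost1863 = c(NA, 3.573637296, 24.6838128, 0.005722915),
  cost1864 = c(NA, 4.407678, 19.5488509, NA)
)

# complete.cases() is evaluated on columns 2:5 only,
# but the row subset is applied to the *full* data frame
df_complete <- df[complete.cases(df[, 2:5]), ]
```

This is the direct fix for the original na.exclude(mydata[, 2:5]) attempt: the column subset goes inside the missing-data check, not around the whole assignment.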