Recently I have started using R again. I am having problems with the following problem:
I have created a dataframe format in which I want to merge smaller dataframes into, based on both the columns and rows. I want to keep the dimensions of the large format dataframe and insert the data of the smaller dataframes. The tables that I am working with are quite big, but I have created some example data.
The large dataframe, which is an empty format table:
formattable <- data.frame("Agriculture" = c(NA,NA,NA,NA,NA,NA,NA,NA),"Mining" = c(NA,NA,NA,NA,NA,NA,NA,NA),
"Industry" = c(NA,NA,NA,NA,NA,NA,NA,NA),"Services" = c(NA,NA,NA,NA,NA,NA,NA,NA), "Households" = c(NA,NA,NA,NA,NA,NA,NA,NA),
row.names = c("N01 water","N02 solar", "P01 coal", "P02 natural gas", "P03 diesel", "R01 waste", "R02 residuals","R03 losses" ) )
The smaller dataframes that hold data:
data1 <- data.frame("Agriculture" = c(5,7,NA),"Mining" = c(5,6,NA),
,"Services" = c(4,8,NA), "Households" = c(1,11,NA),
row.names = c("P01 coal", "P02 natural gas", "P03 diesel") )
data2 <- data.frame("Agriculture" = c(3,2),"Mining" = c(NA,5)
,"Services" = c(5,7),
row.names = c("N01 water","N02 solar") )
data3 <- data.frame("Agriculture" = c(3,6,5),"Mining" = c(NA,0,5),
"Industry" = c(5,NA,NA),"Services" = c(NA,NA,NA), "Households" = c(7,8,9),
row.names = c("R01 waste", "R02 residuals","R03 losses" ) )
As an end product, I want one large dataframe that holds all the data from the smaller dataframes. The combinations of rows and columns that are empty/zero should still be visible with the value "0". Can anyone help me with this?
CodePudding user response:
I would do this one with tidyverse
functions:
library(dplyr)
library(tibble)
library(tidyr)
formattable |>
rownames_to_column() |>
rows_patch(rownames_to_column(data1), by = "rowname") |>
rows_patch(rownames_to_column(data2), by = "rowname") |>
rows_patch(rownames_to_column(data3), by = "rowname") |>
mutate(across(
Agriculture:Households, \(col) replace_na(col, 0)
)
) |>
column_to_rownames()
# Agriculture Mining Industry Services Households
# N01 water 3 0 0 5 0
# N02 solar 2 5 0 7 0
# P01 coal 5 5 0 4 1
# P02 natural gas 7 6 0 8 11
# P03 diesel 0 0 0 0 0
# R01 waste 3 0 5 0 7
# R02 residuals 6 0 0 0 8
# R03 losses 5 5 0 0 9
EDIT: Using Reduce
If you are using more than three data frames to update the original, you might want to put them in a list
and use Reduce
to avoid having to repeat yourself:
# Note use of `rownames_to_column()` on first data frame
l <- list(rownames_to_column(formattable), data1, data2, data3)
Reduce(\(x, y) rows_patch(
x,
rownames_to_column(y),
by = "rowname"),
l) |>
mutate(across(
Agriculture:Households, \(col) replace_na(col, 0)
)
) |>
column_to_rownames()
# Agriculture Mining Industry Services Households
# N01 water 3 0 0 5 0
# N02 solar 2 5 0 7 0
# P01 coal 5 5 0 4 1
# P02 natural gas 7 6 0 8 11
# P03 diesel 0 0 0 0 0
# R01 waste 3 0 5 0 7
# R02 residuals 6 0 0 0 8
# R03 losses 5 5 0 0 9
You will need at least dplyr v.1.0.0
for rows_patch()
. From the docs:
rows_update() modifies existing rows (like UPDATE). Key values in y must be unique, and, by default, key values in y must exist in x.
rows_patch() works like rows_update() but only overwrites NA values.
CodePudding user response:
Another approach could potentially be to skip the larger data frame and use bind_rows
and replace_na
on the smaller:
library(dplyr)
bind_rows(data1, data2, data3) |>
mutate(across(everything(), ~ replace_na(., 0)))
Output:
Agriculture Mining Services Households Industry
P01 coal 5 5 4 1 0
P02 natural gas 7 6 8 11 0
P03 diesel 0 0 0 0 0
N01 water 3 0 5 0 0
N02 solar 2 5 7 0 0
R01 waste 3 0 0 7 5
R02 residuals 6 0 0 8 0
R03 losses 5 5 0 9 0