Home > Back-end >  Merging dataframes into dataframe format by matching rows AND columns
Merging dataframes into dataframe format by matching rows AND columns

Time:07-22

Recently I have started using R again. I am having problems with the following problem:

I have created a dataframe format in which I want to merge smaller dataframes into, based on both the columns and rows. I want to keep the dimensions of the large format dataframe and insert the data of the smaller dataframes. The tables that I am working with are quite big, but I have created some example data.

The large dataframe, which is an empty format table:

formattable <- data.frame("Agriculture" = c(NA,NA,NA,NA,NA,NA,NA,NA),"Mining" = c(NA,NA,NA,NA,NA,NA,NA,NA),
                "Industry" = c(NA,NA,NA,NA,NA,NA,NA,NA),"Services" = c(NA,NA,NA,NA,NA,NA,NA,NA), "Households" = c(NA,NA,NA,NA,NA,NA,NA,NA), 
                row.names = c("N01 water","N02 solar", "P01 coal", "P02 natural gas", "P03 diesel", "R01 waste", "R02 residuals","R03 losses" ) )

The smaller dataframes that hold data:

data1 <- data.frame("Agriculture" = c(5,7,NA),"Mining" = c(5,6,NA),
                 ,"Services" = c(4,8,NA), "Households" = c(1,11,NA), 
                 row.names = c("P01 coal", "P02 natural gas", "P03 diesel") )

data2 <- data.frame("Agriculture" = c(3,2),"Mining" = c(NA,5)
                          ,"Services" = c(5,7), 
                          row.names = c("N01 water","N02 solar") )

data3 <- data.frame("Agriculture" = c(3,6,5),"Mining" = c(NA,0,5),
                          "Industry" = c(5,NA,NA),"Services" = c(NA,NA,NA), "Households" = c(7,8,9), 
                          row.names = c("R01 waste", "R02 residuals","R03 losses" ) )

As an end product, I want one large dataframe that holds all the data from the smaller dataframes. The combinations of rows and columns that are empty/zero should still be visible with the value "0". Can anyone help me with this?

CodePudding user response:

I would do this one with tidyverse functions:

library(dplyr)
library(tibble)
library(tidyr)

formattable  |>
    rownames_to_column()  |> 
    rows_patch(rownames_to_column(data1), by = "rowname")  |>
    rows_patch(rownames_to_column(data2), by = "rowname")  |>
    rows_patch(rownames_to_column(data3), by = "rowname")  |>
    mutate(across(
        Agriculture:Households, \(col) replace_na(col, 0)
        )
    ) |> 
    column_to_rownames()

#                 Agriculture Mining Industry Services Households
# N01 water                 3      0        0        5          0
# N02 solar                 2      5        0        7          0
# P01 coal                  5      5        0        4          1
# P02 natural gas           7      6        0        8         11
# P03 diesel                0      0        0        0          0
# R01 waste                 3      0        5        0          7
# R02 residuals             6      0        0        0          8
# R03 losses                5      5        0        0          9

EDIT: Using Reduce

If you are using more than three data frames to update the original, you might want to put them in a list and use Reduce to avoid having to repeat yourself:

# Note use of `rownames_to_column()` on first data frame
l  <- list(rownames_to_column(formattable), data1, data2, data3)

Reduce(\(x, y) rows_patch(
    x, 
    rownames_to_column(y), 
    by = "rowname"), 
    l)  |>
    mutate(across(
        Agriculture:Households, \(col) replace_na(col, 0)
        )
    ) |> 
    column_to_rownames()

#                 Agriculture Mining Industry Services Households
# N01 water                 3      0        0        5          0
# N02 solar                 2      5        0        7          0
# P01 coal                  5      5        0        4          1
# P02 natural gas           7      6        0        8         11
# P03 diesel                0      0        0        0          0
# R01 waste                 3      0        5        0          7
# R02 residuals             6      0        0        0          8
# R03 losses                5      5        0        0          9

You will need at least dplyr v.1.0.0 for rows_patch(). From the docs:

rows_update() modifies existing rows (like UPDATE). Key values in y must be unique, and, by default, key values in y must exist in x.

rows_patch() works like rows_update() but only overwrites NA values.

CodePudding user response:

Another approach could potentially be to skip the larger data frame and use bind_rows and replace_na on the smaller:

library(dplyr)

bind_rows(data1, data2, data3) |> 
  mutate(across(everything(), ~ replace_na(., 0)))

Output:

                Agriculture Mining Services Households Industry
P01 coal                  5      5        4          1        0
P02 natural gas           7      6        8         11        0
P03 diesel                0      0        0          0        0
N01 water                 3      0        5          0        0
N02 solar                 2      5        7          0        0
R01 waste                 3      0        0          7        5
R02 residuals             6      0        0          8        0
R03 losses                5      5        0          9        0
  • Related