Home > OS >  Join data frame into one in r
Join data frame into one in r

Time:09-30

I have 4 data frames that all look like this:

Product 2018 Number Minimum Maximum
1 56 1 5
2 42 12 16
3 6523 23 56
4 123 23 102
5 56 23 64
6 245623 56 87
7 546 25 540
8 54566 253 560
Product 2019 Number Minimum Maximum
1 56 32 53
2 642 423 620
3 56423 432 560
4 3 431 802
5 2 2 6
6 4523 43 68
7 555 23 54
8 55646 3 6
Product 2020 Number Minimum Maximum
1 23 2 5
2 342 4 16
3 223 3 5
4 13 4 12
5 2 4 7
6 223 7 8
7 5 34 50
8 46 3 6
Product 2021 Number Minimum Maximum
1 234 3 5
2 3242 4 16
3 2423 43 56
4 123 43 102
5 24 4 6
6 2423 4 18
7 565 234 540
8 5646 23 56

I want to join all the tables so I get a table that looks like this:

Products Number 2021 Min-Max 2021 Number 2020 Min-Max 2020 Number 2019 Min-Max 2019 Number 2018 Min-Max 2018
1 234 3 to 5 23 2 to 5 ... ... ... ...
2 3242 4 to 16 342 4 to 16 ... ... ... ...
3 2423 43 to 56 223 3 to 5 ... ... ... ...
4 123 43 to 102 13 4 to 12 ... ... ... ...
5 24 4 to 6 2 4 to 7 ... ... ... ...
6 2423 4 to 18 223 7 to 8 ... ... ... ...
7 565 234 to 540 5 34 to 50 ... ... ... ...
8 5646 23 to 56 46 3 to 6 ... ... ... ...

The Product for all years are the same so I would like to have a data frame that contains the number for each year as a column and joins the column for minimum and maximum as one.

Any help is welcome!

CodePudding user response:

How about something like this. You are trying to join several dataframes by a single column, which is relatively straight forward using full_join. The difficulty is that you are trying to extract information from the column names and combine several columns at the same time. I would map out everying you want to do and then reduce the list of dataframes at the end. Here is an example with two dataframes, but you could add as many as you want to the list at the begining.

library(tidyverse)

#test data
set.seed(23)
df1 <- tibble("Product 2018" = seq(1:8), 
              Number = sample(1:100, 8), 
              Minimum = sample(1:100, 8),
              Maximum = map_dbl(Minimum, ~sample(.x:1000, 1)))

set.seed(46)
df2 <- tibble("Product 2019" = seq(1:8), 
              Number = sample(1:100, 8), 
              Minimum = sample(1:100, 8),
              Maximum = map_dbl(Minimum, ~sample(.x:1000, 1)))


list(df1, df2) |>
  map(\(x){
    year <- str_extract(colnames(x)[1], "\\d ?$")
    
    mutate(x, !!quo_name(paste0("Min-Max ", year)) := paste(Minimum, "to", Maximum))|>
      rename(!!quo_name(paste0("Number ", year)) := Number)|>
      rename_with(~gsub("\\s\\d ?$", "", .), 1) |>
      select(-c(Minimum, Maximum))
  }) |>
  reduce(full_join, by = "Product")
#> # A tibble: 8 x 5
#>   Product `Number 2018` `Min-Max 2018` `Number 2019` `Min-Max 2019`
#>     <int>         <int> <chr>                  <int> <chr>         
#> 1       1            29 21 to 481                 50 93 to 416     
#> 2       2            28 17 to 314                 78 7 to 313      
#> 3       3            72 40 to 787                  1 91 to 205     
#> 4       4            43 36 to 557                 47 55 to 542     
#> 5       5            45 70 to 926                 52 76 to 830     
#> 6       6            34 96 to 645                 70 20 to 922     
#> 7       7            48 31 to 197                 84 6 to 716      
#> 8       8            17 86 to 951                 99 75 to 768

CodePudding user response:

This is a similar answer, but includes bind_rows to combine the data.frames, then pivot_wider to end in a wide format.

The first steps strip the year from the Product XXXX column name, as this carries relevant information on year for that data.frame. If that column is renamed as Product they are easily combined (with a separate column containing the Year). If this step can be taken earlier in the data collection or processing timeline, it is helpful.

library(tidyverse)

list(df1, df2, df3, df4) %>%
  map(~.x %>%
        mutate(Year = gsub("Product", "", names(.x)[1])) %>%
        rename(Product = !!names(.[1]))) %>%
  bind_rows() %>%
  mutate(Min_Max = paste(Minimum, Maximum, sep = " to ")) %>%
  pivot_wider(id_cols = Product, names_from = Year, values_from = c(Number, Min_Max), names_vary = "slowest")

Output

  Product Number_2018 Min_Max_2018 Number_2019 Min_Max_2019 Number_2020 Min_Max_2020 Number_2021 Min_Max_2021
    <int>       <int> <chr>              <int> <chr>              <int> <chr>              <int> <chr>       
1       1          56 1 to 5                56 32 to 53              23 2 to 5               234 3 to 5      
2       2          42 12 to 16             642 423 to 620           342 4 to 16             3242 4 to 16     
3       3        6523 23 to 56           56423 432 to 560           223 3 to 5              2423 43 to 56    
4       4         123 23 to 102              3 431 to 802            13 4 to 12              123 43 to 102   
5       5          56 23 to 64               2 2 to 6                 2 4 to 7                24 4 to 6      
6       6      245623 56 to 87            4523 43 to 68             223 7 to 8              2423 4 to 18     
7       7         546 25 to 540            555 23 to 54               5 34 to 50             565 234 to 540  
8       8       54566 253 to 560         55646 3 to 6                46 3 to 6              5646 23 to 56  
  • Related