Join data frame into one in r-CodePudding

I have 4 data frames that all look like this:

Product 2018	Number	Minimum	Maximum
1	56	1	5
2	42	12	16
3	6523	23	56
4	123	23	102
5	56	23	64
6	245623	56	87
7	546	25	540
8	54566	253	560

Product 2019	Number	Minimum	Maximum
1	56	32	53
2	642	423	620
3	56423	432	560
4	3	431	802
5	2	2	6
6	4523	43	68
7	555	23	54
8	55646	3	6

Product 2020	Number	Minimum	Maximum
1	23	2	5
2	342	4	16
3	223	3	5
4	13	4	12
5	2	4	7
6	223	7	8
7	5	34	50
8	46	3	6

Product 2021	Number	Minimum	Maximum
1	234	3	5
2	3242	4	16
3	2423	43	56
4	123	43	102
5	24	4	6
6	2423	4	18
7	565	234	540
8	5646	23	56

I want to join all the tables so I get a table that looks like this:

Products	Number 2021	Min-Max 2021	Number 2020	Min-Max 2020	Number 2019	Min-Max 2019	Number 2018	Min-Max 2018
1	234	3 to 5	23	2 to 5	...	...	...	...
2	3242	4 to 16	342	4 to 16	...	...	...	...
3	2423	43 to 56	223	3 to 5	...	...	...	...
4	123	43 to 102	13	4 to 12	...	...	...	...
5	24	4 to 6	2	4 to 7	...	...	...	...
6	2423	4 to 18	223	7 to 8	...	...	...	...
7	565	234 to 540	5	34 to 50	...	...	...	...
8	5646	23 to 56	46	3 to 6	...	...	...	...

The Product for all years are the same so I would like to have a data frame that contains the number for each year as a column and joins the column for minimum and maximum as one.

Any help is welcome!

CodePudding user response：

How about something like this. You are trying to join several dataframes by a single column, which is relatively straight forward using full_join. The difficulty is that you are trying to extract information from the column names and combine several columns at the same time. I would map out everying you want to do and then reduce the list of dataframes at the end. Here is an example with two dataframes, but you could add as many as you want to the list at the begining.

library(tidyverse)

#test data
set.seed(23)
df1 <- tibble("Product 2018" = seq(1:8), 
              Number = sample(1:100, 8), 
              Minimum = sample(1:100, 8),
              Maximum = map_dbl(Minimum, ~sample(.x:1000, 1)))

set.seed(46)
df2 <- tibble("Product 2019" = seq(1:8), 
              Number = sample(1:100, 8), 
              Minimum = sample(1:100, 8),
              Maximum = map_dbl(Minimum, ~sample(.x:1000, 1)))


list(df1, df2) |>
  map(\(x){
    year <- str_extract(colnames(x)[1], "\\d ?$")
    
    mutate(x, !!quo_name(paste0("Min-Max ", year)) := paste(Minimum, "to", Maximum))|>
      rename(!!quo_name(paste0("Number ", year)) := Number)|>
      rename_with(~gsub("\\s\\d ?$", "", .), 1) |>
      select(-c(Minimum, Maximum))
  }) |>
  reduce(full_join, by = "Product")
#> # A tibble: 8 x 5
#>   Product `Number 2018` `Min-Max 2018` `Number 2019` `Min-Max 2019`
#>     <int>         <int> <chr>                  <int> <chr>         
#> 1       1            29 21 to 481                 50 93 to 416     
#> 2       2            28 17 to 314                 78 7 to 313      
#> 3       3            72 40 to 787                  1 91 to 205     
#> 4       4            43 36 to 557                 47 55 to 542     
#> 5       5            45 70 to 926                 52 76 to 830     
#> 6       6            34 96 to 645                 70 20 to 922     
#> 7       7            48 31 to 197                 84 6 to 716      
#> 8       8            17 86 to 951                 99 75 to 768

CodePudding user response：

This is a similar answer, but includes bind_rows to combine the data.frames, then pivot_wider to end in a wide format.

The first steps strip the year from the Product XXXX column name, as this carries relevant information on year for that data.frame. If that column is renamed as Product they are easily combined (with a separate column containing the Year). If this step can be taken earlier in the data collection or processing timeline, it is helpful.

library(tidyverse)

list(df1, df2, df3, df4) %>%
  map(~.x %>%
        mutate(Year = gsub("Product", "", names(.x)[1])) %>%
        rename(Product = !!names(.[1]))) %>%
  bind_rows() %>%
  mutate(Min_Max = paste(Minimum, Maximum, sep = " to ")) %>%
  pivot_wider(id_cols = Product, names_from = Year, values_from = c(Number, Min_Max), names_vary = "slowest")

Output

  Product Number_2018 Min_Max_2018 Number_2019 Min_Max_2019 Number_2020 Min_Max_2020 Number_2021 Min_Max_2021
    <int>       <int> <chr>              <int> <chr>              <int> <chr>              <int> <chr>       
1       1          56 1 to 5                56 32 to 53              23 2 to 5               234 3 to 5      
2       2          42 12 to 16             642 423 to 620           342 4 to 16             3242 4 to 16     
3       3        6523 23 to 56           56423 432 to 560           223 3 to 5              2423 43 to 56    
4       4         123 23 to 102              3 431 to 802            13 4 to 12              123 43 to 102   
5       5          56 23 to 64               2 2 to 6                 2 4 to 7                24 4 to 6      
6       6      245623 56 to 87            4523 43 to 68             223 7 to 8              2423 4 to 18     
7       7         546 25 to 540            555 23 to 54               5 34 to 50             565 234 to 540  
8       8       54566 253 to 560         55646 3 to 6                46 3 to 6              5646 23 to 56