Table functions to match and merge based on different criteria-CodePudding

There are two tables, table_a and table_b. One table has monthly data while the other table has yearly data (although both tables have a 'year' column / both tables may have a different year range). Example:

           table_a                         table_b
     year   month   infltn               year   h_index
[1]  2010   May     4.0%            [1]  2011   102.0
[2]  2010   Jun     4.0%            [2]  2012   102.5
[3]  2010   Jul     4.0%            [3]  2013   103.0
[4]  2010   Aug     4.0%            [4]  2014   103.6
[5]  2010   Sep     4.0%            [5]  2015   104.1
[6]  2010   Oct     3.0%            [6]  2016   104.6
[7]  2010   Nov     3.0%            [7]  2017   105.1
[8]  2010   Dec     3.0%            [8]  2018   105.6
[9]  2011   Jan     4.0%            [9]  2019   106.2
[10] 2011   Feb     4.0%            [10] 2020   106.7
[11] 2011   Mar     4.0%            [11] 2021   107.2
[12] 2011   Apr     5.0%            [12] 2018   107.8
[13] 2011   May     5.0%            [13] 2019   108.3
[14] 2011   Jun     5.0%            [14] 2020   108.8
[15] 2011   Jul     5.0%            [15] 2021   109.4

The first part of the question/request: if creating a new_table_a (based on the yearly data of table_a) the data from table_b needs to be interpolated to populate the new table. As an example the column h_index in table_a has been interpolated into new_table_a as h_index2 (see below, new_table_a). Also, please note that in the example the data started in 2011 so prior to 2011 that h_index2 shows the character '-'.

               new_table_a              
     year   month   infltn  h_index2
[1]  2010   May       4%        -
[2]  2010   Jun       4%        -
[3]  2010   Jul       4%        -
[4]  2010   Aug       4%        -
[5]  2010   Sep       4%        -
[6]  2010   Oct       3%        -
[7]  2010   Nov       3%        -
[8]  2010   Dec       3%        -
[9]  2011   Jan       4%        102.0%
[10] 2011   Feb       4%        102.1%
[11] 2011   Mar       4%        102.1%
[12] 2011   Apr       5%        102.2%
[13] 2011   May       5%        102.2%
[14] 2011   Jun       5%        102.3%
[15] 2011   Jul       5%        102.3%

The second part of the question/request: if creating a new_table_b based on the monthly data of table_b, how do you average the calender months for the respective year so that you can populate new_table_b with an average (from infltn in table_a to infltn_avg in new_table_b). Example:

new_table_b             
     Year   h_index     infltn_avg
[1]  2011   102.0       4.0%
[2]  2012   102.5       4.0%
[3]  2013   103.0       4.0%
[4]  2014   103.6       4.0%
[5]  2015   104.1       4.0%
[6]  2016   104.6       3.0%
[7]  2017   105.1       3.0%
[8]  2018   105.6       3.0%
[9]  2019   106.2       4.0%
[10] 2020   106.7       4.0%
[11] 2021   107.2       4.0%
[12] 2018   107.8       5.0%
[13] 2019   108.3       5.0%
[14] 2020   108.8       5.0%
[15] 2021   109.4       5.0%

I have tried to code this using a range of loops and if's, which would seem rational in something like VBA, although it a bit clunky in R (and doesnt make much use of the tidyverse packages, which I am still learning) and I am sure that there is a more intelligent way of approaching it.

Thank you for reading!

CodePudding user response：

Here is a way to achieve this:

table_b$month ="Jan"
new_table_a = table_a %>% left_join(table_b, by =c("year", "month"))
new_table_a$h_index2 = na.approx(new_table_a$h_index, na.rm = FALSE)
new_table_a$infltn = as.numeric(sub("%", "", new_table_a$infltn))
new_table_b = new_table_a %>% group_by(year) %>% summarise(h_index2= min(h_index2), infltn_avg = mean(infltn))

new_table_a

year month infltn h_index h_index2
1  2010   May      4      NA       NA
2  2010   Jun      4      NA       NA
3  2010   Jul      4      NA       NA
4  2010   Aug      4      NA       NA
5  2010   Sep      4      NA       NA
6  2010   Oct      3      NA       NA
7  2010   Nov      3      NA       NA
8  2010   Dec      3      NA       NA
9  2011   Jan      4   102.0 102.0000
10 2011   Feb      4      NA 102.0417
11 2011   Mar      4      NA 102.0833
12 2011   Apr      5      NA 102.1250
13 2011   May      5      NA 102.1667
14 2011   Jun      5      NA 102.2083
15 2011   Jul      5      NA 102.2500
16 2011   Aug      5      NA 102.2917
17 2011   Sep      5      NA 102.3333
18 2011   Oct      5      NA 102.3750
19 2011   Nov      5      NA 102.4167
20 2011   Dec      5      NA 102.4583
21 2012   Jan      5   102.5 102.5000

new_table_b

# A tibble: 6 x 3
   year h_index2 infltn_avg
  <int>    <dbl>      <dbl>
1  2010      NA        3.62
2  2011     102        4.75
3  2012     102.       5   
4  2013     103        5   
5  2014     104.       6   
6  2015     104.       6

Details

First, to facilitate the extrapolation of h_index we add a month column in the table_b set to "Jan"

table_b$month ="Jan"

Then we join the two data frames by year and month

new_table_a = table_a %>% left_join(table_b, by =c("year", "month"))

For the extrapolation you can use na.approx from the zoo package. h_index2 is the extrapolation between 2 known h_index. Note: As this column is numeric it's best to keep the NA rather than adding a "-".

library(zoo)
new_table_a$h_index2 = na.approx(new_table_a$h_index, na.rm = FALSE)

Now to generate new_table_b, we need the average percentage per year. For this we have to transform the infltn column into numeric value by removing the "%":

new_table_a$infltn = as.numeric(sub("%", "", new_table_a$infltn))

Finally, we just need to summarize the new_table_a to have the new_table_b. After grouping by year, we take the lowest value for min(h_index2) and do the average of monthly inflation mean(infltn)

new_table_b = new_table_a %>% group_by(year) %>% summarize(h_index2= min(h_index2), infltn_avg = mean(infltn))

Data

table_a = read.table(text="year month infltn
1  2010   May  4.00%
2  2010   Jun  4.00%
3  2010   Jul  4.00%
4  2010   Aug  4.00%
5  2010   Sep  4.00%
6  2010   Oct  3.00%
7  2010   Nov  3.00%
8  2010   Dec  3.00%
9  2011   Jan  4.00%
10 2011   Feb  4.00%
11 2011   Mar  4.00%
12 2011   Apr  5.00%
13 2011   May  5.00%
14 2011   Jun  5.00%
15 2011   Jul  5.00%
16 2011   Aug  5.00%
17 2011   Sep  5.00%
18 2011   Oct  5.00%
19 2011   Nov  5.00%
20 2011   Dec  5.00%
21 2012   Jan  5.00%
22 2012   Feb  5.00%
23 2012   Mar  5.00%
24 2012   Apr  5.00%
25 2012   May  5.00%
26 2012   Jun  5.00%
27 2012   Jul  5.00%
28 2012   Aug  5.00%
29 2012   Sep  5.00%
30 2012   Oct  5.00%
31 2012   Nov  5.00%
32 2012   Dec  5.00%
33 2013   Jan  5.00%
34 2013   Feb  5.00%
35 2013   Mar  5.00%
36 2013   Apr  5.00%
37 2013   May  5.00%
38 2013   Jun  5.00%
39 2013   Jul  5.00%
40 2013   Aug  5.00%
41 2013   Sep  5.00%
42 2013   Oct  5.00%
43 2013   Nov  5.00%
44 2013   Dec  5.00%
45 2014   Jan  6.00%
46 2014   Feb  6.00%
47 2014   Mar  6.00%
48 2014   Apr  6.00%
49 2014   May  6.00%
50 2014   Jun  6.00%
51 2014   Jul  6.00%
52 2014   Aug  6.00%
53 2014   Sep  6.00%
54 2014   Oct  6.00%
55 2014   Nov  6.00%
56 2014   Dec  6.00%
57 2015   Jan  6.00%", header= TRUE)
table_b = read.table(text="year h_index
1  2011   102.0
2  2012   102.5
3  2013   103.0
4  2014   103.6
5  2015   104.1
6  2016   104.6
7  2017   105.1
8  2018   105.6
9  2019   106.2
10 2020   106.7
11 2021   107.2
12 2018   107.8
13 2019   108.3
14 2020   108.8
15 2021   109.4", header = TRUE )