There are two tables, table_a and table_b. One table has monthly data while the other table has yearly data (although both tables have a 'year' column / both tables may have a different year range). Example:
table_a table_b
year month infltn year h_index
[1] 2010 May 4.0% [1] 2011 102.0
[2] 2010 Jun 4.0% [2] 2012 102.5
[3] 2010 Jul 4.0% [3] 2013 103.0
[4] 2010 Aug 4.0% [4] 2014 103.6
[5] 2010 Sep 4.0% [5] 2015 104.1
[6] 2010 Oct 3.0% [6] 2016 104.6
[7] 2010 Nov 3.0% [7] 2017 105.1
[8] 2010 Dec 3.0% [8] 2018 105.6
[9] 2011 Jan 4.0% [9] 2019 106.2
[10] 2011 Feb 4.0% [10] 2020 106.7
[11] 2011 Mar 4.0% [11] 2021 107.2
[12] 2011 Apr 5.0% [12] 2018 107.8
[13] 2011 May 5.0% [13] 2019 108.3
[14] 2011 Jun 5.0% [14] 2020 108.8
[15] 2011 Jul 5.0% [15] 2021 109.4
The first part of the question/request: if creating a new_table_a (based on the yearly data of table_a) the data from table_b needs to be interpolated to populate the new table. As an example the column h_index in table_a has been interpolated into new_table_a as h_index2 (see below, new_table_a). Also, please note that in the example the data started in 2011 so prior to 2011 that h_index2 shows the character '-'.
new_table_a
year month infltn h_index2
[1] 2010 May 4% -
[2] 2010 Jun 4% -
[3] 2010 Jul 4% -
[4] 2010 Aug 4% -
[5] 2010 Sep 4% -
[6] 2010 Oct 3% -
[7] 2010 Nov 3% -
[8] 2010 Dec 3% -
[9] 2011 Jan 4% 102.0%
[10] 2011 Feb 4% 102.1%
[11] 2011 Mar 4% 102.1%
[12] 2011 Apr 5% 102.2%
[13] 2011 May 5% 102.2%
[14] 2011 Jun 5% 102.3%
[15] 2011 Jul 5% 102.3%
The second part of the question/request: if creating a new_table_b based on the monthly data of table_b, how do you average the calender months for the respective year so that you can populate new_table_b with an average (from infltn in table_a to infltn_avg in new_table_b). Example:
new_table_b
Year h_index infltn_avg
[1] 2011 102.0 4.0%
[2] 2012 102.5 4.0%
[3] 2013 103.0 4.0%
[4] 2014 103.6 4.0%
[5] 2015 104.1 4.0%
[6] 2016 104.6 3.0%
[7] 2017 105.1 3.0%
[8] 2018 105.6 3.0%
[9] 2019 106.2 4.0%
[10] 2020 106.7 4.0%
[11] 2021 107.2 4.0%
[12] 2018 107.8 5.0%
[13] 2019 108.3 5.0%
[14] 2020 108.8 5.0%
[15] 2021 109.4 5.0%
I have tried to code this using a range of loops and if's, which would seem rational in something like VBA, although it a bit clunky in R (and doesnt make much use of the tidyverse packages, which I am still learning) and I am sure that there is a more intelligent way of approaching it.
Thank you for reading!
CodePudding user response:
Here is a way to achieve this:
table_b$month ="Jan"
new_table_a = table_a %>% left_join(table_b, by =c("year", "month"))
new_table_a$h_index2 = na.approx(new_table_a$h_index, na.rm = FALSE)
new_table_a$infltn = as.numeric(sub("%", "", new_table_a$infltn))
new_table_b = new_table_a %>% group_by(year) %>% summarise(h_index2= min(h_index2), infltn_avg = mean(infltn))
new_table_a
year month infltn h_index h_index2
1 2010 May 4 NA NA
2 2010 Jun 4 NA NA
3 2010 Jul 4 NA NA
4 2010 Aug 4 NA NA
5 2010 Sep 4 NA NA
6 2010 Oct 3 NA NA
7 2010 Nov 3 NA NA
8 2010 Dec 3 NA NA
9 2011 Jan 4 102.0 102.0000
10 2011 Feb 4 NA 102.0417
11 2011 Mar 4 NA 102.0833
12 2011 Apr 5 NA 102.1250
13 2011 May 5 NA 102.1667
14 2011 Jun 5 NA 102.2083
15 2011 Jul 5 NA 102.2500
16 2011 Aug 5 NA 102.2917
17 2011 Sep 5 NA 102.3333
18 2011 Oct 5 NA 102.3750
19 2011 Nov 5 NA 102.4167
20 2011 Dec 5 NA 102.4583
21 2012 Jan 5 102.5 102.5000
new_table_b
# A tibble: 6 x 3
year h_index2 infltn_avg
<int> <dbl> <dbl>
1 2010 NA 3.62
2 2011 102 4.75
3 2012 102. 5
4 2013 103 5
5 2014 104. 6
6 2015 104. 6
Details
First, to facilitate the extrapolation of h_index
we add a month
column in the table_b
set to "Jan"
table_b$month ="Jan"
Then we join the two data frames by year
and month
new_table_a = table_a %>% left_join(table_b, by =c("year", "month"))
For the extrapolation you can use na.approx
from the zoo
package. h_index2
is the extrapolation between 2 known h_index
. Note: As this column is numeric it's best to keep the NA rather than adding a "-".
library(zoo)
new_table_a$h_index2 = na.approx(new_table_a$h_index, na.rm = FALSE)
Now to generate new_table_b
, we need the average percentage per year. For this we have to transform the infltn
column into numeric value by removing the "%":
new_table_a$infltn = as.numeric(sub("%", "", new_table_a$infltn))
Finally, we just need to summarize the new_table_a
to have the new_table_b
. After grouping by year, we take the lowest value for min(h_index2)
and do the average of monthly inflation mean(infltn)
new_table_b = new_table_a %>% group_by(year) %>% summarize(h_index2= min(h_index2), infltn_avg = mean(infltn))
Data
table_a = read.table(text="year month infltn
1 2010 May 4.00%
2 2010 Jun 4.00%
3 2010 Jul 4.00%
4 2010 Aug 4.00%
5 2010 Sep 4.00%
6 2010 Oct 3.00%
7 2010 Nov 3.00%
8 2010 Dec 3.00%
9 2011 Jan 4.00%
10 2011 Feb 4.00%
11 2011 Mar 4.00%
12 2011 Apr 5.00%
13 2011 May 5.00%
14 2011 Jun 5.00%
15 2011 Jul 5.00%
16 2011 Aug 5.00%
17 2011 Sep 5.00%
18 2011 Oct 5.00%
19 2011 Nov 5.00%
20 2011 Dec 5.00%
21 2012 Jan 5.00%
22 2012 Feb 5.00%
23 2012 Mar 5.00%
24 2012 Apr 5.00%
25 2012 May 5.00%
26 2012 Jun 5.00%
27 2012 Jul 5.00%
28 2012 Aug 5.00%
29 2012 Sep 5.00%
30 2012 Oct 5.00%
31 2012 Nov 5.00%
32 2012 Dec 5.00%
33 2013 Jan 5.00%
34 2013 Feb 5.00%
35 2013 Mar 5.00%
36 2013 Apr 5.00%
37 2013 May 5.00%
38 2013 Jun 5.00%
39 2013 Jul 5.00%
40 2013 Aug 5.00%
41 2013 Sep 5.00%
42 2013 Oct 5.00%
43 2013 Nov 5.00%
44 2013 Dec 5.00%
45 2014 Jan 6.00%
46 2014 Feb 6.00%
47 2014 Mar 6.00%
48 2014 Apr 6.00%
49 2014 May 6.00%
50 2014 Jun 6.00%
51 2014 Jul 6.00%
52 2014 Aug 6.00%
53 2014 Sep 6.00%
54 2014 Oct 6.00%
55 2014 Nov 6.00%
56 2014 Dec 6.00%
57 2015 Jan 6.00%", header= TRUE)
table_b = read.table(text="year h_index
1 2011 102.0
2 2012 102.5
3 2013 103.0
4 2014 103.6
5 2015 104.1
6 2016 104.6
7 2017 105.1
8 2018 105.6
9 2019 106.2
10 2020 106.7
11 2021 107.2
12 2018 107.8
13 2019 108.3
14 2020 108.8
15 2021 109.4", header = TRUE )