I have a table:
currency | date | y |
---|---|---|
EUR | 2019-02-25 | 0,1 |
EUR | 2019-02-25 | 0,2 |
EUR | 2019-02-25 | 0,2 |
USD | 2019-02-25 | 0,1 |
USD | 2019-02-25 | 0,1 |
USD | 2019-02-25 | 0,15 |
RON | 2019-02-25 | 0,1 |
RON | 2019-02-25 | 0,2 |
RON | 2019-02-25 | 0,3 |
RON | 2019-02-25 | 0,1 |
RON | 2019-02-25 | 0,1 |
EUR | 2019-03-01 | 0,15 |
EUR | 2019-03-01 | 0,1 |
USD | 2019-03-01 | 0,1 |
USD | 2019-03-01 | 0,25 |
USD | 2019-03-01 | 0,3 |
RON | 2019-03-01 | 0,1 |
RON | 2019-03-01 | 0,1 |
RON | 2019-03-01 | 0,15 |
RON | 2019-03-01 | 0,1 |
RON | 2019-03-01 | 0,2 |
library(data.table)
dt <- data.table(currency = c("EUR","EUR","EUR","USD","USD","USD", "RON","RON","RON","RON","RON","EUR","EUR","USD","USD","USD", "RON","RON","RON","RON","RON"), date = c("2019-02-25","2019-02-25","2019-02-25","2019-02-25","2019-02-25","2019-02-25", "2019-02-25","2019-02-25","2019-02-25","2019-02-25","2019-02-25","2019-03-01","2019-03-01","2019-03-01","2019-03-01","2019-03-01", "2019-03-01","2019-03-01","2019-03-01","2019-03-01","2019-03-01"), y = c("0,1","0,2","0,2","0,1","0,1","0,15","0,1","0,2","0,3","0,1","0,1","0,15","0,1","0,1","0,25","0,3","0,1","0,1","0,15","0,1","0,2"))
dt
I need to add a column "N" that numbers the rows of each currency within each date, from 1 up to the maximum number of rows any currency has (here the maximum is 5, the number of RON rows per date). Every currency should be enumerated from 1 to this maximum. If a currency has fewer rows, rows should be added in which the value of column "y" is missing (NA).
So I need code that produces the following table:
currency | date | y | N |
---|---|---|---|
EUR | 2019-02-25 | 0,1 | 1 |
EUR | 2019-02-25 | 0,2 | 2 |
EUR | 2019-02-25 | 0,2 | 3 |
EUR | 2019-02-25 | Na | 4 |
EUR | 2019-02-25 | Na | 5 |
USD | 2019-02-25 | 0,1 | 1 |
USD | 2019-02-25 | 0,1 | 2 |
USD | 2019-02-25 | 0,15 | 3 |
USD | 2019-02-25 | Na | 4 |
USD | 2019-02-25 | Na | 5 |
RON | 2019-02-25 | 0,1 | 1 |
RON | 2019-02-25 | 0,2 | 2 |
RON | 2019-02-25 | 0,3 | 3 |
RON | 2019-02-25 | 0,1 | 4 |
RON | 2019-02-25 | 0,1 | 5 |
EUR | 2019-03-01 | 0,15 | 1 |
EUR | 2019-03-01 | 0,1 | 2 |
EUR | 2019-03-01 | Na | 3 |
EUR | 2019-03-01 | Na | 4 |
EUR | 2019-03-01 | Na | 5 |
USD | 2019-03-01 | 0,1 | 1 |
USD | 2019-03-01 | 0,25 | 2 |
USD | 2019-03-01 | 0,3 | 3 |
USD | 2019-03-01 | Na | 4 |
USD | 2019-03-01 | Na | 5 |
RON | 2019-03-01 | 0,1 | 1 |
RON | 2019-03-01 | 0,1 | 2 |
RON | 2019-03-01 | 0,15 | 3 |
RON | 2019-03-01 | 0,1 | 4 |
RON | 2019-03-01 | 0,2 | 5 |
I have no idea how to do it, so it would be great if you could help! Thanks!
CodePudding user response:
This is a perfect opportunity for tidyr::complete.
library(dplyr)
library(tidyr)
dt |>
group_by(currency, date) |>
mutate(N = row_number()) |>
ungroup() |>
complete(currency, date, N) |>
arrange(date, currency, N)
# # A tibble: 30 x 4
# currency date N y
# <chr> <chr> <int> <chr>
# 1 EUR 2019-02-25 1 0,1
# 2 EUR 2019-02-25 2 0,2
# 3 EUR 2019-02-25 3 0,2
# 4 EUR 2019-02-25 4 NA
# 5 EUR 2019-02-25 5 NA
# 6 RON 2019-02-25 1 0,1
# 7 RON 2019-02-25 2 0,2
# 8 RON 2019-02-25 3 0,3
# 9 RON 2019-02-25 4 0,1
# 10 RON 2019-02-25 5 0,1
# # ... with 20 more rows
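Since the question builds its table with data.table, the same idea can also be sketched without tidyr. This is only one possible approach, not something from the answer above: it numbers the rows per group, builds the full grid of combinations with CJ(), and joins the original table onto that grid. The names grid and out are made up for illustration, and the sample data is rebuilt compactly with rep() so the block runs on its own.

```r
library(data.table)

# Rebuild the question's sample data (same rows, written compactly)
dt <- data.table(
  currency = rep(c("EUR", "USD", "RON", "EUR", "USD", "RON"),
                 times = c(3, 3, 5, 2, 3, 5)),
  date = rep(c("2019-02-25", "2019-03-01"), times = c(11, 10)),
  y = c("0,1", "0,2", "0,2", "0,1", "0,1", "0,15",
        "0,1", "0,2", "0,3", "0,1", "0,1",
        "0,15", "0,1", "0,1", "0,25", "0,3",
        "0,1", "0,1", "0,15", "0,1", "0,2")
)

# Number the rows within each currency/date group
dt[, N := seq_len(.N), by = .(currency, date)]

# Every combination of currency, date and 1..max(N)
grid <- CJ(currency = unique(dt$currency),
           date = unique(dt$date),
           N = seq_len(max(dt$N)))

# Join dt onto the grid; combinations absent from dt get y = NA
out <- dt[grid, on = .(currency, date, N)]
out
```

Because CJ() sorts its inputs by default, the rows come back ordered by currency, then date, then N, so the currencies appear alphabetically rather than in the order used in the question.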
CodePudding user response:
You could use the rle function provided in base R like so:
instances = rle(dt$currency)
# lapply always returns a list, so unlist gives a plain vector even when all runs have equal length
dt$N = unlist(lapply(instances$lengths, seq_len))
RLE stands for run-length encoding. The rle function returns each value and the count of its successive occurrences, one entry for each 'run' of values in a vector. Once we have this, we access the counts through the lengths element of instances.
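To make the run-length idea concrete, here is a small sketch on a made-up vector (x and runs are illustrative names, not from the question). It also shows base R's sequence(), which builds the per-run counters directly. Note that this approach only numbers the existing rows and assumes each group forms one unbroken run; if the same currency appeared on two consecutive dates with nothing in between, the two groups would merge into a single run.

```r
# A toy vector with three runs
x <- c("EUR", "EUR", "EUR", "USD", "USD", "RON")

runs <- rle(x)
runs$values   # the value of each run: "EUR" "USD" "RON"
runs$lengths  # the length of each run: 3 2 1

# Count from 1 within each run
counter <- sequence(runs$lengths)
counter       # 1 2 3 1 2 1
```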