enumerate instances in R and add a new column

Time:07-06

I have a table:

currency date y
EUR 2019-02-25 0,1
EUR 2019-02-25 0,2
EUR 2019-02-25 0,2
USD 2019-02-25 0,1
USD 2019-02-25 0,1
USD 2019-02-25 0,15
RON 2019-02-25 0,1
RON 2019-02-25 0,2
RON 2019-02-25 0,3
RON 2019-02-25 0,1
RON 2019-02-25 0,1
EUR 2019-03-01 0,15
EUR 2019-03-01 0,1
USD 2019-03-01 0,1
USD 2019-03-01 0,25
USD 2019-03-01 0,3
RON 2019-03-01 0,1
RON 2019-03-01 0,1
RON 2019-03-01 0,15
RON 2019-03-01 0,1
RON 2019-03-01 0,2
library(data.table)

dt <- data.table(
  currency = c("EUR","EUR","EUR","USD","USD","USD","RON","RON","RON","RON","RON",
               "EUR","EUR","USD","USD","USD","RON","RON","RON","RON","RON"),
  date = c("2019-02-25","2019-02-25","2019-02-25","2019-02-25","2019-02-25","2019-02-25",
           "2019-02-25","2019-02-25","2019-02-25","2019-02-25","2019-02-25",
           "2019-03-01","2019-03-01","2019-03-01","2019-03-01","2019-03-01",
           "2019-03-01","2019-03-01","2019-03-01","2019-03-01","2019-03-01"),
  y = c("0,1","0,2","0,2","0,1","0,1","0,15","0,1","0,2","0,3","0,1","0,1",
        "0,15","0,1","0,1","0,25","0,3","0,1","0,1","0,15","0,1","0,2")
)
dt

I need to add a column "N" that numbers the rows within each currency and date, from 1 up to the largest group size (here 5, the number of RON rows per date). Every currency should be numbered from 1 to this maximum. If a currency has fewer rows on a given date, the missing rows should be added with NA in column "y".

So I need code that produces the following table:

currency date y N
EUR 2019-02-25 0,1 1
EUR 2019-02-25 0,2 2
EUR 2019-02-25 0,2 3
EUR 2019-02-25 Na 4
EUR 2019-02-25 Na 5
USD 2019-02-25 0,1 1
USD 2019-02-25 0,1 2
USD 2019-02-25 0,15 3
USD 2019-02-25 Na 4
USD 2019-02-25 Na 5
RON 2019-02-25 0,1 1
RON 2019-02-25 0,2 2
RON 2019-02-25 0,3 3
RON 2019-02-25 0,1 4
RON 2019-02-25 0,1 5
EUR 2019-03-01 0,15 1
EUR 2019-03-01 0,1 2
EUR 2019-03-01 Na 3
EUR 2019-03-01 Na 4
EUR 2019-03-01 Na 5
USD 2019-03-01 0,1 1
USD 2019-03-01 0,25 2
USD 2019-03-01 0,3 3
USD 2019-03-01 Na 4
USD 2019-03-01 Na 5
RON 2019-03-01 0,1 1
RON 2019-03-01 0,1 2
RON 2019-03-01 0,15 3
RON 2019-03-01 0,1 4
RON 2019-03-01 0,2 5

I have no idea how to do it, so it would be great if you could help! Thanks!

CodePudding user response:

This is a perfect opportunity for tidyr::complete.

library(dplyr)
library(tidyr)

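# Number the rows within each currency/date group, then let complete()
# add every missing currency/date/N combination with NA in y.
dt |>
    group_by(currency, date) |>
    mutate(N = row_number()) |>
    ungroup() |>
    complete(currency, date, N) |>
    arrange(date, currency, N)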

# # A tibble: 30 x 4
#    currency date           N y    
#    <chr>    <chr>      <int> <chr>
#  1 EUR      2019-02-25     1 0,1
#  2 EUR      2019-02-25     2 0,2
#  3 EUR      2019-02-25     3 0,2
#  4 EUR      2019-02-25     4 NA
#  5 EUR      2019-02-25     5 NA
#  6 RON      2019-02-25     1 0,1
#  7 RON      2019-02-25     2 0,2
#  8 RON      2019-02-25     3 0,3
#  9 RON      2019-02-25     4 0,1
# 10 RON      2019-02-25     5 0,1
# # ... with 20 more rows
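
Since the question builds its table with data.table, the same idea can be sketched there too. This is only one possible approach, not part of the answer above: it rebuilds the question's table compactly with rep(), numbers the rows with rowid(), and joins onto every currency/date/N combination produced by CJ(). The name res is made up for illustration.

```r
library(data.table)

# The question's table, rebuilt compactly (21 rows).
dt <- data.table(
  currency = rep(c("EUR", "USD", "RON", "EUR", "USD", "RON"),
                 times = c(3, 3, 5, 2, 3, 5)),
  date     = rep(c("2019-02-25", "2019-03-01"), times = c(11, 10)),
  y        = c("0,1", "0,2", "0,2", "0,1", "0,1", "0,15",
               "0,1", "0,2", "0,3", "0,1", "0,1",
               "0,15", "0,1", "0,1", "0,25", "0,3",
               "0,1", "0,1", "0,15", "0,1", "0,2")
)

# Number the rows within each currency/date group.
dt[, N := rowid(currency, date)]

# Cross join of all unique currency, date and N values; joining dt onto it
# leaves NA in y wherever a combination has no matching row.
res <- dt[CJ(currency = currency, date = date, N = N, unique = TRUE),
          on = .(currency, date, N)]
res
```

CJ() sorts its output, so res comes back ordered by currency, then date, then N; reorder it afterwards if the date-first layout from the question is needed.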

CodePudding user response:

You could use the rle function provided in base R like so:

# Lengths of each run of consecutive identical currencies
instances <- rle(dt$currency)

# Expand each run length n into 1..n to get the counter
dt$N <- sequence(instances$lengths)

RLE stands for run-length encoding. The function returns the values and the lengths of each run of consecutive identical values in a vector. The run lengths are stored in the lengths element of instances, and expanding each length n into 1..n gives the counter.
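
To make the run idea concrete, here is a small sketch on a made-up vector (x, runs and counter are illustrative names, not from the question). Note that rle only sees consecutive repeats, so the same currency appearing in two separate runs is counted separately, and two adjacent groups with the same currency would be merged into one run. It also only numbers existing rows; it does not add the missing NA rows the question asks for.

```r
# A short vector with three runs: EUR x2, USD x3, EUR x1.
x <- c("EUR", "EUR", "USD", "USD", "USD", "EUR")

runs <- rle(x)
runs$values   # "EUR" "USD" "EUR"
runs$lengths  # 2 3 1

# sequence() expands each run length n into 1..n and concatenates the results.
counter <- sequence(runs$lengths)
counter       # 1 2 1 2 3 1
```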
