Home > Net >  For loop for dataframes in R
For loop for dataframes in R

Time:12-30

I am trying to do a function of decumulation with a for loop in R because the financial information provided by the company is accumulated for different concepts (this means that the info of January is only of January, the info of February is the sum of January and February, the one of March is the sum of January, February and March, etc.).

For example, let's say that I have the next dataframe:

Concepts <- c("Concept1", "Concept2", "Concept3")
January <- c(5,10,16)
February <- c(9,14,20)
March <- c(16,20,23)

df <- data.frame(Concepts, January, February, March)

This will give me the next dataframe:

Concepts  January  February  March
Concept1    5         9        16 
Concept2    10        14       20
Concept3    16        20       23 

What I need to achieve is the next dataframe (Notice that February is the difference between February and January, and March is the difference between February and March):

Concepts  January  February  March
Concept1    5         4        7 
Concept2    10        4        6
Concept3    16        4        3

To achieve the second dataframe, I first created an empty dataframe with the same amount of rows of df, then with a for loop cbind the first two rows of the dataframe (because they do not need any manipulation) and with the index add the next ones after calculated the difference. The above in code is as follows:

df <- data.frame(Concepts, January, February, March)
df2 <- data.frame(matrix(nrow=nrow(df),ncol=ncol(df))) #Empty Dataframe with the same number  of rows

for(i in 1:ncol(df)) {
  if(i == 1){
    df2 <- cbind(df2, df[ , i])
  } else if (i == 2){
    df2 <- cbind(df2, df[, i])
  } else {
    diference <- df[,i] - df[,i-1]
    df2 <- cbind(df2,diference)
  }

I get the following error:

error in [.data.table(df, , i) : j (the 2nd argument inside [...]) is a single symbol but column name 'i' is not found. Perhaps you intended DT[, ..i]. This difference to data.frame is deliberate and explained in FAQ 1.1.

I would love to receive a correction to my code or some alternative that allows me to calculate the above for a dataframe of many years.

CodePudding user response:

First note that if you apply base function diff to the months columns, you will get one column less but transposed.

apply(df[-1], 1, diff)
#         [,1] [,2] [,3]
#February    4    4    4
#March       7    6    4

So transpose it to get the right orientation.

t(apply(df[-1], 1, diff))
#     February March
#[1,]        4     7
#[2,]        4     6
#[3,]        4     4

And cbind it with the first 2 columns. Since the first argument is a subset of a data.frame, the method called is cbind.data.frame and the result is also a df.

cbind(df[1:2], t(apply(df[-1], 1, diff)))
#  Concepts January February March
#1 Concept1       5        4     7
#2 Concept2      10        4     6
#3 Concept3      15        4     4

CodePudding user response:

This may not be the most elegant, but it should work. The trick is to extract the numeric portion of the data frame and apply diff—by row—to the results, transpose it, and paste it back to the initial values.

df <- data.frame(Concepts = c("Concept1", "Concept2", "Concept3"),
                 January = c(5,10,16),
                 February = c(9,14,20),
                 March = c(16,20,23),
                 April = c(20, 27, 33))

dfdiff <- apply(df[, -1L], 1L, diff)
df2 <- data.frame(Concepts = c("Concept1", "Concept2", "Concept3"),
                  January = c(5,10,16))
df2 <- cbind(df2, t(dfdiff))                  
df2
  Concepts January February March April
1 Concept1       5        4     7     4
2 Concept2      10        4     6     7
3 Concept3      16        4     3    10

Now that you know how it works, for a more efficent call you could just do:

df2 <- cbind(df[, 1:2], t(apply(df[, -1L], 1L, diff)))

Which should work for any size dataframe of the structre you had above: One heading column and the remainder being cumulative data columns.

Speed comparison with tidyverse method

microbenchmark(TV = df2 <- df %>% pivot_longer(!Concepts) %>% group_by(Concepts) %>%
                 dplyr::mutate(value2 = value - lag(value, default = first(value))) %>%
                 rowwise %>%mutate(value2 = ifelse(value2 == 0, value, value2)) %>%
                 select(-value) %>%pivot_wider(names_from = "name", values_from = "value2"),
               BASE = df2 <- cbind(df[, 1:2], t(apply(df[, -1L], 1L, diff))),
               times = 1000L, control = list(order = 'block'))

Unit: microseconds
 expr     min       lq       mean   median       uq     max neval cld
   TV 11141.7 11554.05 12245.1253 11803.75 12300.25 22903.5  1000   b
 BASE   160.4   164.35   176.8356   165.70   168.70  3833.2  1000  a 

CodePudding user response:

A tidyverse solution:

library(tidyverse)

df %>%
  pivot_longer(!Concepts) %>%
  group_by(Concepts) %>%
  mutate(value2 = value - lag(value, default = first(value))) %>%
  rowwise %>%
  mutate(value2 = ifelse(value2 == 0, value, value2)) %>%
  select(-value) %>%
  pivot_wider(names_from = "name", values_from = "value2")

Output

# A tibble: 3 × 4
  Concepts January February March
  <chr>      <dbl>    <dbl> <dbl>
1 Concept1       5        4     7
2 Concept2      10        4     6
3 Concept3      16        4     3
  • Related