I have a data frame with two rows that represent passengers loading onto the bus and passengers leaving the bus:
A B C D E F
In 9 10 6 9 14 10
Out 0 1 2 3 4 3
I want to compute two more rows: the first is the number of passengers on the bus when it arrives at Station A/B/C/etc., and the second is the number of passengers on the bus when it leaves that station.
Each number in row 1 should equal the previous station's number in row 2. Row 2 for Station B, for example, is 9 (the number of people who remained on the bus from the previous stop) + BIn (the number of people getting on at the stop) - BOut (the number of people getting off the bus at the stop).
The final result should look like:
A B C D E F
In 9 10 6 9 14 10
Out 0 1 2 3 4 3
1 0 9 18 22 28 38
2 9 18 22 28 38 45
How would I iterate through the data frame so that I can obtain these numbers? Is a for loop needed, or is there an easier way to go through this calculation?
CodePudding user response:
First, I think it makes more sense to have these as columns rather than rows. That way you can take advantage of the vectorized operations in R.
library(data.table)
df <- suppressWarnings(fread('
A B C D E F
In 9 10 6 9 14 10
Out 0 1 2 3 4 3'))
setDT(df) # only required if not starting with a data.table
df
#> V1 A B C D E F
#> <char> <int> <int> <int> <int> <int> <int>
#> 1: In 9 10 6 9 14 10
#> 2: Out 0 1 2 3 4 3
df_tp <- transpose(df, make.names = 'V1', keep.names = 'station')
df_tp
#> station In Out
#> <char> <int> <int>
#> 1: A 9 0
#> 2: B 10 1
#> 3: C 6 2
#> 4: D 9 3
#> 5: E 14 4
#> 6: F 10 3
Now your last row is the cumulative sum of In minus the cumulative sum of Out. The other one is just a lagged version of that.
df_tp[, net := cumsum(In) - cumsum(Out)]
df_tp[, lagged_net := shift(net, fill = 0)]
df_tp
#> station In Out net lagged_net
#> <char> <int> <int> <int> <int>
#> 1: A 9 0 9 0
#> 2: B 10 1 18 9
#> 3: C 6 2 22 18
#> 4: D 9 3 28 22
#> 5: E 14 4 38 28
#> 6: F 10 3 45 38
Created on 2021-12-07 by the reprex package (v2.0.1)
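To see why the cumulative sums give the right answer, the stop-by-stop rule from the question (previous count + In - Out) can be unrolled and compared with the vectorized formula. This is only a minimal base R sketch, no packages needed; the names `n_in`, `n_out`, `by_recurrence` and `by_cumsum` are made up for illustration:

```r
n_in  <- c(A = 9, B = 10, C = 6, D = 9, E = 14, F = 10)
n_out <- c(A = 0, B = 1, C = 2, D = 3, E = 4, F = 3)

# stop-by-stop rule: the bus starts empty, and each stop adds its net change
by_recurrence <- Reduce(function(on_board, net) on_board + net,
                        n_in - n_out, accumulate = TRUE)

# vectorized formula used for the net column above
by_cumsum <- cumsum(n_in) - cumsum(n_out)

# both approaches should agree at every station
all.equal(by_recurrence, unname(by_cumsum))
```

Because each stop only adds In - Out to the running total, the total after any stop is the sum of all In so far minus the sum of all Out so far, which is what the two cumsum calls compute.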
CodePudding user response:
I think you should take @IceCreamToucan's advice and answer, but if you want to keep the same structure for specific reasons, this inelegant, brute-force for loop will produce your desired output:
df <- data.frame(A = c(9, 0),
                 B = c(10, 1),
                 C = c(6, 2),
                 D = c(9, 3),
                 E = c(14, 4),
                 F = c(10, 3))
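for (i in seq_len(ncol(df))) {
  if (i == 1) {
    # first stop: the bus arrives empty and leaves with In - Out passengers
    df[3:4, 1] <- c(0, df[1, 1] - df[2, 1])
  } else {
    df[3, i] <- df[4, i - 1]                        # on board at arrival
    df[4, i] <- df[4, i - 1] + df[1, i] - df[2, i]  # on board at departure
  }
}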
df
# A B C D E F
#1 9 10 6 9 14 10
#2 0 1 2 3 4 3
#3 0 9 18 22 28 38
#4 9 18 22 28 38 45
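If the wide layout has to stay, the loop can probably be avoided as well, since the same cumulative-sum idea works along the rows. This is a rough base R sketch that assumes the first row is In and the second is Out, as in the question; the row labels `arriving` and `leaving` are names I made up:

```r
df <- data.frame(A = c(9, 0), B = c(10, 1), C = c(6, 2),
                 D = c(9, 3), E = c(14, 4), F = c(10, 3),
                 row.names = c("In", "Out"))

# pull each row out as a plain numeric vector
n_in  <- unlist(df["In", ],  use.names = FALSE)
n_out <- unlist(df["Out", ], use.names = FALSE)

# passengers on board when leaving each station
leaving <- cumsum(n_in) - cumsum(n_out)
# passengers on board when arriving: the previous station's count, 0 at the first stop
arriving <- c(0, head(leaving, -1))

# append the two new rows; rbind fills the columns by position
res <- rbind(df, arriving = arriving, leaving = leaving)
res
```

The two new rows should hold the same numbers as rows 3 and 4 produced by the loop.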
CodePudding user response:
Or, for a tidyverse way to do it:
Load data in the format you shared it:
library(tidyverse)
df <- data.frame(A = c(9, 0),
                 B = c(10, 1),
                 C = c(6, 2),
                 D = c(9, 3),
                 E = c(14, 4),
                 F = c(10, 3))
> df
A B C D E F
1 9 10 6 9 14 10
2 0 1 2 3 4 3
Transform to long format:
df <- as_tibble(t(df), rownames = "row_names") %>%
rename('In' = V1, 'Out' = V2)
> df
# A tibble: 6 x 3
row_names In Out
<chr> <dbl> <dbl>
1 A 9 0
2 B 10 1
3 C 6 2
4 D 9 3
5 E 14 4
6 F 10 3
Add the variables you want with cumsum and lag:
# assign the result back so that printing df shows the new columns
df <- df %>% mutate(net = cumsum(In) - cumsum(Out),
                    lag = replace_na(lag(net), 0))
> df
# A tibble: 6 x 5
row_names In Out net lag
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 9 0 9 0
2 B 10 1 18 9
3 C 6 2 22 18
4 D 9 3 28 22
5 E 14 4 38 28
6 F 10 3 45 38