Efficient way to repeat operations with columns with similar name in R

Time:08-10

I am a beginner with R and have found myself repeatedly running into a problem of this kind. Say I have a dataframe with columns:

company, shares_2010, shares_2011, ... , shares_2020, share_price_2010, ... , share_price_2020
TeslaInc     1000          1200               2000          8                    40        
.
.
.      

I then want to go ahead and calculate the market value in each year. Ordinarily I would do it this way:

dataframe <- dataframe %>%
    mutate(value_2010 = shares_2010*share_price_2010,
           value_2011 = shares_2011*share_price_2011,
                     .                             
                     :                                
           value_2020 = shares_2020*share_price_2020)

Clearly, this is cumbersome to type out each time, and it cannot adapt to the number of time periods included. Is there a clever way to do these operations in one line instead? I suspect something may be possible with a combination of starts_with() and a lambda function, but I haven't been able to figure out how to make the correct columns multiply. Surely the tidyverse has a better way to do this?

Any help is much appreciated!

CodePudding user response:

You're right, this is a very common situation in data management.

Let's make a minimal, reproducible example:

dat <- data.frame(
  company = c("TeslaInc", "Merta"),
  shares_2010 = c(1000L, 1500L),
  shares_2011 = c(1200L, 1100L),
  shareprice_2010 = 8:7,
  shareprice_2011 = c(40L, 12L)
)

dat
#>    company shares_2010 shares_2011 shareprice_2010 shareprice_2011
#> 1 TeslaInc        1000        1200               8              40
#> 2    Merta        1500        1100               7              12

This dataset has two issues:

  1. It's in a wide format. This is relatively easy for humans to read, but it's not ideal for data analysis. We can fix this with pivot_longer() from tidyr.
  2. Each column name actually encodes two variables: the measure (shares or share price) and the year. We can fix this with separate() from the same package.

library(tidyr)

dat_reshaped <- dat |>
  pivot_longer(shares_2010:shareprice_2011) |> 
  separate(name, into = c("name", "year")) |> 
  pivot_wider(names_from = name, values_from = value)

dat_reshaped
#> # A tibble: 4 × 4
#>   company  year  shares shareprice
#>   <chr>    <chr>  <int>      <int>
#> 1 TeslaInc 2010    1000          8
#> 2 TeslaInc 2011    1200         40
#> 3 Merta    2010    1500          7
#> 4 Merta    2011    1100         12

The last pivot_wider() is needed to have shares and shareprice as two separate columns, for ease of further calculations.

We can finally use mutate() to calculate in one go all the new values.

dat_reshaped |>
  dplyr::mutate(value = shares * shareprice)
#> # A tibble: 4 × 5
#>   company  year  shares shareprice value
#>   <chr>    <chr>  <int>      <int> <int>
#> 1 TeslaInc 2010    1000          8  8000
#> 2 TeslaInc 2011    1200         40 48000
#> 3 Merta    2010    1500          7 10500
#> 4 Merta    2011    1100         12 13200
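If the result is needed back in the original wide layout (one value_YEAR column per year, as in the question), a second pivot_wider() can spread the long table out again. This is only a sketch under the assumption that the column names follow the measure_year pattern of the example above; dat_wide is a name made up for illustration.

```r
library(tidyr)
library(dplyr)

# Same example data as above
dat <- data.frame(
  company = c("TeslaInc", "Merta"),
  shares_2010 = c(1000L, 1500L),
  shares_2011 = c(1200L, 1100L),
  shareprice_2010 = 8:7,
  shareprice_2011 = c(40L, 12L)
)

dat_wide <- dat |>
  pivot_longer(-company) |>                            # one row per company and column
  separate(name, into = c("name", "year")) |>          # split "shares_2010" into measure and year
  pivot_wider(names_from = name, values_from = value) |>
  mutate(value = shares * shareprice) |>               # one calculation covers every year
  # spread back out: columns are named <measure>_<year>
  pivot_wider(names_from = year, values_from = c(shares, shareprice, value))

names(dat_wide)
```

The long table is usually the more convenient one to keep working with, so this last step is only worth it when a wide table is required for reporting or export.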

I recommend you read this chapter of R4DS to better understand these concepts - it's worth the effort!

CodePudding user response:

I think further analysis will be simpler if you reshape your data long.

Here, we can extract the measure (shares or share_price) and the year from the column names using pivot_longer(). I specify that each header should be split into two pieces at its last _: the first piece (shares or share_price) becomes the name of a value column (that is what .value means), and the second piece becomes the year.

Then the calculation is a simple one-liner.

library(tidyr); library(dplyr)
data.frame(company = "Tesla", 
           shares_2010 = 5, shares_2011 = 6,
           share_price_2010 = 100, share_price_2011 = 110) %>%

  pivot_longer(-company, 
               names_to = c(".value", "year"), 
               names_pattern = "(.*)_(.*)") %>%
  mutate(value = shares * share_price)


# A tibble: 2 × 5
  company year  shares share_price value
  <chr>   <chr>  <dbl>       <dbl> <dbl>
1 Tesla   2010       5         100   500
2 Tesla   2011       6         110   660
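Note that year comes out as a character column. If it should be numeric (for filtering or plotting, say), pivot_longer() has a names_transform argument that converts it during the reshape. A minimal sketch, assuming the year is always the four digits at the end of the column name; the stricter names_pattern and the name long are my own choices here, not part of the answer above.

```r
library(tidyr)
library(dplyr)

df <- data.frame(company = "Tesla",
                 shares_2010 = 5, shares_2011 = 6,
                 share_price_2010 = 100, share_price_2011 = 110)

long <- df %>%
  pivot_longer(-company,
               names_to = c(".value", "year"),
               # assumption: every header ends in _<four-digit year>
               names_pattern = "(.*)_(\\d{4})$",
               # convert the extracted year from character to integer
               names_transform = list(year = as.integer)) %>%
  mutate(value = shares * share_price)

# year can now be compared numerically
filter(long, year >= 2011)
```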

CodePudding user response:

I agree with the other posts about pivoting this data into a longer format. Just to add a different approach that works well with this type of example: you can create a list of expressions and then use the splice operator !!! to evaluate these expressions within your context:

library(purrr)
library(dplyr)
library(rlang)
library(glue)

lexprs <- set_names(2010:2011, paste0("value_", 2010:2011)) %>% 
  map_chr(~ glue("shares_{.x} * share_price_{.x}")) %>% 
  parse_exprs()

df %>% 
  mutate(!!! lexprs)

Output

   company shares_2010 shares_2011 share_price_2010 share_price_2011 value_2010
1 TeslaInc        1000        1200                8               40       8000
2    Merta        1500        1100                7               12      10500
  value_2011
1      48000
2      13200

Data

Thanks to Andrea M

structure(list(company = c("TeslaInc", "Merta"), shares_2010 = c(1000L, 
1500L), shares_2011 = c(1200L, 1100L), share_price_2010 = 8:7, 
    share_price_2011 = c(40L, 12L)), class = "data.frame", row.names = c(NA, 
-2L))

How it works

With this usage, the splice operator takes a named list of expressions. The names of the list become the variable names and the expressions are evaluated in the context of your mutate statement.

> lexprs

$value_2010
shares_2010 * share_price_2010

$value_2011
shares_2011 * share_price_2011

To see how this injection will resolve, we can use rlang::qq_show:

> rlang::qq_show(df %>% mutate(!!! lexprs))
df %>% mutate(value_2010 = shares_2010 * share_price_2010, value_2011 = shares_2011 *
  share_price_2011)
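A possible variation, in case you would rather not parse strings into code and want the years picked up from the data instead of hard-coded: build each expression from symbols with rlang::sym() and rlang::call2(). This is a sketch that assumes every shares_YEAR column has a matching share_price_YEAR column; years and result are names introduced for illustration.

```r
library(dplyr)
library(purrr)
library(rlang)

df <- data.frame(
  company = c("TeslaInc", "Merta"),
  shares_2010 = c(1000L, 1500L),
  shares_2011 = c(1200L, 1100L),
  share_price_2010 = 8:7,
  share_price_2011 = c(40L, 12L)
)

# Years are taken from the column names rather than typed in
years <- sub("^shares_", "", grep("^shares_", names(df), value = TRUE))

# Named list of calls such as shares_2010 * share_price_2010
lexprs <- years %>%
  set_names(paste0("value_", years)) %>%
  map(~ call2("*",
              sym(paste0("shares_", .x)),
              sym(paste0("share_price_", .x))))

result <- df %>%
  mutate(!!!lexprs)
```

Because the expressions are assembled from symbols, a column name with unusual characters cannot be misread as code, which can happen when pasting strings together and parsing them.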

CodePudding user response:

You will indeed probably want your data in a long format. But in case you don't, you can do this:

# thanks Andrea M!
df <- data.frame(
  company=c("TeslaInc", "Merta"),
  shares_2010=c(1000L, 1500L),
  shares_2011=c(1200L, 1100L),
  share_price_2010=8:7,
  share_price_2011=c(40L, 12L)
)

years <- sub('shares_', '', grep('^shares_', names(df), value = TRUE))
for (year in years) {
  df[[paste0('value_', year)]] <- 
    df[[paste0('shares_', year)]] * df[[paste0('share_price_', year)]]
}

If you wanted to avoid the loop (for (...) {...}) you can use this instead:

# single-bracket indexing without the comma keeps a data frame even when there is only one year
sp <- df[paste0('shares_', years)] * df[paste0('share_price_', years)]
names(sp) <- paste0('value_', years)
df <- cbind(df, sp)
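
If this comes up repeatedly, the same base R idea can be wrapped in a small helper. The function below is hypothetical (add_value_columns and its argument names are made up for this sketch); it assumes the columns are named prefix followed by year, and it stops with an error if a price column is missing for some year.

```r
# Hypothetical helper: multiplies each <qty_prefix><year> column by the
# matching <price_prefix><year> column and stores it as <out_prefix><year>.
add_value_columns <- function(df,
                              qty_prefix = "shares_",
                              price_prefix = "share_price_",
                              out_prefix = "value_") {
  qty_cols <- grep(paste0("^", qty_prefix), names(df), value = TRUE)
  years <- sub(paste0("^", qty_prefix), "", qty_cols)
  price_cols <- paste0(price_prefix, years)

  # Fail early if a year has a quantity column but no price column
  missing_cols <- setdiff(price_cols, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing price columns: ", paste(missing_cols, collapse = ", "))
  }

  # Multiply the columns pairwise and add the results as new columns
  df[paste0(out_prefix, years)] <- Map(`*`, df[qty_cols], df[price_cols])
  df
}

df <- data.frame(
  company = c("TeslaInc", "Merta"),
  shares_2010 = c(1000L, 1500L),
  shares_2011 = c(1200L, 1100L),
  share_price_2010 = 8:7,
  share_price_2011 = c(40L, 12L)
)

res <- add_value_columns(df)
```

The helper picks up however many years the data frame contains, so nothing needs to change when a new year is added.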