How to tapply in dplyr and create a new column

Time:05-26

I'm stuck with dplyr (again!) and trying to solve my problem without dying in the attempt.

The first lines of my df look like this:

df <- structure(list(fecha = c(1990, 1990, 1990, 1990, 1990, 1990, 
1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990), cientifico = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "Argentina sphyraena", class = "factor"), 
    dem_sect = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L), .Label = c("AB", "EP", "FE", "MF", 
    "PA"), class = "factor"), sector = c("EPb", "EPc", "EPc", 
    "EPb", "EPa", "EPa", "EPb", "EPc", "EPb", "EPb", "EPb", "EPb", 
    "EPb", "EPb", "EPa"), md_area = c(3010.44, 665.88, 665.88, 
    3010.44, 1273.65, 1273.65, 3010.44, 665.88, 3010.44, 3010.44, 
    3010.44, 3010.44, 3010.44, 3010.44, 1273.65), md_peso = c(1.42957605985037, 
    1.04499099099099, 1.04499099099099, 1.42957605985037, 1.24025925925926, 
    1.24025925925926, 1.42957605985037, 1.04499099099099, 1.42957605985037, 
    1.42957605985037, 1.42957605985037, 1.42957605985037, 1.42957605985037, 
    1.42957605985037, 1.24025925925926), dummy = c(4303.65295361596, 
    695.838601081081, 695.838601081081, 4303.65295361596, 1579.65620555556, 
    1579.65620555556, 4303.65295361596, 695.838601081081, 4303.65295361596, 
    4303.65295361596, 4303.65295361596, 4303.65295361596, 4303.65295361596, 
    4303.65295361596, 1579.65620555556)), row.names = c(NA, -15L
), class = "data.frame")

I'm trying to "translate" this base R call into dplyr: sumsect <- tapply(md_peso * md_area, as.factor(substr(names(sector), 1, 2)), sum). I've tried many approaches without success. In an attempt to solve the problem I added a column, "dem_sect", holding the result of as.factor(substr(names(sector), 1, 2)), but that did not get me there either.

The desired output is a data frame with a new column, "sumsect", holding the same value in every row: in this case 6579.148, the sum of md_peso * md_area over the three sectors (1579.6562 + 4303.6530 + 695.8386).

    fecha  cientifico          dem_sect sector md_area md_peso  dummy  sumsect
1   1990 Argentina sphyraena       EP    EPb 3010.44 1.429576 4303.6530 6579.148
2   1990 Argentina sphyraena       EP    EPc  665.88 1.044991  695.8386 6579.148
3   1990 Argentina sphyraena       EP    EPc  665.88 1.044991  695.8386 6579.148
4   1990 Argentina sphyraena       EP    EPb 3010.44 1.429576 4303.6530 6579.148
5   1990 Argentina sphyraena       EP    EPa 1273.65 1.240259 1579.6562 6579.148
6   1990 Argentina sphyraena       EP    EPa 1273.65 1.240259 1579.6562 6579.148
7   1990 Argentina sphyraena       EP    EPb 3010.44 1.429576 4303.6530 6579.148
8   1990 Argentina sphyraena       EP    EPc  665.88 1.044991  695.8386 6579.148
9   1990 Argentina sphyraena       EP    EPb 3010.44 1.429576 4303.6530 6579.148
10  1990 Argentina sphyraena       EP    EPb 3010.44 1.429576 4303.6530 6579.148
11  1990 Argentina sphyraena       EP    EPb 3010.44 1.429576 4303.6530 6579.148
12  1990 Argentina sphyraena       EP    EPb 3010.44 1.429576 4303.6530 6579.148
13  1990 Argentina sphyraena       EP    EPb 3010.44 1.429576 4303.6530 6579.148
14  1990 Argentina sphyraena       EP    EPb 3010.44 1.429576 4303.6530 6579.148
15  1990 Argentina sphyraena       EP    EPa 1273.65 1.240259 1579.6562 6579.148

Any hint will be more than welcome. Thanks in advance

CodePudding user response:

You can use mutate and sum the unique values of dummy (this works here because each sector has a different dummy value):

df |> 
  mutate(sumsect = sum(unique(dummy)))

If you would rather compute the value from md_area and md_peso directly, you can use:

df |> 
  mutate(sumsect = sum(unique(md_area * md_peso)))
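One caveat worth illustrating: sum(unique(...)) deduplicates by value, not by sector, so two sectors that happen to share the same product would be counted once. The sketch below uses a hypothetical toy data frame (its values are made up, not taken from the question) and assumes md_area and md_peso are constant within a sector, as in the question's data. It compares the value-based sum with one that keeps one row per sector first.

```r
library(dplyr)

# Hypothetical toy data: sectors EPa and EPb happen to share
# the same md_area * md_peso product (20).
toy <- data.frame(
  sector  = c("EPa", "EPa", "EPb", "EPc"),
  md_area = c(10, 10, 20, 30),
  md_peso = c(2, 2, 1, 1)
)

# Keep one row per sector, then sum the products across sectors.
per_sector <- toy |>
  distinct(sector, md_area, md_peso) |>
  summarise(sumsect = sum(md_area * md_peso))

res <- toy |>
  mutate(
    sumsect_unique = sum(unique(md_area * md_peso)), # EPa and EPb collapse into one value
    sumsect_sector = per_sector$sumsect              # every sector counted once
  )
res
```

On the question's data both columns would agree, since the three sectors have three different products.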

CodePudding user response:

You don't need tapply if you work with dplyr.

library(tidyverse)
df %>% # target dataframe
  cbind( # we will join a value as a new column for every row
    df %>% # work with dataframe df
    group_by(sector) %>% # calculate by sector
    summarise(sumsect = unique(md_area * md_peso)) %>% # one md_area * md_peso value per sector
    ungroup() %>% # remove grouping
    summarise(sumsect = sum(sumsect)) # sum the 3 calculated values
  )

Output:

   fecha          cientifico dem_sect sector md_area  md_peso     dummy  sumsect
1   1990 Argentina sphyraena       EP    EPb 3010.44 1.429576 4303.6530 6579.148
2   1990 Argentina sphyraena       EP    EPc  665.88 1.044991  695.8386 6579.148
3   1990 Argentina sphyraena       EP    EPc  665.88 1.044991  695.8386 6579.148
4   1990 Argentina sphyraena       EP    EPb 3010.44 1.429576 4303.6530 6579.148
5   1990 Argentina sphyraena       EP    EPa 1273.65 1.240259 1579.6562 6579.148
6   1990 Argentina sphyraena       EP    EPa 1273.65 1.240259 1579.6562 6579.148
7   1990 Argentina sphyraena       EP    EPb 3010.44 1.429576 4303.6530 6579.148
8   1990 Argentina sphyraena       EP    EPc  665.88 1.044991  695.8386 6579.148
9   1990 Argentina sphyraena       EP    EPb 3010.44 1.429576 4303.6530 6579.148
10  1990 Argentina sphyraena       EP    EPb 3010.44 1.429576 4303.6530 6579.148
11  1990 Argentina sphyraena       EP    EPb 3010.44 1.429576 4303.6530 6579.148
12  1990 Argentina sphyraena       EP    EPb 3010.44 1.429576 4303.6530 6579.148
13  1990 Argentina sphyraena       EP    EPb 3010.44 1.429576 4303.6530 6579.148
14  1990 Argentina sphyraena       EP    EPb 3010.44 1.429576 4303.6530 6579.148
15  1990 Argentina sphyraena       EP    EPa 1273.65 1.240259 1579.6562 6579.148

Your example has only one fecha and one cientifico. If sumsect should be different for each level of those columns, remember to group by them as well (one of them or both).
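As a rough sketch of that grouped variant, the example below uses a hypothetical two-year data frame (the values are invented; only the column names follow the question). It uses !duplicated(sector) so that each sector contributes once within each fecha/cientifico group, which is a swapped-in alternative to unique() and assumes the product is constant within a sector and group.

```r
library(dplyr)

# Hypothetical data with two years; column names follow the question.
toy <- data.frame(
  fecha      = c(1990, 1990, 1990, 1991, 1991),
  cientifico = "Argentina sphyraena",
  sector     = c("EPa", "EPa", "EPb", "EPa", "EPb"),
  md_area    = c(10, 10, 25, 10, 25),
  md_peso    = c(2, 2, 1, 3, 1)
)

res <- toy %>%
  group_by(fecha, cientifico) %>%
  # Within each group, take the product from the first row of each sector.
  mutate(sumsect = sum((md_area * md_peso)[!duplicated(sector)])) %>%
  ungroup()
res
```

Because mutate() is used instead of summarise(), every original row is kept and simply receives the total of its own group.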

CodePudding user response:

Update: As the answer by @Jahi Zamy shows, this also works without any grouping. Grouping is still useful if the real data set contains different groups that need separate results:

df %>% 
  mutate(sumsect = sum(unique( md_peso * md_area)))

First answer: You can do it this way with dplyr. The trick is to use group_by(), then ungroup(), and sum the unique values. If you want the sum for specific groups, replace ungroup() with group_by() on the desired columns:

df %>% 
  group_by(sector) %>% 
  mutate(y = md_peso * md_area) %>% 
  ungroup() %>% 
  mutate(sumsect = sum(unique(y)), .keep = "unused")

Output:

   fecha cientifico          dem_sect sector md_area md_peso dummy sumsect
   <dbl> <fct>               <fct>    <chr>    <dbl>   <dbl> <dbl>   <dbl>
 1  1990 Argentina sphyraena EP       EPb      3010.    1.43 4304.   6579.
 2  1990 Argentina sphyraena EP       EPc       666.    1.04  696.   6579.
 3  1990 Argentina sphyraena EP       EPc       666.    1.04  696.   6579.
 4  1990 Argentina sphyraena EP       EPb      3010.    1.43 4304.   6579.
 5  1990 Argentina sphyraena EP       EPa      1274.    1.24 1580.   6579.
 6  1990 Argentina sphyraena EP       EPa      1274.    1.24 1580.   6579.
 7  1990 Argentina sphyraena EP       EPb      3010.    1.43 4304.   6579.
 8  1990 Argentina sphyraena EP       EPc       666.    1.04  696.   6579.
 9  1990 Argentina sphyraena EP       EPb      3010.    1.43 4304.   6579.
10  1990 Argentina sphyraena EP       EPb      3010.    1.43 4304.   6579.
11  1990 Argentina sphyraena EP       EPb      3010.    1.43 4304.   6579.
12  1990 Argentina sphyraena EP       EPb      3010.    1.43 4304.   6579.
13  1990 Argentina sphyraena EP       EPb      3010.    1.43 4304.   6579.
14  1990 Argentina sphyraena EP       EPb      3010.    1.43 4304.   6579.
15  1990 Argentina sphyraena EP       EPa      1274.    1.24 1580.   6579.
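For a translation that stays closer to the original tapply call (a sum per two-letter sector prefix), one possible sketch is to compute the per-prefix sums on one row per sector and join them back. The data frame df_small below is a five-row subset of the question's values, and the names df_small, sums and res are only illustrative.

```r
library(dplyr)

# Five rows taken from the question's data (sector, md_area, md_peso only).
df_small <- data.frame(
  sector  = c("EPb", "EPc", "EPc", "EPb", "EPa"),
  md_area = c(3010.44, 665.88, 665.88, 3010.44, 1273.65),
  md_peso = c(1.42957605985037, 1.04499099099099, 1.04499099099099,
              1.42957605985037, 1.24025925925926)
)

# Sum per sector prefix, counting each sector once
# (the role tapply played in the original code).
sums <- df_small %>%
  distinct(sector, md_area, md_peso) %>%
  group_by(dem_sect = substr(sector, 1, 2)) %>%
  summarise(sumsect = sum(md_peso * md_area), .groups = "drop")

# Attach the per-prefix sum to every original row.
res <- df_small %>%
  mutate(dem_sect = substr(sector, 1, 2)) %>%
  left_join(sums, by = "dem_sect")
res
```

With these values every row has the prefix "EP", so sumsect is about 6579.148 throughout, matching the desired output in the question.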