Home > Enterprise >  Add share column based on year and other categorical variable to dataframe in R
Add share column based on year and other categorical variable to dataframe in R

Time:04-20

I have sales data by year, condition and products

Year <- c(2010,2010,2010,2010,2010,2010,2011,2011,2011,2011,2011,2011,2012,2012,2012,2012,2012,2012)
Sale <- c("30","45","23","33","24","11","56","19","45","56","33","32","89","33","12",18,10,17)
Condition <- c("New","New","New","Used","Used","Used","New","New","New","Used","Used","Used","New","New","New","Used","Used","Used")
Product <- c("a","b","c","a","b","c","a","b","c","a","b","c","a","b","c","a","b","c")
df <- data.frame(Year,Condition, Product, Sale)

Now I want to calculate the share of each product by condition variable within each year. I tried the following code, but it calculates based on total no by year and "condition"

df$percentage <- df$Sale/sum(df$Sale)*100

CodePudding user response:

First convert Sale from character to numeric with type.convert(as.is = TRUE),

then group by the desired columns and apply summarise:

Note that in your provided dataframe you will get 100 for percentage because of your provided data:

With this fake data

set.seed(123)
Year <- sample(c(2010, 2011, 2012), 18, replace = TRUE)
Sale <- c("30","45","23","33","24","11","56","19","45","56","33","32","89","33","12",18,10,17)
Condition <- sample(c("Used","New"), 18, replace = TRUE)
Product <- sample(c("a","b","c"), 18, replace = TRUE)
df <- data.frame(Year,Condition, Product, Sale)

using this code

library(dplyr)
df %>% 
  type.convert(as.is=TRUE) %>% 
  group_by(Year, Product, Condition) %>% 
  summarise(percentage = Sale/sum(Sale)*100)

you will get:

    Year Product Condition percentage
   <int> <chr>   <chr>          <dbl>
 1  2010 a       Used            83.2
 2  2010 a       Used            16.8
 3  2010 c       New            100  
 4  2011 a       New            100  
 5  2011 a       Used            42.9
 6  2011 a       Used            14.3
 7  2011 a       Used            42.9
 8  2011 b       New            100  
 9  2011 c       New             49.2
10  2011 c       New             50.8
11  2012 a       Used            63.8
12  2012 a       Used            36.2
13  2012 b       New            100  
14  2012 b       Used            69.7
15  2012 b       Used            30.3
16  2012 c       New            100  
17  2012 c       Used            34.8
18  2012 c       Used            65.2

Update: to keep Sale column: replace summarise with mutate

df %>% 
  type.convert(as.is=TRUE) %>% 
  group_by(Year, Product, Condition) %>% 
  mutate(percentage = paste(round(Sale/sum(Sale)*100, 1), "%"))
    Year Condition Product  Sale percentage
   <int> <chr>     <chr>   <int> <chr>     
 1  2012 Used      a          30 63.8 %    
 2  2012 New       c          45 100 %     
 3  2012 Used      b          23 69.7 %    
 4  2011 Used      a          33 42.9 %    
 5  2012 Used      c          24 34.8 %    
 6  2011 Used      a          11 14.3 %    
 7  2011 New       a          56 100 %     
 8  2011 New       b          19 100 %     
 9  2012 Used      c          45 65.2 %    
10  2010 New       c          56 100 %     
11  2011 Used      a          33 42.9 %    
12  2011 New       c          32 49.2 %    
13  2010 Used      a          89 83.2 %    
14  2011 New       c          33 50.8 %    
15  2012 New       b          12 100 %     
16  2010 Used      a          18 16.8 %    
17  2012 Used      b          10 30.3 %    
18  2012 Used      a          17 36.2 % 

CodePudding user response:

Here is a base solution using ave(). You can replace grouping variables in ave with any others you want.

within(df, {
  perc1 = ave(as.numeric(Sale), Year, Product, FUN = proportions) * 100
  perc2 = sprintf("%.1f %%", perc1)
})

   Year Condition Product Sale  perc2     perc1
1  2010       New       a   30 47.6 %  47.61905
2  2010       New       b   45 65.2 %  65.21739
3  2010       New       c   23 67.6 %  67.64706
4  2010      Used       a   33 52.4 %  52.38095
5  2010      Used       b   24 34.8 %  34.78261
6  2010      Used       c   11 32.4 %  32.35294
7  2011       New       a   56 50.0 %  50.00000
8  2011       New       b   19 36.5 %  36.53846
9  2011       New       c   45 58.4 %  58.44156
10 2011      Used       a   56 50.0 %  50.00000
11 2011      Used       b   33 63.5 %  63.46154
12 2011      Used       c   32 41.6 %  41.55844
  • Related