Home > database >  dplyr relative frequency within group
dplyr relative frequency within group

Time:07-02

I have asked farmers of a specific farmtype (organic and conventional) that I asked for a report on species (a,b,c,d) occur (0/1) on their land.

And now I want to calculate the proportion of organic VS conventional farmers that reported any given species


df<-data.frame(farmtype=sample(c("organic","conventional"),100, replace=T),
               species=sample(letters[1:4], 100, replace=T),
               occ=sample(c("yes","no"),100, replace=T))


df %>%
  group_by(species, farmtype) %>%
  dplyr::summarise (count=n()) %>%
  group_by(species) %>%
  dplyr::mutate(prop=count/sum(count))

ggplot(df[df$occ=="yes",]) 
  geom_bar(aes(x=species, fill=farmtype))

### this looks like what I wanted at fist, but I don't want to show the proportion of organic to conventional (i.e. red/(red blue)) per species

table(df$species, df$farmtype)

### what I want is the percentage each species was recorded within organic vs. conventional farms for species a in the table that would be 12/(12 17 14 12)=21.8%

Obviously, my real dataset is more complex, so I would need a pipe-solution! Any ideas what I am missing?

CodePudding user response:

janitor's tabyl is your friend. What you're calculating is "row"-percentages, but what you want is "col"-percentages. E.g.

set.seed(1234)
df <- data.frame(farmtype=sample(c("organic","conventional"),100, replace=T),
                 species=sample(letters[1:4], 100, replace=T),
                 occ=sample(c("yes","no"),100, replace=T))

df |>
  tabyl(species,farmtype) |>
  adorn_percentages("col")

# species conventional   organic
#       a    0.2553191 0.2641509
#       b    0.2765957 0.2452830
#       c    0.2553191 0.1886792
#       d    0.2127660 0.3018868

But you could also use your own approach. Group by farmtype in the second group_by and remember to save the dataframe. This would be easier to use with ggplot2 as it is already in a long format.

df <-
  df %>%
  group_by(species, farmtype) %>%
  dplyr::summarise(count = n()) %>%
  group_by(farmtype) %>%
  dplyr::mutate(prop = count/sum(count))

df

# A tibble: 8 × 4
# Groups:   farmtype [2]
#  species farmtype     count  prop
#  <chr>   <chr>        <int> <dbl>
#  a       conventional    12 0.255
#  a       organic         14 0.264
#  b       conventional    13 0.277
#  b       organic         13 0.245
#  c       conventional    12 0.255
#  c       organic         10 0.189
#  d       conventional    10 0.213
#  d       organic         16 0.302

df %>%
  ggplot(aes(x = species, y = prop, fill = farmtype))  
  geom_col()

Update: A variant of second option also suggested by Isaac Bravo.

CodePudding user response:

Here you can have another option using your approach:

df %>%
  group_by(farmtype, species) %>%
  summarize(n = n()) %>% 
  mutate(percentage = n/sum(n)) 

OUTPUT:

  farmtype     species     n percentage
  <chr>        <chr>   <int>      <dbl>
1 conventional a          12      0.235
2 conventional b          12      0.235
3 conventional c          12      0.235
4 conventional d          15      0.294
5 organic      a          16      0.327
6 organic      b           9      0.184
7 organic      c          14      0.286
8 organic      d          10      0.204

CodePudding user response:

If I understand the poster's question correctly, the poster seeks the proportion of organic versus conventional farm types among farms that grew a given species. This can also be accomplished using the data.table package as follows.

First, the example data set is recreated by setting the seed.

set.seed(1234) ##setting seed for reproducible example
df<-data.frame(farmtype=sample(c("organic","conventional"),100, replace=T),
                species=sample(letters[1:4], 100, replace=T),
                occ=sample(c("yes","no"),100, replace=T))
 
require(data.table)
df = data.table(df)
 

Next, the "no" answers are filtered out because we are only interested in farms that reported growing the species in the "occur" column. We then count the occurrences of the species for each farm type. The column "N" gives the count.

#Filter out "no" answers because they shouldn't affect the result sought
#and count the number of farmtypes that reported each species
ans = df[occ == "yes",.N,by = .(farmtype,species)]
ans
#       farmtype species  N
#1: conventional       a  8
#2: conventional       c  8
#3:      organic       a  6
#4: conventional       d 11
#5:      organic       d  5
#6:      organic       c  7
#7:      organic       b  4
#8: conventional       b  6
 

The total occurrences of each species for either farm type are then counted. As a check for this result, each row for a given species should give the same species total.

#Total number of farms that reported the species
ans[,species_total :=  sum(N), by = species]  #
ans
#       farmtype species  N species_total
#1: conventional       a  8            14
#2: conventional       c  8            15
#3:      organic       a  6            14
#4: conventional       d 11            16
#5:      organic       d  5            16
#6:      organic       c  7            15
#7:      organic       b  4            10
#8: conventional       b  6            10
 

Finally, the columns are combined to calculate the proportion of organic or conventional farms for each species that was reported. As a check against the result, the proportion of organic and the proportion of conventional for each species should sum to 1 because there are only two farm types.

##Calculate the proportion of each farm type reported for each species
ans[, proportion := N/species_total]
ans
#      farmtype species  N species_total proportion
#1: conventional       a  8            14  0.5714286
#2: conventional       c  8            15  0.5333333
#3:      organic       a  6            14  0.4285714
#4: conventional       d 11            16  0.6875000
#5:      organic       d  5            16  0.3125000
#6:      organic       c  7            15  0.4666667
#7:      organic       b  4            10  0.4000000
#8: conventional       b  6            10  0.6000000

##Gives the proportion of organic farms specifically
ans[farmtype == "organic"]
#   farmtype species N species_total proportion
#1:  organic       a 6            14  0.4285714
#2:  organic       d 5            16  0.3125000
#3:  organic       c 7            15  0.4666667
#4:  organic       b 4            10  0.4000000

This code would need to be altered if the poster were interested in all farms that were asked about a given species regardless of whether they grew it.

  • Related