Home > Enterprise >  Shaping a dataframe that contains a column with both the categories and subcategories in R
Shaping a dataframe that contains a column with both the categories and subcategories in R

Time:07-29

I really need your help with processing this dataframe. Let's assume we have a dataframe like the one below.

category   value

A           NA
high        20
mid         18
low         10
B           NA
high        22
mid         12
low         11
C           NA
low         14

As you can see, there are 3 categories A, B, and with possible subcategories high, mid, low. The category names have NA value because only each category's subcategory has a value, not the category itself. What makes this complicated is the fact that category C has only a single subcategory, which is low. The format I would like to convert it to, looks something like this:

category   subcategory   value

A              high       20
A              mid        18
A              low        10
B              high       22
B              mid        12
B              low        11
C              low        14

I would really appreciate if someone could help me with this. I have been trying to figure it out for a long while. Thanks!

CodePudding user response:

You can define the sub-category values (cat_value) and bring them in a separate column (subcategory). Change the sub-category values to NA and use tidyr::fill to fill them with category values.

library(dplyr)

cat_value <- c('high', 'mid', 'low')

df %>%
  mutate(subcategory = replace(category, !category %in% cat_value, NA) ,
         category = replace(category, category %in% cat_value, NA)) %>%
  tidyr::fill(category) %>%
  filter(!is.na(subcategory)) %>%
  select(category, subcategory, value)

#  category subcategory value
#1        A        high    20
#2        A         mid    18
#3        A         low    10
#4        B        high    22
#5        B         mid    12
#6        B         low    11
#7        C         low    14

data

df <- structure(list(category = c("A", "high", "mid", "low", "B", "high", 
"mid", "low", "C", "low"), value = c(NA, 20L, 18L, 10L, NA, 22L, 
12L, 11L, NA, 14L)), class = "data.frame", row.names = c(NA, -10L))
  • Related