Home > Enterprise >  How do I reshape an R dataframe from long to wide without using the long variable values as the new
How do I reshape an R dataframe from long to wide without using the long variable values as the new

Time:08-26

I have a long dataframe that lists the top 3 employers of each occupation code (3 rows per occupation code). It looks like this.

occcode employer
1 top employer for occcode1
1 2nd employer for occcode 1
1 3rd employer for occcode 1
2 top employer for occcode2
2 2nd employer for occcode 2
2 3rd employer for occcode 1

I want to reshape it so that I have one row per occupation code, and columns named "emp1", "emp2", and "emp3" that are respectively populated with the 1st-3rd employers of that occupation code.

occcode employer1 employer2 employer3
1 top employer for occcode1 2nd employer for occcode 1 3rd employer for occcode 1
2 top employer for occcode2 2nd employer for occode2 3rd employer for occcode 1

I previously thought using the spread() function would work. But reading the documentation and testing it out, it doesn't produce what I have in mind because it requires that the values in "employer" in the long version of the data be standardized (such that there are only 3 employer names); that's not the case because employer names vary a lot across occupation codes. What is the best way to do reshape the data in line with what I need?

CodePudding user response:

I removed the last row of source data to show that this should work for variable numbers of employers per occcode:

library(tidyverse)      
data.frame(
  stringsAsFactors = FALSE,
           occcode = c(1L, 1L, 1L, 2L, 2L),
          employer = c("top employer for occcode1",
                       "2nd employer for occcode 1","3rd employer for occcode 1",
                       "top employer for occcode2",
                       "2nd employer for occcode 2")
) %>%
  
  group_by(occcode) %>%
  mutate(col = paste0("employer", row_number())) %>%
  ungroup() %>%
  pivot_wider(names_from = col, values_from = employer)

Result

# A tibble: 2 × 4
  occcode employer1                 employer2                  employer3                 
    <int> <chr>                     <chr>                      <chr>                     
1       1 top employer for occcode1 2nd employer for occcode 1 3rd employer for occcode 1
2       2 top employer for occcode2 2nd employer for occcode 2 NA  

CodePudding user response:

Here is another approach:

library(data.table)
dcast(
  setDT(df)[, emp:={emp=substr(employer,1,1);emp=paste0("employer",fifelse(emp=="t","1",emp))}],
  occcode~emp, value.var="employer"
)

Output:

   occcode                 employer1                  employer2                  employer3
1:       1 top employer for occcode1 2nd employer for occcode 1 3rd employer for occcode 1
2:       2 top employer for occcode2 2nd employer for occcode 2 3rd employer for occcode 2

Input:

structure(list(occcode = c(1L, 1L, 1L, 2L, 2L, 2L), employer = c("top employer for occcode1", 
"2nd employer for occcode 1", "3rd employer for occcode 1", "top employer for occcode2", 
"2nd employer for occcode 2", "3rd employer for occcode 2")), row.names = c(NA, 
-6L), class = "data.frame")
  • Related