Home > OS >  How to go from wide to long in R with different patterns
How to go from wide to long in R with different patterns

Time:11-29

I have a dataframe like the following:

df_mock <-structure(list(`Company name Latin alphabet` = c("A", "B", "C"
), `NACE Rev. 2, core code (4 digits)` = c("2014", "1041", "2910"
), `Number of employees
2022` = c(NA_real_, NA_real_, NA_real_
), `Number of employees
2021` = c(3433, 779, 5744), `Number of employees
2020` = c(3342, 
737, 5664), `Number of employees
2019` = c(3228, 693, 5528), 
    `Number of employees
2018` = c(3109, 665, 5284), `Number of employees
2017` = c(3021, 
    661, 5242), `Number of employees
2016` = c(3007, 519, 5100
    ), `Operating revenue (Turnover)
th EUR 2022` = c(NA_real_, 
    NA_real_, NA_real_), `Operating revenue (Turnover)
th EUR 2021` = c(7529599, 
    5414117, 5170430), `Operating revenue (Turnover)
th EUR 2020` = c(4374804, 
    4187030, 5074135), `Operating revenue (Turnover)
th EUR 2019` = c(5214974, 
    3717059, 5044307), `Operating revenue (Turnover)
th EUR 2018` = c(6827041, 
    3133256, 4473499), `Operating revenue (Turnover)
th EUR 2017` = c(6437798, 
    2920410, 4891161), `Operating revenue (Turnover)
th EUR 2016` = c(5294858, 
    2944949, 5078559), region = c("X", "X", "X"), unique_id = c("1", 
    "2", "3")), row.names = c("1:1", "1:2", "1:3"), class = "data.frame")

I want to go from wide to long, keeping several columns and converting different others. In particular, I want to keep the Company name, NACE, region, and unique_id. I want to melt into this structure the rest of variables into 3 different ones (year, employment, and revenue) to have the following desired dataset:

df_desired <- structure(list(Company_name_Latin_alphabet = c("A", "A", "A", 
"A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B", "C", "C", 
"C", "C", "C", "C", "C"), NACE_Rev_2_core_code_4_digits = c("2014", 
"2014", "2014", "2014", "2014", "2014", "2014", "1041", "1041", 
"1041", "1041", "1041", "1041", "1041", "2910", "2910", "2910", 
"2910", "2910", "2910", "2910"), region = c("X", "X", "X", "X", 
"X", "X", "X", "X", "X", "X", "X", "X", "X", "X", "X", "X", "X", 
"X", "X", "X", "X"), unique_id = c("1", "1", "1", "1", "1", "1", 
"1", "2", "2", "2", "2", "2", "2", "2", "3", "3", "3", "3", "3", 
"3", "3"), year = c("2022", "2021", "2020", "2019", "2018", "2017", 
"2016", "2022", "2021", "2020", "2019", "2018", "2017", "2016", 
"2022", "2021", "2020", "2019", "2018", "2017", "2016"), employment = c("NA", 
"3433", "3342", "3228", "3109", "3021", "3007", "NA", "779", 
"737", "693", "665", "661", "519", "NA", "5744", "5664", "5528", 
"5284", "5242", "5100"), revenue = c("NA", "7529599", "4374804", 
"5214974", "6827041", "6437798", "5294858", "NA", "5414117", 
"4187030", "3717059", "3133256", "2920410", "2944949", "NA", 
"5170430", "5074135", "5044307", "4473499", "4891161", "5078559"
)), class = "data.frame", row.names = c(NA, -21L))

I have tried with reshape function without but I haven't reach out the desired outcome:

reshape(belgium_orbis, direction='long', 
        varying=c('Number of employees\n2022', 
        'Number of employees\n2021', 
        'Number of employees\n2020', 
        'Number of employees\n2019', 
        'Number of employees\n2018', 
        'Number of employees\n2017', 
        'Number of employees\n2016', 
        'Operating revenue (Turnover)\nth EUR 2022', 
        'Operating revenue (Turnover)\nth EUR 2021', 
        'Operating revenue (Turnover)\nth EUR 2020', 
        'Operating revenue (Turnover)\nth EUR 2019', 
        'Operating revenue (Turnover)\nth EUR 2018', 
        'Operating revenue (Turnover)\nth EUR 2017', 
        'Operating revenue (Turnover)\nth EUR 2016'), 
        timevar='Year',
        times=c('2022', '2021', '2020', '2019', '2018', '2017', '2016'),
        v.names=c('employment', 'revenue'),
        idvar='id')

Any suggestion?

CodePudding user response:

We could use pivot_longer - specify the columns to reshape by matching the substring 'revenue', 'employees' in column names with matches. In names_to pass a vector of .value (value part of the column), and year (values of that column will be the substring of column name). As we pass a length of 2 in names_to, we need to capture two substring from the column names by matching the employees|revenue as a group and then last 4 digits (\\d{4}) before the end ($) of the string.

library(tidyr)
pivot_longer(df_mock, cols = matches('revenue|employees'),
    names_to = c(".value", "year"), 
    names_pattern = ".*(employees|revenue)\\D (\\d{4})$")
  • Related