Simultaneously converting columns to long format in R-CodePudding

I was wondering if there might be a way to convert each set of similarly named columns (ratio_..., EL_..., Teacher_...) to long format in R?

I have tried the following solution (using tidyr::pivot_longer()) but I get several NAs in my output.

My suspicion is that these NAs have to do with names_sep = "_". But I simply want to have one column as the key (to be called year) populated with numeric years like 2019, 2020, 2021.

library(tidyverse)

data <- read.csv("https://raw.githubusercontent.com/ilzl/i/master/prac.csv") %>%    
pivot_longer(cols = -id, names_to = c(".value", "year"), names_sep = "_")

CodePudding user response：

You could replace the _ prior to the year with another value, like "X", and use names_sep="X":

data %>% 
  rename_with(~str_replace(.x, pattern="_(?=20\\d{2})","X"), .cols = -id) %>% 
  pivot_longer(-id,names_to=c(".value", "year"), names_sep="X")

Alternatively, you can forgo the renaming of the columns, if you use names_pattern, with a similar regex pattern:

pivot_longer(
  data, 
  -id,
  names_to=c(".value", "year"),
  names_pattern = "(.*(?=20\\d{2}))(.)"
)

CodePudding user response：

In this case you want to split on the last occurrence of _, one way to do this is to use a use a positive lookahead in the names_sep argument:

library(tidyr)

read.csv("https://raw.githubusercontent.com/ilzl/i/master/prac.csv") %>%
  pivot_longer(
    cols = -id,
    names_to = c(".value", "year"),
    names_sep = "_(?=[^_] $)"
  )

# A tibble: 246 × 5
      id year    ratio_EL2FC EL_count Teacher_count
   <int> <chr>         <dbl>    <int>         <int>
 1  1894 2019.20        30.5       61             2
 2  1894 2020.21        32         64             2
 3  1894 2021.22        32.5       65             2
 4  1900 2019.20        24         24             1
 5  1900 2020.21        23         23             1
 6  1900 2021.22        17         17             1
 7  1901 2019.20        41.2      535            13
 8  1901 2020.21        36.8      552            15
 9  1901 2021.22        38.5      577            15
10  1923 2019.20        56.2      225             4
# … with 236 more rows
# ℹ Use `print(n = ...)` to see more rows