I was wondering if there might be a way to convert each set of similarly named columns (ratio_...
, EL_...
, Teacher_...
) to long format in R?
I have tried the following solution (using tidyr::pivot_longer()
) but I get several NA
s in my output.
My suspicion is that these NA
s have to do with names_sep = "_"
. But I simply want to have one column as the key (to be called year) populated with numeric years like 2019, 2020, 2021.
library(tidyverse)
data <- read.csv("https://raw.githubusercontent.com/ilzl/i/master/prac.csv") %>%
pivot_longer(cols = -id, names_to = c(".value", "year"), names_sep = "_")
CodePudding user response:
You could replace the _
prior to the year with another value, like "X", and use names_sep="X"
:
data %>%
rename_with(~str_replace(.x, pattern="_(?=20\\d{2})","X"), .cols = -id) %>%
pivot_longer(-id,names_to=c(".value", "year"), names_sep="X")
Alternatively, you can forgo the renaming of the columns, if you use names_pattern
, with a similar regex pattern:
pivot_longer(
data,
-id,
names_to=c(".value", "year"),
names_pattern = "(.*(?=20\\d{2}))(.)"
)
CodePudding user response:
In this case you want to split on the last occurrence of _
, one way to do this is to use a use a positive lookahead in the names_sep
argument:
library(tidyr)
read.csv("https://raw.githubusercontent.com/ilzl/i/master/prac.csv") %>%
pivot_longer(
cols = -id,
names_to = c(".value", "year"),
names_sep = "_(?=[^_] $)"
)
# A tibble: 246 × 5
id year ratio_EL2FC EL_count Teacher_count
<int> <chr> <dbl> <int> <int>
1 1894 2019.20 30.5 61 2
2 1894 2020.21 32 64 2
3 1894 2021.22 32.5 65 2
4 1900 2019.20 24 24 1
5 1900 2020.21 23 23 1
6 1900 2021.22 17 17 1
7 1901 2019.20 41.2 535 13
8 1901 2020.21 36.8 552 15
9 1901 2021.22 38.5 577 15
10 1923 2019.20 56.2 225 4
# … with 236 more rows
# ℹ Use `print(n = ...)` to see more rows