I have a dataframe including around 300 columns. These columns are grouped. Here is the example:
id name fixed_2020 fixed_2019 fixed_2018 fixed_2017 fixed_2016 current_2020 current_2019 current_2018 current_2017 current_2016
1 A 2300 2100 2600 2600 1900 3000 3100 3200 3300 3400
I have around 20000 rows in total in the dataframe. And I would like to reshape this dataframe from wide to long in R. I tried to use function melt:
fixed <- melt(mydata, id.vars = c('id', 'name'), measure.vars = 3:7, variable.name = 'year', value.name = 'fixed')
Then I use gsub to get the year column
fixed$year <- as.character(gsub("[^0-9-]", "", debtors$year))
Here is what I want:
id name year fixed current
1 A 2020 2030 3000
2 A 2019 2100 3100
3 A 2018 2600 3200
4 A 2017 2600 3300
5 A 2016 1900 3400
While it does give me what I want, the process is time-consuming when I have more than 300 columns. Since my dataframe is grouped according to the variables and years (10 years for each variable), I wonder whether there are other functions to do it more efficiently.
Thanks a lot in advance!
CodePudding user response:
You can use pivot_longer()
from tidyr
:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(contains("_"), names_to = c(".value", "year"), names_sep = "_") %>%
group_by(name) %>%
mutate(id = 1:n()) %>%
ungroup()
# # A tibble: 5 x 5
# id name year fixed current
# <int> <chr> <chr> <int> <int>
# 1 1 A 2020 2300 3000
# 2 2 A 2019 2100 3100
# 3 3 A 2018 2600 3200
# 4 4 A 2017 2600 3300
# 5 5 A 2016 1900 3400
Data
df <- structure(list(id = 1L, name = "A", fixed_2020 = 2300L, fixed_2019 = 2100L,
fixed_2018 = 2600L, fixed_2017 = 2600L, fixed_2016 = 1900L,
current_2020 = 3000L, current_2019 = 3100L, current_2018 = 3200L,
current_2017 = 3300L, current_2016 = 3400L), class = "data.frame", row.names = c(NA, -1L))
CodePudding user response:
Using data.table
:
library(data.table)
setDT(mydata)
result <- melt(mydata, id=c('id', 'name'),
measure.vars = patterns(fixed='fixed_', current='current_'),
variable.name = 'year')
years <- as.numeric(gsub('. _(\\d )', '\\1', grep('fixed', names(mydata), value = TRUE)))
result[, year:=years[year]]
result[, id:=seq(.N), by=.(name)]
result
## id name year fixed current
## 1: 1 A 2020 2300 3000
## 2: 2 A 2019 2100 3100
## 3: 3 A 2018 2600 3200
## 4: 4 A 2017 2600 3300
## 5: 5 A 2016 1900 3400
This should be very fast but your data-set is not very big tbh.
Note that this assumes the fixed and current columns are in the same order and associated with the same year(s). So if there is a fixed_2020
as the first fixed_*
column, there is also a current_2020
as the first current_*
column, and so on. Otherwise, the year
column will correctly associate with fixed
but not current
CodePudding user response:
A base R option using reshape
transform(
reshape(
df,
direction = "long",
idvar = c("id", "name"),
sep = "_",
varying = -c(1:2)
),
id = seq_along(id)
)
gives
id name time fixed current
1.A.2020 1 A 2020 2300 3000
1.A.2019 2 A 2019 2100 3100
1.A.2018 3 A 2018 2600 3200
1.A.2017 4 A 2017 2600 3300
1.A.2016 5 A 2016 1900 3400