Home > front end >  reshape dataframe from wide to long in R
reshape dataframe from wide to long in R

Time:05-06

I have a dataframe including around 300 columns. These columns are grouped. Here is the example:

id  name fixed_2020 fixed_2019 fixed_2018 fixed_2017 fixed_2016 current_2020  current_2019 current_2018 current_2017 current_2016
 1     A       2300       2100       2600       2600       1900         3000          3100         3200         3300         3400

I have around 20000 rows in total in the dataframe. And I would like to reshape this dataframe from wide to long in R. I tried to use function melt:

fixed <- melt(mydata, id.vars = c('id', 'name'), measure.vars = 3:7, variable.name = 'year', value.name = 'fixed')

Then I use gsub to get the year column

fixed$year <- as.character(gsub("[^0-9-]", "", debtors$year))

Here is what I want:

id   name  year  fixed  current
 1   A     2020  2030    3000
 2   A     2019  2100    3100
 3   A     2018  2600    3200
 4   A     2017  2600    3300
 5   A     2016  1900    3400

While it does give me what I want, the process is time-consuming when I have more than 300 columns. Since my dataframe is grouped according to the variables and years (10 years for each variable), I wonder whether there are other functions to do it more efficiently.

Thanks a lot in advance!

CodePudding user response:

You can use pivot_longer() from tidyr:

library(dplyr)
library(tidyr)

df %>%
  pivot_longer(contains("_"), names_to = c(".value", "year"), names_sep = "_") %>%
  group_by(name) %>%
  mutate(id = 1:n()) %>%
  ungroup()

# # A tibble: 5 x 5
#      id name  year  fixed current
#   <int> <chr> <chr> <int>   <int>
# 1     1 A     2020   2300    3000
# 2     2 A     2019   2100    3100
# 3     3 A     2018   2600    3200
# 4     4 A     2017   2600    3300
# 5     5 A     2016   1900    3400

Data
df <- structure(list(id = 1L, name = "A", fixed_2020 = 2300L, fixed_2019 = 2100L, 
    fixed_2018 = 2600L, fixed_2017 = 2600L, fixed_2016 = 1900L,
    current_2020 = 3000L, current_2019 = 3100L, current_2018 = 3200L,
    current_2017 = 3300L, current_2016 = 3400L), class = "data.frame", row.names = c(NA, -1L))

CodePudding user response:

Using data.table:

library(data.table)
setDT(mydata)
result <- melt(mydata, id=c('id', 'name'), 
                 measure.vars = patterns(fixed='fixed_', current='current_'), 
                 variable.name = 'year')
years <- as.numeric(gsub('. _(\\d )', '\\1', grep('fixed', names(mydata), value = TRUE)))
result[, year:=years[year]]
result[, id:=seq(.N), by=.(name)]
result
##    id name year fixed current
## 1:  1    A 2020  2300    3000
## 2:  2    A 2019  2100    3100
## 3:  3    A 2018  2600    3200
## 4:  4    A 2017  2600    3300
## 5:  5    A 2016  1900    3400

This should be very fast but your data-set is not very big tbh.

Note that this assumes the fixed and current columns are in the same order and associated with the same year(s). So if there is a fixed_2020 as the first fixed_* column, there is also a current_2020 as the first current_* column, and so on. Otherwise, the year column will correctly associate with fixed but not current

CodePudding user response:

A base R option using reshape

transform(
    reshape(
        df,
        direction = "long",
        idvar = c("id", "name"),
        sep = "_",
        varying = -c(1:2)
    ),
    id = seq_along(id)
)

gives

         id name time fixed current
1.A.2020  1    A 2020  2300    3000
1.A.2019  2    A 2019  2100    3100
1.A.2018  3    A 2018  2600    3200
1.A.2017  4    A 2017  2600    3300
1.A.2016  5    A 2016  1900    3400
  • Related