Home > front end >  How to use a loop to create panel data by subsetting and merging a lot of different data frames in R
How to use a loop to create panel data by subsetting and merging a lot of different data frames in R

Time:05-05

I've looked around but I can't find an answer to this!

I've imported a large number of datasets to R.

Each dataset contains information for a single year (ex. df_2012, df_2013, df_2014 etc).

All the datasets have the same variables/columns (ex. varA_2012 in df_2012 corresponds to varA_2013 in df_2013).

I want to create a df with my id variable and varA_2012, varB_2012, varA_2013, varB_2013, varA_2014, varB_2014 etc

I'm trying to create a loop that helps me extract the few columns that I'm interested in (varA_XXXX, varB_XXXX) in each data frame and then do a full join based on my id var.

I haven't used R in a very long time...

So far, I've tried this:

id <- c("France", "Belgium", "Spain")

varA_2012 <- c(1,2,3)
varB_2012 <- c(7,2,9)
varC_2012 <- c(1,56,0)
varD_2012 <- c(13,55,8)

varA_2013 <- c(34,3,56)
varB_2013 <- c(2,53,5)
varC_2013 <- c(24,3,45)
varD_2013 <- c(27,13,8)

varA_2014 <- c(9,10,5)
varB_2014 <- c(95,30,75)
varC_2014 <- c(99,0,51)
varD_2014 <- c(9,40,1)


df_2012 <-data.frame(id, varA_2012, varB_2012, varC_2012, varD_2012)

df_2013 <-data.frame(id, varA_2013, varB_2013, varC_2013, varD_2013)

df_2014 <-data.frame(id, varA_2014, varB_2014, varC_2014, varD_2014)

year = c(2012:2014)

for(i in 1:length(year)) { 

df_[i] <- df_[I][df_[i]$id, df_[i]$varA_[i], df_[i]$varB_[i], ]

  list2env(df_[i], .GlobalEnv)
}

panel_df <- Reduce(function(x, y) merge(x, y, by="if"), list(df_2012, df_2013, df_2014))

I know that there are probably loads of errors in here.

CodePudding user response:

Here are a couple of options; however, it's unclear what you want the expected output to look like.

If you want a wide format, then we can use tidyverse to do:

library(tidyverse)

results <-
  map(list(df_2012, df_2013, df_2014), function(x)
    x %>% dplyr::select(id, starts_with("varA"), starts_with("varB"))) %>%
  reduce(., function(x, y)
    left_join(x, y, all = TRUE, by = "id"))

Output

       id varA_2012 varB_2012 varA_2013 varB_2013 varA_2014 varB_2014
1 Belgium         2         2         3        53        10        30
2  France         1         7        34         2         9        95
3   Spain         3         9        56         5         5        75

However, if you need it in a long format, then we could pivot the data:

results %>%
  pivot_longer(-id, names_to = c("variable", "year"), names_sep = "_")

Output

   id      variable year  value
   <chr>   <chr>    <chr> <dbl>
 1 France  varA     2012      1
 2 France  varB     2012      7
 3 France  varA     2013     34
 4 France  varB     2013      2
 5 France  varA     2014      9
 6 France  varB     2014     95
 7 Belgium varA     2012      2
 8 Belgium varB     2012      2
 9 Belgium varA     2013      3
10 Belgium varB     2013     53
11 Belgium varA     2014     10
12 Belgium varB     2014     30
13 Spain   varA     2012      3
14 Spain   varB     2012      9
15 Spain   varA     2013     56
16 Spain   varB     2013      5
17 Spain   varA     2014      5
18 Spain   varB     2014     75

Or if using base R for the wide format, then we can do:

results <-
  lapply(list(df_2012, df_2013, df_2014), function(x)
    subset(x, select = c("id", names(x)[startsWith(names(x), "varA")], names(x)[startsWith(names(x), "varB")])))

results <-
  Reduce(function(x, y)
    merge(x, y, all = TRUE, by = "id"), results)

CodePudding user response:

From your initial for loop attempt, it seems the code below may help

> (df <- Reduce(merge, list(df_2012, df_2013, df_2014)))[grepl("^(id|var(A|B))",names(df))]
       id varA_2012 varB_2012 varA_2013 varB_2013 varA_2014 varB_2014
1 Belgium         2         2         3        53        10        30
2  France         1         7        34         2         9        95
3   Spain         3         9        56         5         5        75
  • Related