Home > Software design >  How to collapse only some duplicated columns in R
How to collapse only some duplicated columns in R

Time:11-13

I have a large dataset where I have a problem similar to the one below. I have something like this:

dummydf <- data.frame(country = c("USA", "USA"),
                      state = c("Oregon", "California"),
                      name = c("anne", "paul"),
                      family = c("stevens", "williams"),
                      votes = c(10, 50.2),
                      city = c("london", "berlin"),
                      age = c(10, 50),
                      `name...2` = c("joseph", "vincent"),
                      `family...2` = c("ramos", "williams"),
                      `votes...2` = c(15, 62),
                      `city...2` = c("lisbon", "berlin"),
                      `age...2` = c(77, 43),
                      `name...3` = c("johanna", "paul"),
                      `family...3` = c("santos", "ramos"),
                      `votes...3` = c(.61, 54.2),
                      `city...3` = c("london", "berlin"),
                      `age...3` = c(56, 54),
                      `name...4` = c("sara", "edith"),
                      `family...4` = c("stevens", "sanchez"),
                      `votes...4` = c(2.9, 54.1),
                      `city...4` = c("lisbon", "paris"),
                      `age...4` = c(20, 25),
                      `name...5` = c("thomas", "paul"),
                      `family...5` = c("santos", "ramos"),
                      `votes...5` = c(1.2, 5.2),
                      `city...5` = c("lisbon", "toronto"),
                      `age...5` = c(45, 80))

or maybe this is easier to understand:

country      state name   family votes   city age name...2 family...2 votes...2 city...2 age...2 name...3 family...3 votes...3 city...3 age...3 name...4 family...4 votes...4
1     USA     Oregon anne  stevens  10.0 london  10   joseph      ramos        15   lisbon      77  johanna     santos      0.61   london      56     sara    stevens       2.9
2     USA California paul williams  50.2 berlin  50  vincent   williams        62   berlin      43     paul      ramos     54.20   berlin      54    edith    sanchez      54.1
  city...4 age...4 name...5 family...5 votes...5 city...5 age...5
1   lisbon      20   thomas     santos       1.2   lisbon      45
2    paris      25     paul      ramos       5.2  toronto      80

So the issue here is that all the columns except country and state are multipled 5 times. Ideally I want something like this:

idealdf
      name   family votes    city age      state country
1     anne  stevens 10.00  london  10     Oregon     USA
2     paul williams 50.20  berlin  50 California     USA
3   joseph    ramos 15.00  lisbon  77     Oregon     USA
4  vincent williams 62.00  berlin  43 California     USA
5  johanna   santos  0.61  london  56     Oregon     USA
6     paul    ramos 54.20  berlin  54 California     USA
7     sara  stevens  2.90  lisbon  20     Oregon     USA
8    edith  sanchez 54.10   paris  25 California     USA
9   thomas   santos  1.20  lisbon  45     Oregon     USA
10    paul    ramos  5.20 toronto  80 California     USA

In the original dataset I have it multiple several hundred times and not 5 but the issue is the same. I tried a few things with pivoting and trying to clean the names with regular expressions but nothing really works well. Any pointers would be amazing - thanks!

CodePudding user response:

One way could be with bringing into long form with pivot_longer. With str_remove we remove all ...1 etc.. Finally bring back in wide format using the trick with group_by and row_number():

library(dplyr)
library(tidyr)
library(stringr)

dummydf %>% 
  mutate(across(, as.character)) %>% 
  pivot_longer(-c(country, state)) %>% 
  mutate(name = str_remove(name, '\\...\\d ')) %>% 
  group_by(name) %>% 
  mutate(id = row_number()) %>% 
  pivot_wider(names_from = name, values_from = value) %>% 
  select(-id)
  country state      name    family   votes city    age  
   <chr>   <chr>      <chr>   <chr>    <chr> <chr>   <chr>
 1 USA     Oregon     anne    stevens  10    london  10   
 2 USA     Oregon     joseph  ramos    15    lisbon  77   
 3 USA     Oregon     johanna santos   0.61  london  56   
 4 USA     Oregon     sara    stevens  2.9   lisbon  20   
 5 USA     Oregon     thomas  santos   1.2   lisbon  45   
 6 USA     California paul    williams 50.2  berlin  50   
 7 USA     California vincent williams 62    berlin  43   
 8 USA     California paul    ramos    54.2  berlin  54   
 9 USA     California edith   sanchez  54.1  paris   25   
10 USA     California paul    ramos    5.2   toronto 80  

CodePudding user response:

Base R approach (which keeps original column types):

nam <- sub('...\\d$', '', names(dummydf))
as.data.frame(sapply(unique(nam), function(x) 
  unlist(dummydf[nam==x], use.names=F), simplify=F))

#    country      state    name   family votes    city age
# 1      USA     Oregon    anne  stevens 10.00  london  10
# 2      USA California    paul williams 50.20  berlin  50
# 3      USA     Oregon  joseph    ramos 15.00  lisbon  77
# 4      USA California vincent williams 62.00  berlin  43
# 5      USA     Oregon johanna   santos  0.61  london  56
# 6      USA California    paul    ramos 54.20  berlin  54
# 7      USA     Oregon    sara  stevens  2.90  lisbon  20
# 8      USA California   edith  sanchez 54.10   paris  25
# 9      USA     Oregon  thomas   santos  1.20  lisbon  45
# 10     USA California    paul    ramos  5.20 toronto  80

Using sapply, for each unique column name (after removing the ...n part) the appropriate subset of dummydf is coerced to a vector and the resulting list is again coerced to a data frame. The shorter elements of the list for country and state columns are recycled as necessary. use.names=F in unlist could be omitted but that would result in a warning which I'd rather avoid.

  • Related