I have a large dataset where I have a problem similar to the one below. I have something like this:
dummydf <- data.frame(country = c("USA", "USA"),
state = c("Oregon", "California"),
name = c("anne", "paul"),
family = c("stevens", "williams"),
votes = c(10, 50.2),
city = c("london", "berlin"),
age = c(10, 50),
`name...2` = c("joseph", "vincent"),
`family...2` = c("ramos", "williams"),
`votes...2` = c(15, 62),
`city...2` = c("lisbon", "berlin"),
`age...2` = c(77, 43),
`name...3` = c("johanna", "paul"),
`family...3` = c("santos", "ramos"),
`votes...3` = c(.61, 54.2),
`city...3` = c("london", "berlin"),
`age...3` = c(56, 54),
`name...4` = c("sara", "edith"),
`family...4` = c("stevens", "sanchez"),
`votes...4` = c(2.9, 54.1),
`city...4` = c("lisbon", "paris"),
`age...4` = c(20, 25),
`name...5` = c("thomas", "paul"),
`family...5` = c("santos", "ramos"),
`votes...5` = c(1.2, 5.2),
`city...5` = c("lisbon", "toronto"),
`age...5` = c(45, 80))
or maybe this is easier to understand:
country state name family votes city age name...2 family...2 votes...2 city...2 age...2 name...3 family...3 votes...3 city...3 age...3 name...4 family...4 votes...4
1 USA Oregon anne stevens 10.0 london 10 joseph ramos 15 lisbon 77 johanna santos 0.61 london 56 sara stevens 2.9
2 USA California paul williams 50.2 berlin 50 vincent williams 62 berlin 43 paul ramos 54.20 berlin 54 edith sanchez 54.1
city...4 age...4 name...5 family...5 votes...5 city...5 age...5
1 lisbon 20 thomas santos 1.2 lisbon 45
2 paris 25 paul ramos 5.2 toronto 80
So the issue here is that all the columns except country
and state
are multipled 5 times. Ideally I want something like this:
idealdf
name family votes city age state country
1 anne stevens 10.00 london 10 Oregon USA
2 paul williams 50.20 berlin 50 California USA
3 joseph ramos 15.00 lisbon 77 Oregon USA
4 vincent williams 62.00 berlin 43 California USA
5 johanna santos 0.61 london 56 Oregon USA
6 paul ramos 54.20 berlin 54 California USA
7 sara stevens 2.90 lisbon 20 Oregon USA
8 edith sanchez 54.10 paris 25 California USA
9 thomas santos 1.20 lisbon 45 Oregon USA
10 paul ramos 5.20 toronto 80 California USA
In the original dataset I have it multiple several hundred times and not 5 but the issue is the same. I tried a few things with pivoting and trying to clean the names with regular expressions but nothing really works well. Any pointers would be amazing - thanks!
CodePudding user response:
One way could be with bringing into long form with pivot_longer
. With str_remove
we remove all ...1
etc..
Finally bring back in wide format using the trick with group_by and row_number():
library(dplyr)
library(tidyr)
library(stringr)
dummydf %>%
mutate(across(, as.character)) %>%
pivot_longer(-c(country, state)) %>%
mutate(name = str_remove(name, '\\...\\d ')) %>%
group_by(name) %>%
mutate(id = row_number()) %>%
pivot_wider(names_from = name, values_from = value) %>%
select(-id)
country state name family votes city age
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 USA Oregon anne stevens 10 london 10
2 USA Oregon joseph ramos 15 lisbon 77
3 USA Oregon johanna santos 0.61 london 56
4 USA Oregon sara stevens 2.9 lisbon 20
5 USA Oregon thomas santos 1.2 lisbon 45
6 USA California paul williams 50.2 berlin 50
7 USA California vincent williams 62 berlin 43
8 USA California paul ramos 54.2 berlin 54
9 USA California edith sanchez 54.1 paris 25
10 USA California paul ramos 5.2 toronto 80
CodePudding user response:
Base R approach (which keeps original column types):
nam <- sub('...\\d$', '', names(dummydf))
as.data.frame(sapply(unique(nam), function(x)
unlist(dummydf[nam==x], use.names=F), simplify=F))
# country state name family votes city age
# 1 USA Oregon anne stevens 10.00 london 10
# 2 USA California paul williams 50.20 berlin 50
# 3 USA Oregon joseph ramos 15.00 lisbon 77
# 4 USA California vincent williams 62.00 berlin 43
# 5 USA Oregon johanna santos 0.61 london 56
# 6 USA California paul ramos 54.20 berlin 54
# 7 USA Oregon sara stevens 2.90 lisbon 20
# 8 USA California edith sanchez 54.10 paris 25
# 9 USA Oregon thomas santos 1.20 lisbon 45
# 10 USA California paul ramos 5.20 toronto 80
Using sapply
, for each unique column name (after removing the ...n
part) the appropriate subset of dummydf
is coerced to a vector and the resulting list is again coerced to a data frame. The shorter elements of the list for country and state columns are recycled as necessary. use.names=F
in unlist
could be omitted but that would result in a warning which I'd rather avoid.