I'm working on a database with hundreds of variables, however, as its origin is JSON, it's making me very difficult to organize it. For example, instead of the file bringing the information in the columns, it creates new lines. See the example.
df1 <- data_frame(ID = c(111,111,111,111,111,111,222,222,333),
NAME = c('JOHN','JOHN','MARY','MARY','JAMES','JAMES','WILL','WILL','MARK'),
ADRESS = c('NY','NY','NY','NY','ROMA','ROMA','LONDON','TOKYO',''),
COLOR = c('GREEN','GREEN','RED','RED','YELLOW','YELLOW','BLUE','BLUE','ORANGE'),
CAR = c('','','BMW','BMW','TRUCK','TRUCK','FORD','FORD','FERRARI'),
COUNTRY = c('USA','USA','USA','USA','USA','USA','USA','USA','USA'))
I would like to organize the file in a way that it is grouped by ID, as in the example below:
df2 <- data_frame(ID = c(111,222,333),
NAME1 = c('JOHN','WILL','MARK'),
NAME2 = c('MARY','',''),
NAME3 = c('JAMES','',''),
ADRESS1 = c('NY','LONDON',''),
ADRESS2 = c('NY','TOKYO',''),
ADRESS3 = c('ROMA','',''),
COLOR1 = c('GREEN','BLUE','ORANGE'),
COLOR2 = c('RED','',''),
COLOR3 = c('YELLOW','',''),
CAR1 = c('','FORD','FERRARI'),
CAR2 = c('BMW','',''),
CAR3 = c('TRUCK','',''),
COUNTRY = c('USA','USA','USA'))
However, note that the COUNTRY
variable does not need to have numerous columns (COUNTRY1, COUNTRY2, COUNTRY3) as the results are repeated. In my original file, I will find numerous situations like this.
How would I arrange the data evenly in df2?
CodePudding user response:
Perhaps we can try the following base R code using reshape
u <- reshape(
transform(
unique(df1),
GRP = ave(seq_along(ID), ID, FUN = seq_along)
),
direction = "wide",
idvar = "ID",
timevar = "GRP"
)
u[order(match(gsub("\\.\\d ", "", names(u)), names(df1)))]
which gives
> u[order(match(gsub("\\.\\d ", "", names(u)), names(df1)))]
ID NAME.1 NAME.2 NAME.3 ADRESS.1 ADRESS.2 ADRESS.3 COLOR.1 COLOR.2 COLOR.3
1 111 JOHN MARY JAMES NY NY ROMA GREEN RED YELLOW
7 222 WILL WILL <NA> LONDON TOKYO <NA> BLUE BLUE <NA>
9 333 MARK <NA> <NA> <NA> <NA> ORANGE <NA> <NA>
CAR.1 CAR.2 CAR.3 COUNTRY.1 COUNTRY.2 COUNTRY.3
1 BMW TRUCK USA USA USA
7 FORD FORD <NA> USA USA <NA>
9 FERRARI <NA> <NA> USA <NA> <NA>
CodePudding user response:
An option is also with pivot_wider
library(dplyr)
library(tidyr)
library(data.table)
distinct(df1) %>%
mutate(rn = rowid(ID)) %>%
pivot_wider(names_from = rn, values_from = NAME:CAR,
names_sep = "", values_fill = "") %>%
select(-COUNTRY, COUNTRY)
-output
# A tibble: 3 × 14
ID NAME1 NAME2 NAME3 ADRESS1 ADRESS2 ADRESS3 COLOR1 COLOR2 COLOR3 CAR1 CAR2 CAR3 COUNTRY
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 111 JOHN "MARY" "JAMES" "NY" "NY" "ROMA" GREEN "RED" "YELLOW" "" "BMW" "TRUCK" USA
2 222 WILL "WILL" "" "LONDON" "TOKYO" "" BLUE "BLUE" "" "FORD" "FORD" "" USA
3 333 MARK "" "" "" "" "" ORANGE "" "" "FERRARI" "" "" USA