I have an export of a data which lists a hierarchical structure but it's not really usable for my purposes. Here is an example of the data:
Level | Number |
---|---|
1 | CO123 |
2 | PN123 |
2 | PN124 |
1 | CO124 |
2 | PN125 |
2 | PN126 |
1 | CO125 |
2 | PN127 |
2 | PN128 |
I want it to look like this:
Parent | Child |
---|---|
CO123 | PN123 |
CO123 | PN124 |
CO124 | PN125 |
CO124 | PN126 |
CO125 | PN127 |
CO125 | PN128 |
I have a lot more rows with varying amounts of "children" for each parent. However, the parent only shows up in the data set once, while the children could be structured to multiple parents.
I'm trying to run some analysis on each parent, like the average number of children.
Thanks in advance.
CodePudding user response:
We may use pivot_wider
to reshape to 'wide' format
library(dplyr)
library(tidyr)
library(data.table)
df1 %>%
mutate(nm1 = case_when(Level == 1 ~ "Parent", TRUE ~ "Child"),
rn = rowid(Level)) %>%
pivot_wider(names_from = nm1, values_from = Number) %>%
fill(everything(), .direction = "downup") %>%
filter(Level != 1) %>%
select(-rn,-Level)
-output
# A tibble: 6 × 2
Parent Child
<chr> <chr>
1 CO123 PN123
2 CO123 PN124
3 CO124 PN125
4 CO124 PN126
5 CO125 PN127
6 CO125 PN128
data
df1 <- structure(list(Level = c(1L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 2L),
Number = c("CO123", "PN123", "PN124", "CO124", "PN125", "PN126",
"CO125", "PN127", "PN128")), class = "data.frame", row.names = c(NA,
-9L))
CodePudding user response:
Another option with tidyverse
using fill
and filter
:
library(tidyverse)
df %>%
mutate(Parent = ifelse(Level == 1, Number, NA)) %>%
fill(Parent, .direction = "down") %>%
filter(Level != 1) %>%
select(Parent, Child = Number)
Output
Parent Child
1 CO123 PN123
2 CO123 PN124
3 CO124 PN125
4 CO124 PN126
5 CO125 PN127
6 CO125 PN128
Data
df <- structure(list(Level = c(1L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 2L),
Number = c("CO123", "PN123", "PN124", "CO124", "PN125", "PN126",
"CO125", "PN127", "PN128")), class = "data.frame", row.names = c(NA,
-9L))
CodePudding user response:
My answer differs from the others because it is an attempt to implement a more general approach to handle multiple levels. Namely, the approach consists in finding the closest row with level
less than current level
library(dplyr)
df <- structure(list(V1 = c(1, 2, 2, 1, 2, 2, 1, 2, 2), V2 = c("CO123",
"PN123", "PN124", "CO124", "PN125", "PN126", "CO125", "PN127",
"PN128")), class = "data.frame", row.names = c(NA, -9L))
attach(df)
m <- outer(V1, V1, "<")
which(m & upper.tri(m), arr.ind = TRUE) |>
data.frame() |>
group_by(col) |>
summarize(row = max(row)) |>
transmute(parent = V2[row],
child = V2[col])
##> # A tibble: 6 × 2
##> parent child
##> <chr> <chr>
##> 1 CO123 PN123
##> 2 CO123 PN124
##> 3 CO124 PN125
##> 4 CO124 PN126
##> 5 CO125 PN127
##> 6 CO125 PN128
It also works on a multi-level hierarchy e.g.:
h <- data.frame(V1 = c(1,2,3,3,2,3,3,2,2,1,2,3),
V2 = toupper(letters[1:12]))
h
##> V1 V2
##> 1 1 A
##> 2 2 B
##> 3 3 C
##> 4 3 D
##> 5 2 E
##> 6 3 F
##> 7 3 G
##> 8 2 H
##> 9 2 I
##> 10 1 J
##> 11 2 K
##> 12 3 L
attach(h)
m <- outer(V1, V1, "<")
which(m & upper.tri(m), arr.ind = TRUE) |>
data.frame() |>
group_by(col) |>
summarize(row = max(row)) |>
transmute(parent = V2[row],
child = V2[col])
##> # A tibble: 10 × 2
##> parent child
##> <chr> <chr>
##> 1 A B
##> 2 B C
##> 3 B D
##> 4 A E
##> 5 E F
##> 6 E G
##> 7 A H
##> 8 A I
##> 9 J K
##> 10 K L