I have a data set of the following format
ID 12345
NAME JohnDoe
INFO Info about the individual
INFO Extra info about the individual
ID 12346
NAME JaneDoe
INFO Info about the individual
ID 12347
NAME JohnSmith
INFO Info about the individual
Right now I have a data frame that looks like this:
Column 1 | Column 2 |
---|---|
ID. | 12345 |
Name. | JohnDoe |
Info | Info about the individual |
Info | Extra info about the individual |
ID. | 12346 |
Name. | JaneDoe |
Info | Info about the individual |
ID. | 12347 |
Name. | JohnSmith |
Info | Info about the individual. |
Where an ID and Name can have one or more associated pieces of information and entire items are separated by a blank row.
And I want to create a table that looks like:
ID. | Name. | Info. |
---|---|---|
12345 | JohnDoe | Info about the individual, extra info... |
12346 | JaneDoe | Info about the individual. |
12347 | JohnSmith | Info about the individual. |
So far I've been able to filter the first frame to create columns of IDs and Names, but how can I group separate infos into one string to create a column with the same number of rows as the other two?
CodePudding user response:
Using tidyverse
:
library(tidyverse)
df %>%
mutate(person = cumsum(Column_1 == 'ID.')) %>%
pivot_wider(person, names_from = Column_1, values_from = Column_2)
# A tibble: 3 x 4
person ID. Name. Info
<int> <chr> <chr> <chr>
1 1 ID# Name Info
2 2 ID# Name Info
3 3 ID# Name Info
EDIT:
If you have a text file just do
a <- readLines('file.txt')
read.dcf(textConnection(gsub("(?m)^(\\w ) ", '\\1:', a, perl = TRUE)), all=TRUE)
ID NAME INFO
1 12345 JohnDoe Info about the individual, Extra info about the individual
2 12346 JaneDoe Info about the individual
3 12347 JohnSmith Info about the individual
where the file.txt
looks like given in the question.
With the updated df
you could do:
df %>%
filter(Column.1!='') %>%
group_by(grp = cumsum(Column.1=='ID.'), name=Column.1) %>%
summarise(value = toString(Column.2), .groups = 'drop') %>%
pivot_wider(grp)
# A tibble: 3 x 4
grp ID. Info Name.
<int> <chr> <chr> <chr>
1 1 12345 Info about the individual, Extra info about the individual JohnDoe
2 2 12346 Info about the individual JaneDoe
3 3 12347 Info about the individual. JohnSmith
CodePudding user response:
Here is an alternative approach:
library(dplyr)
library(tidyr)
df %>%
filter_all(any_vars(.!="")) %>%
group_by(Column1) %>%
mutate(row = row_number()) %>%
pivot_wider(names_from = Column1, values_from = Column2, values_fn = list) %>%
unnest(cols = c(ID., Name., Info)) %>%
select(-row)
ID. Name. Info
<chr> <chr> <chr>
1 12345 JohnDoe Info about the individual
2 12346 JaneDoe Extra info about the individual
3 12347 JohnSmith Info about the individual
4 NA NA Info about the individual.