Home > other >  Merging select consecutive rows in a data frame
Merging select consecutive rows in a data frame

Time:06-17

I have a data set of the following format

    ID 12345
    NAME JohnDoe
    INFO Info about the individual
    INFO Extra info about the individual

    ID 12346
    NAME JaneDoe
    INFO Info about the individual

    ID 12347
    NAME JohnSmith
    INFO Info about the individual

Right now I have a data frame that looks like this:

Column 1 Column 2
ID. 12345
Name. JohnDoe
Info Info about the individual
Info Extra info about the individual
ID. 12346
Name. JaneDoe
Info Info about the individual
ID. 12347
Name. JohnSmith
Info Info about the individual.

Where an ID and Name can have one or more associated pieces of information and entire items are separated by a blank row.

And I want to create a table that looks like:

ID. Name. Info.
12345 JohnDoe Info about the individual, extra info...
12346 JaneDoe Info about the individual.
12347 JohnSmith Info about the individual.

So far I've been able to filter the first frame to create columns of IDs and Names, but how can I group separate infos into one string to create a column with the same number of rows as the other two?

CodePudding user response:

Using tidyverse:

library(tidyverse)
df %>%
  mutate(person = cumsum(Column_1 == 'ID.')) %>%
  pivot_wider(person, names_from = Column_1, values_from = Column_2)

# A tibble: 3 x 4
  person ID.   Name. Info 
   <int> <chr> <chr> <chr>
1      1 ID#   Name  Info 
2      2 ID#   Name  Info 
3      3 ID#   Name  Info 

EDIT:

If you have a text file just do

a <- readLines('file.txt')
read.dcf(textConnection(gsub("(?m)^(\\w ) ", '\\1:', a, perl = TRUE)), all=TRUE)
     ID      NAME                                                       INFO
1 12345   JohnDoe Info about the individual, Extra info about the individual
2 12346   JaneDoe                                  Info about the individual
3 12347 JohnSmith                                  Info about the individual

where the file.txt looks like given in the question.

With the updated df you could do:

df %>%
   filter(Column.1!='') %>%
   group_by(grp = cumsum(Column.1=='ID.'), name=Column.1) %>%
   summarise(value = toString(Column.2), .groups = 'drop') %>%
   pivot_wider(grp)

# A tibble: 3 x 4
    grp ID.   Info                                                       Name.    
  <int> <chr> <chr>                                                      <chr>    
1     1 12345 Info about the individual, Extra info about the individual JohnDoe  
2     2 12346 Info about the individual                                  JaneDoe  
3     3 12347 Info about the individual.                                 JohnSmith

CodePudding user response:

Here is an alternative approach:

library(dplyr)
library(tidyr)

df %>% 
  filter_all(any_vars(.!="")) %>% 
  group_by(Column1) %>% 
  mutate(row = row_number()) %>%
  pivot_wider(names_from = Column1, values_from = Column2, values_fn = list) %>% 
  unnest(cols = c(ID., Name., Info)) %>% 
  select(-row)
 ID.   Name.     Info                           
  <chr> <chr>     <chr>                          
1 12345 JohnDoe   Info about the individual      
2 12346 JaneDoe   Extra info about the individual
3 12347 JohnSmith Info about the individual      
4 NA    NA        Info about the individual.    
  • Related