Extract rows of strings between two rows of partial string matches (not inclusive of the matching strings)

I have a file that is essentially made up of rows of strings. I am trying to extract the rows that sit between two marker rows and write each section to its own file. The file looks like this:

**File Begins**

"Name: XXX_2" 
"Description:  Object 1210 , 111"
"Sampling_info: statexy=1346"
"Num value: 15"
"32 707; 33 71; 37 11; 38 3; 40 146; " 
"41 64; 42 36; 43 24; 44 69; 45 324; " 
"46 49; 47 52; 50 11; 51 90; 52 22; " 
"Name: XXX_3" 

**And then the next entry begins**

I want to get the rows of numbers between "Num value: 15" and "Name: XXX_3", excluding those two rows. This will go into a for loop that extracts every entry in the file. For now I am just trying to get one entry working so I can build the loop around it.

I tried str_match but it returns NA:

str_match(data, "Name: UNK_1\\s*(.*?)\\s*Name: UNK_2")

I also tried gsub, but it returned the whole file:

gsub(".*Name: UNK_1 (.+) Name: UNK_2.*", "\\1", data)

Is there something wrong with my implementation of str_match and gsub?

Thank you in advance!

CodePudding user response：

One approach without loops:

library(dplyr)
library(tidyr)

df <- read.delim('path_to_input_file/your_file.txt',
                 sep = ':', header = FALSE)

df %>%
    separate(V1, into = c('param', 'value'), sep = ' *: *') %>%
    filter(param == 'Name' | grepl(';', param)) %>%
    fill(value, .direction = 'down') %>%
    filter(param != 'Name') %>%
    separate_rows(param, sep = ' *; *')

## follow up with blank removal, conversion to numeric as needed

Output (the value column holds the name taken from the preceding "Name: XXX" row):

# A tibble: 18 x 2
   param    value   
   <chr>    <chr>   
 1 "32 707" "XXX_2 "
 2 "33 71"  "XXX_2 "
 3 "37 11"  "XXX_2 "
 4 "38 3"   "XXX_2 "
 5 "40 146" "XXX_2 "
 6 ""       "XXX_2 "

You might want to partition the above pipeline and inspect the intermediate dataframes to see what's going on at which step.
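To make that inspection concrete, here is a minimal sketch that runs the same pipeline one step at a time. It assumes the rows are already in a one-column data frame with the column named V1 (the name read.delim would assign). The step1 to step4 names are only illustrative, and fill = "right" is added so that rows without a colon get NA silently instead of triggering a warning.

```r
library(dplyr)
library(tidyr)

# Assumption: raw rows held in a single column V1, built by hand here
df <- data.frame(V1 = c(
  "Name: XXX_2",
  "Description:  Object 1210 , 111",
  "Sampling_info: statexy=1346",
  "Num value: 15",
  "32 707; 33 71; 37 11; 38 3; 40 146; ",
  "41 64; 42 36; 43 24; 44 69; 45 324; ",
  "46 49; 47 52; 50 11; 51 90; 52 22; ",
  "Name: XXX_3"
))

# Step 1: split "key: value" rows; rows without a colon get NA in value
step1 <- df %>%
  separate(V1, into = c("param", "value"), sep = " *: *", fill = "right")

# Step 2: keep only the Name rows and the rows of numbers
step2 <- step1 %>%
  filter(param == "Name" | grepl(";", param))

# Step 3: carry each name down onto the number rows below it
step3 <- step2 %>%
  fill(value, .direction = "down")

# Step 4: drop the Name rows and put each "a b" pair on its own row
step4 <- step3 %>%
  filter(param != "Name") %>%
  separate_rows(param, sep = " *; *")

# Print each intermediate result to see what every step contributes
print(step1)
print(step2)
print(step3)
print(step4)
```

Looking at step3 in particular shows why the fill() call matters: it is what attaches the entry name to every row of numbers before the Name rows are dropped.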

CodePudding user response：

What about something like this:

library(tidyverse)
# Build dataset
df <- data.frame(
  col1 = c("Name: XXX_2" ,
           "Description:  Object 1210 , 111",
           "Sampling_info: statexy=1346",
           "Num value: 15",
           "32 707; 33 71; 37 11; 38 3; 40 146; " ,
           "41 64; 42 36; 43 24; 44 69; 45 324; " ,
           "46 49; 47 52; 50 11; 51 90; 52 22; " ,
           "Name: XXX_3" ,
           "Shouldn't get this number: 8675309")
)

df %>%
  # Combine rows into a single string
  map_chr(paste, collapse = " ") %>%
  # Remove everything before "Num value:"
  str_extract(" Num value:.*") %>%
  # Keep everything up to the last "Name:" (greedy match)
  str_extract(" .*Name:") %>%
  # Extract runs of digits
  str_extract_all("\\d+") %>%
  unlist() %>%
  as.numeric()

# 15  32 707  33  71  37  11  38   3  40 146  41  64  42  36  43  24  44  69  45 324  46  49  47  52  50  11  51  90  52  22
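As for why the original str_match and gsub calls did not work: assuming data is a character vector with one element per row (for example from readLines), the regex is tested against each row separately, so a pattern that spans several rows never matches, and gsub hands back every row unchanged. Below is a minimal base R sketch that sidesteps this by working with row indices instead. It assumes the surrounding quotes are not part of each string and that every entry has exactly one "Num value" row. The second entry (XXX_3) is made-up data added only to show the loop over entries.

```r
# Assumption: one element per row, e.g. data <- readLines("your_file.txt")
data <- c(
  "Name: XXX_2",
  "Description:  Object 1210 , 111",
  "Sampling_info: statexy=1346",
  "Num value: 15",
  "32 707; 33 71; 37 11; 38 3; 40 146; ",
  "41 64; 42 36; 43 24; 44 69; 45 324; ",
  "46 49; 47 52; 50 11; 51 90; 52 22; ",
  "Name: XXX_3",                      # hypothetical second entry
  "Description:  Object 1211 , 112",  # hypothetical
  "Num value: 2",                     # hypothetical
  "10 5; 11 7; "                      # hypothetical
)

# No single row contains both markers, so a multi-row pattern cannot match
spans_rows <- any(grepl("Name: XXX_2.*Name: XXX_3", data))

# Locate the marker rows by position
name_idx <- grep("^Name:", data)
num_idx  <- grep("^Num value:", data)

# Each entry ends just before the next Name row, or at the end of the file
end_idx <- c(name_idx[-1] - 1, length(data))

# Take the rows strictly between "Num value" and the end of each entry
entries <- lapply(seq_along(name_idx), function(i) {
  if (num_idx[i] >= end_idx[i]) return(character(0))
  data[(num_idx[i] + 1):end_idx[i]]
})
names(entries) <- sub("^Name: *", "", data[name_idx])

print(spans_rows)
print(entries)
```

Each element of entries can then be written to its own file, for instance with writeLines(entries[[nm]], paste0(nm, ".txt")) inside a loop over names(entries).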