Extract rows of strings between two rows of partial string matches (not inclusive of the matching st

Time:12-02

I have a file that is essentially made up of rows of strings. I am trying to extract the sections of rows that sit between particular marker rows and write each section to its own file. The file looks like this:

**File Begins**

"Name: XXX_2" 
"Description:  Object 1210 , 111"
"Sampling_info: statexy=1346"
"Num value: 15"
"32 707; 33 71; 37 11; 38 3; 40 146; " 
"41 64; 42 36; 43 24; 44 69; 45 324; " 
"46 49; 47 52; 50 11; 51 90; 52 22; " 
"Name: XXX_3" 

**And then the next entry begins**

I want to get the numbers between "Num value: 15" and "Name: XXX_3" while excluding those two rows. This will go inside a for loop that extracts all the independent entries in the file. For now I am just trying to get one entry working so I can build the for loop around it.

I tried str_match but it returns NA:

str_match(data, "Name: UNK_1\\s*(.*?)\\s*Name: UNK_2")

I also tried gsub but it returned the whole file...:

gsub(".*Name: UNK_1 (.+) Name: UNK_2.*", "\\1", data)

Is there something wrong with my implementation of str_match and gsub?

Thank you in advance!

CodePudding user response:

One approach without loops:

library(dplyr)
library(tidyr)

df <- read.delim('path_to_input_file/your_file.txt',
                 sep = ':', header = FALSE)

df %>%
    separate(V1, into = c('param', 'value'), sep = ' *: *') %>%
    filter(param == 'Name' | grepl(';', param)) %>%
    fill(value, .direction = 'down') %>%
    filter(param != 'Name') %>%
    separate_rows(param, sep = ' *; *')

## follow up with blank removal, conversion to numeric as needed

Output (the value column holds the name taken from the preceding "Name: XXX" line):

# A tibble: 18 x 2
   param    value   
   <chr>    <chr>   
 1 "32 707" "XXX_2 "
 2 "33 71"  "XXX_2 "
 3 "37 11"  "XXX_2 "
 4 "38 3"   "XXX_2 "
 5 "40 146" "XXX_2 "
 6 ""       "XXX_2 "
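
The follow-up mentioned in the comment above (blank removal and conversion to numeric) could look roughly like the sketch below. It starts from a small hand-typed stand-in for the pipeline result rather than the real file, and the column names `x` and `y` are hypothetical, since the question does not say what the two numbers in each pair mean.

```r
library(dplyr)
library(tidyr)

# Hand-typed stand-in for a few rows of the pipeline output (assumption)
result <- data.frame(
  param = c("32 707", "33 71", ""),
  value = c("XXX_2 ", "XXX_2 ", "XXX_2 ")
)

cleaned <- result %>%
  # drop the empty piece left behind by the trailing ';'
  filter(param != "") %>%
  # split each pair on whitespace; convert = TRUE turns the pieces into integers
  separate(param, into = c("x", "y"), sep = " +", convert = TRUE) %>%
  # remove the trailing space carried over from the Name row
  mutate(value = trimws(value))
```
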

You might want to partition the above pipeline and inspect the intermediate dataframes to see what's going on at which step.
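
One way to do that inspection is to store each step in its own object, as in the sketch below. It assumes the rows have already been read into a single column `V1` (typed inline here instead of reading the file), and it adds `fill = "right"` to `separate()` only to silence the warning for data rows that contain no colon.

```r
library(dplyr)
library(tidyr)

# Inline stand-in for the input file, shortened to one data row (assumption)
df <- data.frame(V1 = c(
  "Name: XXX_2",
  "Description:  Object 1210 , 111",
  "Sampling_info: statexy=1346",
  "Num value: 15",
  "32 707; 33 71; 37 11; 38 3; 40 146; ",
  "Name: XXX_3"
))

# Step 1: split "key: value" rows; data rows have no colon, so value is NA
step1 <- separate(df, V1, into = c("param", "value"), sep = " *: *", fill = "right")

# Step 2: keep only the Name rows and the rows holding ';'-separated data
step2 <- filter(step1, param == "Name" | grepl(";", param))

# Step 3: carry each name down onto the data rows that follow it
step3 <- fill(step2, value, .direction = "down")

# Step 4: drop the Name rows, leaving data rows tagged with their entry name
step4 <- filter(step3, param != "Name")
```

Printing `step1` through `step4` in turn shows how the name ends up next to the data rows.
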

CodePudding user response:

What about something like this:

library(tidyverse)
# Build dataset
df <- data.frame(
  col1 = c("Name: XXX_2" ,
           "Description:  Object 1210 , 111",
           "Sampling_info: statexy=1346",
           "Num value: 15",
           "32 707; 33 71; 37 11; 38 3; 40 146; " ,
           "41 64; 42 36; 43 24; 44 69; 45 324; " ,
           "46 49; 47 52; 50 11; 51 90; 52 22; " ,
           "Name: XXX_3" ,
           "Shouldn't get this number: 8675309")
)

df %>%
  # Combine all rows into a single string
  map_chr(paste, collapse = " ") %>%
  # Remove everything before "Num value:"
  str_extract(" Num value:.*") %>%
  # Remove everything after "Name:"
  str_extract(" .*Name:") %>%
  # Extract digits
  str_extract_all("\\d+") %>%
  unlist() %>%
  as.numeric()

# 15  32 707  33  71  37  11  38   3  40 146  41  64  42  36  43  24  44  69  45 324  46  49  47  52  50  11  51  90  52  22
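
Note that the result above still starts with the 15 from the "Num value" row, which the question wanted to exclude. A base R sketch that works on row positions instead, and keeps only the rows strictly between each "Num value:" row and the next "Name:" row, might look like this. It assumes the file has been read into a character vector with one element per row (for example with `readLines()`), that the surrounding quotes are already stripped, and that every "Num value:" row is preceded by a "Name:" row. The variable names are illustrative.

```r
# Inline stand-in for the rows of the file (assumption)
lines <- c(
  "Name: XXX_2",
  "Description:  Object 1210 , 111",
  "Sampling_info: statexy=1346",
  "Num value: 15",
  "32 707; 33 71; 37 11; 38 3; 40 146; ",
  "41 64; 42 36; 43 24; 44 69; 45 324; ",
  "46 49; 47 52; 50 11; 51 90; 52 22; ",
  "Name: XXX_3"
)

starts   <- grep("^Num value:", lines)  # row just before each data block
names_at <- grep("^Name:", lines)       # row that opens each entry

entries <- lapply(starts, function(s) {
  # first "Name:" row after this block, or one past the last row if there is none
  nxt <- names_at[names_at > s]
  end <- if (length(nxt) > 0) nxt[1] else length(lines) + 1
  # rows strictly between the two markers
  if (end - s < 2) character(0) else lines[(s + 1):(end - 1)]
})

# Label each block with the most recent "Name:" row above it
names(entries) <- vapply(starts, function(s) {
  sub("^Name:\\s*", "", lines[max(names_at[names_at < s])])
}, character(1))
```

Because `entries` is a named list with one element per entry, no explicit for loop is needed to find the blocks; each one can then be written out with something like `writeLines(entries[[nm]], paste0(nm, ".txt"))` for each name `nm`.
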