I have a text file in which every row is a string. The file holds a series of entries, each starting with a "Name:" row, and I am trying to extract the data rows of each entry into its own file. The file looks like this:
**File Begins**
"Name: XXX_2"
"Description: Object 1210 , 111"
"Sampling_info: statexy=1346"
"Num value: 15"
"32 707; 33 71; 37 11; 38 3; 40 146; "
"41 64; 42 36; 43 24; 44 69; 45 324; "
"46 49; 47 52; 50 11; 51 90; 52 22; "
"Name: XXX_3"
**And then the next entry begins**
I want to get the numbers between "Num value: 15" and "Name: XXX_3", excluding those two rows. This will eventually go into a for loop that extracts every entry in the file; for now I am just trying to get a single entry working so I can build the loop around it.
I tried str_match but it returns NA:
str_match(data, "Name: UNK_1\\s*(.*?)\\s*Name: UNK_2")
I also tried gsub, but it returned the whole file:
gsub(".*Name: UNK_1 (.+) Name: UNK_2.*", "\\1", data)
Is there something wrong with my implementation of str_match and gsub?
Thank you in advance!
CodePudding user response:
One approach without loops:
library(dplyr)
library(tidyr)
df <- read.delim('path_to_input_file/your_file.txt',
                 sep = ':', header = FALSE)
df %>%
  separate(V1, into = c('param', 'value'), sep = ' *: *') %>%
  filter(param == 'Name' | grepl(';', param)) %>%
  fill(value, .direction = 'down') %>%
  filter(param != 'Name') %>%
  separate_rows(param, sep = ' *; *')
## follow up with blank removal, conversion to numeric as needed
Output (the value column contains the name from the preceding "Name: XXX" line):
# A tibble: 18 x 2
  param    value
  <chr>    <chr>
1 "32 707" "XXX_2 "
2 "33 71"  "XXX_2 "
3 "37 11"  "XXX_2 "
4 "38 3"   "XXX_2 "
5 "40 146" "XXX_2 "
6 ""       "XXX_2 "
You might want to partition the above pipeline and inspect the intermediate dataframes to see what's going on at which step.
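The follow-up mentioned in the code comment (blank removal, conversion to numeric) could look roughly like the sketch below. It is only an illustration: `res` is a small hand-made stand-in for the pipeline result, and the column names `num1`, `num2` and `name` are made up here, since the thread does not say what the two numbers in each pair mean.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the tibble produced by the pipeline above
res <- tibble(
  param = c("32 707", "33 71", "", "41 64"),
  value = c("XXX_2 ", "XXX_2 ", "XXX_2 ", "XXX_2 ")
)

clean <- res %>%
  # Strip stray leading/trailing whitespace from both columns
  mutate(param = trimws(param), value = trimws(value)) %>%
  # Drop the empty pieces left behind by the trailing "; " on each row
  filter(param != "") %>%
  # Split each "a b" pair into two columns; convert = TRUE turns them into integers
  separate(param, into = c("num1", "num2"), sep = " +", convert = TRUE) %>%
  rename(name = value)

clean
```

Running each step on its own (for example, stopping after `filter()`) is an easy way to check that nothing but the blank rows is being dropped.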
CodePudding user response:
What about something like this:
library(tidyverse)
# Build dataset
df <- data.frame(
  col1 = c("Name: XXX_2",
           "Description: Object 1210 , 111",
           "Sampling_info: statexy=1346",
           "Num value: 15",
           "32 707; 33 71; 37 11; 38 3; 40 146; ",
           "41 64; 42 36; 43 24; 44 69; 45 324; ",
           "46 49; 47 52; 50 11; 51 90; 52 22; ",
           "Name: XXX_3",
           "Shouldn't get this number: 8675309")
)
df %>%
  # Combine the rows into a single string
  map_chr(paste, collapse = " ") %>%
  # Remove everything before "Num value:"
  str_extract(" Num value:.*") %>%
  # Remove everything after "Name:"
  str_extract(" .*Name:") %>%
  # Extract runs of digits
  str_extract_all("\\d+") %>%
  unlist() %>%
  as.numeric()
# 15 32 707 33 71 37 11 38 3 40 146 41 64 42 36 43 24 44 69 45 324 46 49 47 52 50 11 51 90 52 22
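Note that the result above still starts with the 15 from the "Num value:" row, which the question wanted to exclude. If a loop over all entries is what you are after, a base R sketch along these lines might be a starting point. It makes several assumptions that are not confirmed by the thread: the lines are already in a character vector (as `readLines()` would return them) without the surrounding quotes, every entry has exactly one "Num value:" row followed by at least one data row, and the second entry's contents are invented purely so the loop has something to iterate over.

```r
# Sample lines; the XXX_3 entry's rows are made up for illustration only
lines <- c(
  "Name: XXX_2",
  "Description: Object 1210 , 111",
  "Sampling_info: statexy=1346",
  "Num value: 15",
  "32 707; 33 71; 37 11; 38 3; 40 146; ",
  "41 64; 42 36; 43 24; 44 69; 45 324; ",
  "46 49; 47 52; 50 11; 51 90; 52 22; ",
  "Name: XXX_3",
  "Num value: 2",
  "1 5; 2 6; "
)

name_idx <- grep("^Name:", lines)       # first row of each entry
num_idx  <- grep("^Num value:", lines)  # row just before each entry's data
# Each entry's data ends right before the next "Name:" row, or at the end of the file
end_idx  <- c(name_idx[-1] - 1, length(lines))

entries <- list()
for (i in seq_along(name_idx)) {
  # Rows strictly between "Num value:" and the next "Name:"
  block <- lines[(num_idx[i] + 1):end_idx[i]]
  # Pull out every run of digits and convert to numeric
  nums <- as.numeric(unlist(regmatches(block, gregexpr("[0-9]+", block))))
  entry_name <- sub("^Name: *", "", lines[name_idx[i]])
  entries[[entry_name]] <- nums
}

entries$XXX_2
```

Because the row positions are found with `grep()` rather than one regex over the whole file, each pattern only has to match a single line, which sidesteps the problem of a pattern like the one in the question being tested against every line separately. Writing each element of `entries` to its own file, for example with `writeLines()`, could then be added inside the loop.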