I have a question about regular expressions in R.
I have the following data in a text file:
PUBLIC DOCUMENT COUNT: 1
FILED AS OF DATE: 20090527
DATE AS OF CHANGE: 20090527
GROUP MEMBERS: CAS, LLC
GROUP MEMBERS: AAS, INC.
GROUP MEMBERS: BCC, LLC
GROUP MEMBERS: A
SUBJECT COMPANY:
COMPANY DATA:
COMPANY CONFORMED NAME: ABC INC
CENTRAL INDEX KEY: 0000123456
STANDARD INDUSTRIAL CLASSIFICATION: AGRICULTURE CHEMICALS [1000]
IRS NUMBER: 52000000
STATE OF INCORPORATION: MD
FISCAL YEAR END: 1234
From this, I would like to extract the company name "ABC INC", which is three lines below "SUBJECT COMPANY". Using "SUBJECT COMPANY" in the regular expression is important because I want the code to be general: I need the company name that comes after "SUBJECT COMPANY".
I tried adding something after "(\SUBJECT\sCOMPANY)", but I couldn't come up with a pattern that captures "ABC INC".
Thank you very much in advance for your help!
CodePudding user response:
Perhaps something like this:
library(stringr)
# Read in the txt file as a vector with each row as an element
input <- readLines('myfile.txt')
# Locate the 'SUBJECT COMPANY' row
subject_company_row <- input |>
  str_detect('SUBJECT COMPANY') |>
  which()
# Extract the company name three rows below
input[subject_company_row + 3] |>
  str_extract('(?<=\\:).*') |>
  str_trim()
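If the number of lines between "SUBJECT COMPANY" and the name is not always the same, a fixed offset may pick the wrong row. A possible variation, sketched below in base R, looks for the first "COMPANY CONFORMED NAME:" line at or after the "SUBJECT COMPANY" line instead. The helper name extract_subject_company and the inline sample vector are made up for illustration; in practice input would come from readLines() as above, and the sketch assumes the label text is spelled exactly as in the sample.

```r
# Sample lines standing in for readLines('myfile.txt') (illustrative only)
input <- c(
  "GROUP MEMBERS: A",
  "SUBJECT COMPANY:",
  "",
  "COMPANY DATA:",
  "COMPANY CONFORMED NAME: ABC INC",
  "CENTRAL INDEX KEY: 0000123456"
)

# Hypothetical helper: returns the first conformed name found after
# the 'SUBJECT COMPANY' line, or NA if either marker is missing
extract_subject_company <- function(lines) {
  # Index of the first line containing 'SUBJECT COMPANY' (NA if none)
  start <- grep("SUBJECT COMPANY", lines, fixed = TRUE)[1]
  if (is.na(start)) return(NA_character_)

  # Only consider lines from the marker onwards
  after <- lines[start:length(lines)]

  # First line carrying the company name label (NA if none)
  name_line <- grep("COMPANY CONFORMED NAME:", after,
                    fixed = TRUE, value = TRUE)[1]

  # Drop everything up to and including the label, then trim whitespace
  trimws(sub("^.*?COMPANY CONFORMED NAME:", "", name_line, perl = TRUE))
}

company_name <- extract_subject_company(input)
print(company_name)
```

This keeps "SUBJECT COMPANY" as the anchor, as asked in the question, but it does not depend on blank lines or on how many rows sit between the anchor and the name.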