Matching / Finding characters in R using Regular Expression

Time:01-10

I have a question regarding Regular Expression using R.

I have the following data in a txt file.

PUBLIC DOCUMENT COUNT:      1
FILED AS OF DATE:       20090527
DATE AS OF CHANGE:      20090527
GROUP MEMBERS:      CAS, LLC
GROUP MEMBERS:      AAS, INC.
GROUP MEMBERS:      BCC, LLC
GROUP MEMBERS:      A

SUBJECT COMPANY:    

    COMPANY DATA:   
        COMPANY CONFORMED NAME:         ABC INC
        CENTRAL INDEX KEY:          0000123456
        STANDARD INDUSTRIAL CLASSIFICATION: AGRICULTURE CHEMICALS [1000]
        IRS NUMBER:             52000000
        STATE OF INCORPORATION:         MD
        FISCAL YEAR END:            1234

From this, I would like to extract the company name "ABC INC", which is three lines below "SUBJECT COMPANY". Using "SUBJECT COMPANY" within the regular expression is important because I want the code to be general: I need the company name that comes after "SUBJECT COMPANY".

I tried to add something after "(\SUBJECT\sCOMPANY)", but I couldn't come up with a pattern that captures "ABC INC".

Thank you very much in advance for your help!

CodePudding user response:

Perhaps something like this:

library(stringr)

# Read in the txt file as a vector with each row as an element
input <- readLines('myfile.txt')

# Locate the 'SUBJECT COMPANY' row
subject_company_row <- input |>
  str_detect('SUBJECT COMPANY') |>
  which()

# Extract the company name three rows below
input[subject_company_row + 3] |>
  str_extract('(?<=\\:).*') |>
  str_trim()
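
If you would rather anchor on "SUBJECT COMPANY" inside a single regular expression, as the question asks, something along these lines might work in base R. This is only a sketch: it assumes the name always sits on a "COMPANY CONFORMED NAME:" line somewhere after the "SUBJECT COMPANY:" line, and the `txt` vector is a made-up, shortened stand-in for the file (in practice you would use `readLines('myfile.txt')`).

```r
# Hypothetical sample mirroring the layout in the question (shortened);
# replace with readLines('myfile.txt') for a real file
txt <- c(
  "GROUP MEMBERS:      A",
  "",
  "SUBJECT COMPANY:    ",
  "",
  "    COMPANY DATA:   ",
  "        COMPANY CONFORMED NAME:         ABC INC",
  "        CENTRAL INDEX KEY:          0000123456"
)

# Collapse the lines into one string so the pattern can span line breaks
doc <- paste(txt, collapse = "\n")

# (?s) lets '.' match newlines; '.*?' lazily skips ahead to the first
# 'COMPANY CONFORMED NAME:' after 'SUBJECT COMPANY:'; the capture group
# takes the rest of that line
pattern <- "(?s)SUBJECT COMPANY:.*?COMPANY CONFORMED NAME:[ \\t]*([^\\n]*)"

# regexec() returns the full match followed by the capture groups
m <- regmatches(doc, regexec(pattern, doc, perl = TRUE))[[1]]

# The second element is the captured name; it is NA if nothing matched
company_name <- trimws(m[2])
company_name
```

Unlike the fixed offset of three rows, this does not depend on how many lines sit between the two labels, but it does depend on the "COMPANY CONFORMED NAME:" label being present.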