How can I filter out sentences from a text file with specific word list on R or Python?

Time:01-18

I am struggling to filter the sentences of EDGAR S-1 financial disclosures against a specific term list in RStudio.

Here is an example text from an S-1 filing:

"We run the online operations of our institutions on different platforms, which are in various stages of development. 

The performance and reliability of these online operations are critical to the reputation of our institutions and our ability to attract and retain students. 

Any computer system error or failure, or a sudden and significant increase in traffic on our institutions' computer networks may result in the unavailability of these computer networks.

In addition, any significant failure of our computer networks could disrupt our on-campus operations.

Individual, sustained or repeated occurrences could significantly damage the reputation of our institutions' operations and result in a loss of potential or existing students.

Additionally, the computer systems and operations of our institutions are vulnerable to interruption or malfunction due to events beyond our control, including natural disasters and other catastrophic events and network and telecommunications failures.

The disaster recovery plans and backup systems that we have in place may not be effective in addressing a natural disaster or catastrophic event that results in the destruction or disruption of any of our critical business or information technology and infrastructure systems.

As a result of any of these events, we may not be able to conduct normal business operations and may be required to incur significant expenses in order to resume normal business operations.

As a result, our revenues and profitability may be materially adversely affected."

An example term list could look like the vector below.

terms_list = c("institutions", "disaster", "error",...)

The point is to edit and overwrite the current text files to remove sentences that do not include specific words or terms, such as the ones mentioned.

After filtering and overwriting, the text should look something like this:

"We run the online operations of our institutions on different platforms, which are in various stages of development. 

The performance and reliability of these online operations are critical to the reputation of our institutions and our ability to attract and retain students. 

Any computer system error or failure, or a sudden and significant increase in traffic on our institutions' computer networks may result in the unavailability of these computer networks. 

Individual, sustained or repeated occurrences could significantly damage the reputation of our institutions' operations and result in a loss of potential or existing students. 

Additionally, the computer systems and operations of our institutions are vulnerable to interruption or malfunction due to events beyond our control, including natural disasters and other catastrophic events and network and telecommunications failures. 

The disaster recovery plans and backup systems that we have in place may not be effective in addressing a natural disaster or catastrophic event that results in the destruction or disruption of any of our critical business or information technology and infrastructure systems. "

CodePudding user response:

If your data are one long string, in R you can:

  1. Split the string into sentences using stringr::str_split
  2. Use paste to combine the search terms into one pattern, and grep to find the sentences that match it
  3. Recombine the string

An example using your data, read in as:

strng <- "We run the online operations of our institutions on different platforms, which are in various stages of development. 

The performance and reliability of these online operations are critical to the reputation of our institutions and our ability to attract and retain students. 

Any computer system error or failure, or a sudden and significant increase in traffic on our institutions' computer networks may result in the unavailability of these computer networks.

In addition, any significant failure of our computer networks could disrupt our on-campus operations.

Individual, sustained or repeated occurrences could significantly damage the reputation of our institutions' operations and result in a loss of potential or existing students.

Additionally, the computer systems and operations of our institutions are vulnerable to interruption or malfunction due to events beyond our control, including natural disasters and other catastrophic events and network and telecommunications failures.

The disaster recovery plans and backup systems that we have in place may not be effective in addressing a natural disaster or catastrophic event that results in the destruction or disruption of any of our critical business or information technology and infrastructure systems.

As a result of any of these events, we may not be able to conduct normal business operations and may be required to incur significant expenses in order to resume normal business operations.

As a result, our revenues and profitability may be materially adversely affected."

Here each sentence is separated by \n\n, so we can split the string on that pattern. If your actual data use a different separator (e.g., a period), just replace the pattern.

strngSplit <- stringr::str_split(strng, "\n\n")[[1]]

# [1] "We run the online operations of our institutions on different platforms, which are in various stages of development. "                                                                                                                                                               
# [2] "The performance and reliability of these online operations are critical to the reputation of our institutions and our ability to attract and retain students. "                                                                                                                      
# [3] "Any computer system error or failure, or a sudden and significant increase in traffic on our institutions' computer networks may result in the unavailability of these computer networks."                                                                                           
# [4] "In addition, any significant failure of our computer networks could disrupt our on-campus operations."                                                                                                                                                                               
# [5] "Individual, sustained or repeated occurrences could significantly damage the reputation of our institutions' operations and result in a loss of potential or existing students."                                                                                                     
# [6] "Additionally, the computer systems and operations of our institutions are vulnerable to interruption or malfunction due to events beyond our control, including natural disasters and other catastrophic events and network and telecommunications failures."                        
# [7] "The disaster recovery plans and backup systems that we have in place may not be effective in addressing a natural disaster or catastrophic event that results in the destruction or disruption of any of our critical business or information technology and infrastructure systems."
# [8] "As a result of any of these events, we may not be able to conduct normal business operations and may be required to incur significant expenses in order to resume normal business operations."                                                                                       
# [9] "As a result, our revenues and profitability may be materially adversely affected."  
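
If the sentences are separated by sentence-ending punctuation rather than blank lines, the split could look roughly like the sketch below. It is written in Python, since the question allows either language. The sample text and the regular expression are illustrative assumptions, and this simple rule will wrongly split after abbreviations such as "Inc." or "U.S.", which are common in S-1 filings.

```python
import re

# Hypothetical sample text: sentences are separated by ". " instead of blank lines.
text = (
    "Any computer system error may result in downtime. "
    "In addition, failures could disrupt operations. "
    "The disaster recovery plans may not be effective."
)

# Split on whitespace that follows ., ! or ?; the lookbehind keeps the
# punctuation attached to the sentence it ends.
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

for i, sentence in enumerate(sentences, start=1):
    print(i, sentence)
```

For real filings, a dedicated sentence tokenizer is likely to be more robust than a hand-written pattern.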

Determine search terms

terms_list <- c("institutions", "disaster", "error")

Find sentences with search terms

idx <- grep(paste0(terms_list, collapse = "|"), strngSplit)
# [1] 1 2 3 5 6 7
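
Note that a pattern built this way matches substrings, so "error" would also match "terrorism", and "disaster" also matches "disasters". Whether that is desirable depends on the term list. The Python sketch below (example sentences are made up for illustration) contrasts substring matching with whole-word matching; the term escaping and the case-insensitive flag are assumptions, not part of the original answer.

```python
import re

terms_list = ["institutions", "disaster", "error"]

# Escape each term so regex metacharacters in a term are treated literally.
alternation = "|".join(map(re.escape, terms_list))

# Substring matching (the behaviour of the grep call above, but ignoring case).
loose = re.compile(alternation, re.IGNORECASE)
# Whole-word matching: \b requires a word boundary on both sides of the term.
strict = re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)

# Hypothetical sentences chosen to show the difference.
sentences = [
    "Any computer system error or failure may result in downtime.",
    "Terrorism could disrupt our on-campus operations.",
    "Natural disasters are beyond our control.",
]

kept_loose = [s for s in sentences if loose.search(s)]
kept_strict = [s for s in sentences if strict.search(s)]

print(len(kept_loose), len(kept_strict))
```

Whole-word matching drops the "Terrorism" sentence but also the "disasters" one, so plural forms would have to be added to the term list explicitly.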

You could keep it as a vector (every sentence in a position of a vector) or combine it back to a paragraph with:

strngVec <- strngSplit[idx]
# [1] "We run the online operations of our institutions on different platforms, which are in various stages of development. "                                                                                                                                                               
# [2] "The performance and reliability of these online operations are critical to the reputation of our institutions and our ability to attract and retain students. "                                                                                                                      
# [3] "Any computer system error or failure, or a sudden and significant increase in traffic on our institutions' computer networks may result in the unavailability of these computer networks."                                                                                           
# [4] "Individual, sustained or repeated occurrences could significantly damage the reputation of our institutions' operations and result in a loss of potential or existing students."                                                                                                     
# [5] "Additionally, the computer systems and operations of our institutions are vulnerable to interruption or malfunction due to events beyond our control, including natural disasters and other catastrophic events and network and telecommunications failures."                        
# [6] "The disaster recovery plans and backup systems that we have in place may not be effective in addressing a natural disaster or catastrophic event that results in the destruction or disruption of any of our critical business or information technology and infrastructure systems."

# or

strngParagraph <- paste(strngSplit[idx], collapse = "\n\n")
#[1] "We run the online operations of our institutions on different platforms, which are in various stages of development. \n\nThe performance and reliability of these online operations are critical to the reputation of our institutions and our ability to attract and retain students. \n\nAny computer system error or failure, or a sudden and significant increase in traffic on our institutions' computer networks may result in the unavailability of these computer networks.\n\nIndividual, sustained or repeated occurrences could significantly damage the reputation of our institutions' operations and result in a loss of potential or existing students.\n\nAdditionally, the computer systems and operations of our institutions are vulnerable to interruption or malfunction due to events beyond our control, including natural disasters and other catastrophic events and network and telecommunications failures.\n\nThe disaster recovery plans and backup systems that we have in place may not be effective in addressing a natural disaster or catastrophic event that results in the destruction or disruption of any of our critical business or information technology and infrastructure systems."
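
The question also asks about overwriting the text files, which the steps above do not cover. A minimal Python sketch of the whole pipeline, assuming UTF-8 files whose sentences are separated by blank lines, might look like this. The function name filter_file, the file name, and the demo text are hypothetical; matching is case-sensitive substring matching, mirroring the grep call above.

```python
import re
import tempfile
from pathlib import Path


def filter_file(path, terms, sep="\n\n"):
    """Keep only the blocks of `path` that mention any term, then overwrite the file."""
    # One alternation pattern; re.escape treats each term as literal text.
    pattern = re.compile("|".join(map(re.escape, terms)))
    text = Path(path).read_text(encoding="utf-8")
    # Split into sentence blocks and ignore empty ones.
    blocks = [b for b in text.split(sep) if b.strip()]
    kept = [b for b in blocks if pattern.search(b)]
    # Overwrite the original file with the surviving blocks.
    Path(path).write_text(sep.join(kept), encoding="utf-8")
    return kept


# Demo on a temporary file so no real filing is touched.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "s1_excerpt.txt"  # hypothetical file name
    demo.write_text(
        "Any system error may occur.\n\nRevenues may fall.\n\nA disaster may strike.",
        encoding="utf-8",
    )
    kept = filter_file(demo, ["institutions", "disaster", "error"])
    result = demo.read_text(encoding="utf-8")

print(result)
```

Because the file is overwritten in place, it is probably wise to run this on copies of the filings first.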