Extract sentences with specific word/patterns in it

Time:05-02

I'm trying to extract sentences that contain the word "privacy" or "Privacy". The sentences are in text stored in my data frame. The text is saved as a list of multiple character strings, because I'm working with a bunch of different files. I couldn't get it to work with grep, but I made it work using gsub. The problem now is that it only extracts the first sentence of the text and doesn't include the following ones. This is the code I'm using at the moment:

csv_edgar$privacy_1A <- gsub(".*?([^\\.]*(privacy|Privacy)[^\\.]*).*", "\\1", csv_edgar$item_1A, ignore.case = TRUE)

Text:

The Company employs information technology systems to support its business, including ongoing phased implementation of an ERP system as part of business transformation on a worldwide basis over the next several years. Security breaches and other disruptions to the Company’s information technology infrastructure could interfere with the Company’s operations, compromise information belonging to the Company and its customers, suppliers, and employees, exposing the Company to liability which could adversely impact the Company’s business and reputation. In the ordinary course of business, the Company relies on information technology networks and systems, some of which are managed by third parties, to process, transmit and store electronic information, and to manage or support a variety of business processes and activities. Additionally, the Company collects and stores certain data, including proprietary business information, and may have access to confidential or personal information in certain of our businesses that is subject to privacy and security laws, regulations and customer-imposed controls. Despite our cybersecurity measures (including employee and third-party training, monitoring of networks and systems, and maintenance of backup and protective systems) which are continuously reviewed and upgraded, the Company’s information technology networks and infrastructure may still be vulnerable to damage, disruptions or shutdowns due to attack by hackers or breaches, employee error or malfeasance, power outages, computer viruses, telecommunication or utility failures, systems failures, service providers including cloud services, natural disasters or other catastrophic events. It is possible for such vulnerabilities to remain undetected for an extended period, up to and including several years. 
While we have experienced, and expect to continue to experience, these types of threats to the Company’s information technology networks and infrastructure, none of them to date has had a material impact to the Company. There may be other challenges and risks as the Company upgrades and standardizes its ERP system on a worldwide basis. Any such events could result in legal claims or proceedings, liability or penalties under privacy laws, disruption in operations, and damage to the Company’s reputation, which could adversely affect the Company’s business. Although the Company maintains insurance coverage for various cybersecurity risks, there can be no guarantee that all costs or losses incurred will be fully insured.

CodePudding user response:

You could use str_extract_all from the stringr package with an alternation:

library(stringr)

regex <- "[A-Z][^.]+\\b(?:Privacy|privacy)\\b[^.]+\\."
sentences <- str_extract_all(input, regex)[[1]]

[1] "Additionally, the Company collects and stores certain data, including proprietary business information, and may have access to confidential or personal information in certain of our businesses that is subject to privacy and security laws, regulations and customer-imposed controls."
[2] "Any such events could result in legal claims or proceedings, liability or penalties under privacy laws, disruption in operations, and damage to the Company’s reputation, which could adversely affect the Company’s business."

In the snippet above, input is the sample text you provided in the question.
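
To run the same idea over every row of the data frame from the question, something along these lines might work. It is only a sketch: the small csv_edgar data frame below is made-up stand-in data (only the column names item_1A and privacy_1A come from the question), and joining the matched sentences with a single space is an assumption about the output you want.

```r
library(stringr)

# Made-up stand-in for the real data; only the column names come from the question.
csv_edgar <- data.frame(
  item_1A = c(
    "We collect data. It is subject to privacy laws. Costs may rise. New Privacy rules apply here.",
    "Nothing relevant here. Still nothing."
  ),
  stringsAsFactors = FALSE
)

# Same pattern as above: a capital letter, then text without a period
# on both sides of the whole word "privacy"/"Privacy", then the closing period.
regex <- "[A-Z][^.]+\\b(?:Privacy|privacy)\\b[^.]+\\."

# str_extract_all() returns one character vector of matches per input string.
matches <- str_extract_all(csv_edgar$item_1A, regex)

# Assumption: collapse each row's matches into one string; rows without a match become "".
csv_edgar$privacy_1A <- vapply(matches, paste, character(1), collapse = " ")

print(csv_edgar$privacy_1A)
```

If you would rather keep the sentences separate, you could store matches itself as a list column instead of collapsing it.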

CodePudding user response:

I suggest an awk command:

awk '/[pP]rivacy/{print}' RS="." input.txt

Result from the provided sample:

 Additionally, the Company collects and stores certain data, including proprietary business information, and may have access to confidential or personal information in certain of our businesses that is subject to privacy and security laws, regulations and customer-imposed controls
 Any such events could result in legal claims or proceedings, liability or penalties under privacy laws, disruption in operations, and damage to the Company’s reputation, which could adversely affect the Company’s business      
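
If you want to stay in R, roughly the same split-and-filter approach can be sketched with base functions. The sample string below is made up for illustration. As with RS="." in awk, the text is split on every period, so abbreviations and decimal numbers would also be cut apart.

```r
# Made-up sample text; in practice this would be one element of csv_edgar$item_1A.
text <- "We collect data. It is subject to privacy laws. Costs may rise. Privacy rules apply here."

# Split on literal periods, the same record separator the awk command uses.
parts <- strsplit(text, ".", fixed = TRUE)[[1]]

# Keep only the pieces that mention the word, then strip surrounding whitespace.
hits <- trimws(parts[grepl("[pP]rivacy", parts)])

print(hits)
```

Like the awk output, the extracted sentences lose their trailing period, because the separator itself is discarded by the split.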