Splitting long text into multiple columns with matched phrases

Time:02-21

I have a very long text that contains the following paragraph:

"MEDICATIONS: 1. Versed 2 mg IV. 2. Fentanyl 100 mcg IV. 3. Heparin 5000 units IA. 4. Nitroglycerin 200 mcg IA. 5. Verapamil 5 mg IA. 6. Protamine 50 mg IV. CONTRAST: 61 mL Visipaque RADIATION DOSE: 10.1 min; 318 mGy
IMPRESSION: Status post left pterional craniotomy for clipping of a left middle cerebral artery trifurcation aneurysm with no evidence of residual aneurysm"

I would like to split it into 2 or more columns. Phrases to split on:

  • Starts with “MEDICATIONS:”
  • Starts with “IMPRESSION:”

Is there a way to do that using regex or spaCy in pandas?

Expected output:

MEDICATIONS | IMPRESSION
1. Versed 2 mg IV. 2. Fentanyl 100 mcg IV. 3. Heparin 5000 units IA. 4. Nitroglycerin 200 mcg IA. 5. Verapamil 5 mg IA. 6. Protamine 50 mg IV. CONTRAST: 61 mL Visipaque RADIATION DOSE: 10.1 min; 318 mGy | Status post left pterional craniotomy for clipping of a left middle cerebral artery trifurcation aneurysm with no evidence of residual aneurysm

CodePudding user response:

You could use pandas' Series.str.extract with named regex groups to pull out only the desired parts of the paragraph. Each named group becomes a column in the result.

import pandas as pd
import re

paragraphs = """MEDICATIONS: 1. Versed 2 mg IV. 2. Fentanyl 100 mcg IV. 3. Heparin 5000 units IA. 4. Nitroglycerin 200 mcg IA. 5. Verapamil 5 mg IA. 6. Protamine 50 mg IV. CONTRAST: 61 mL Visipaque RADIATION DOSE: 10.1 min; 318 mGy
IMPRESSION: Status post left pterional craniotomy for clipping of a left middle cerebral artery trifurcation aneurysm with no evidence of residual aneurysm"""

df = pd.DataFrame({'paragraphs':paragraphs}, index=[0])
print(df)

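# The non-capturing groups match the headers; the named groups become the column names.
# re.M lets ^ and $ match at the start and end of each line.
df1 = df['paragraphs'].str.extract(
    r'(?:^MEDICATIONS:)(?P<MEDICATIONS>.*?)\n'
    r'(?:^IMPRESSION:)(?P<IMPRESSION>.*?)$', flags=re.M, expand=True)
print(df1)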

Output from df1:

index | MEDICATIONS | IMPRESSION
0 | 1. Versed 2 mg IV. 2. Fentanyl 100 mcg IV. 3. Heparin 5000 units IA. 4. Nitroglycerin 200 mcg IA. 5. Verapamil 5 mg IA. 6. Protamine 50 mg IV. CONTRAST: 61 mL Visipaque RADIATION DOSE: 10.1 min; 318 mGy | Status post left pterional craniotomy for clipping of a left middle cerebral artery trifurcation aneurysm with no evidence of residual aneurysm
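
If the text can contain more section headers than the two written into the pattern, one option is to split on a list of headers instead. The following is only a minimal sketch and not part of the original answer: split_sections and HEADERS are hypothetical names, the sample text is a shortened version of the paragraph from the question, and the code assumes every header is followed by a colon and occurs at most once per text.

```python
import re

# Hypothetical list of section headers; extend it (e.g. with "CONTRAST") as needed.
HEADERS = ["MEDICATIONS", "IMPRESSION"]

def split_sections(text, headers=HEADERS):
    """Return {header: body} for every listed header found in text."""
    # The capturing group makes re.split keep the matched header names.
    pattern = r"\b(" + "|".join(re.escape(h) for h in headers) + r"):"
    parts = re.split(pattern, text)
    # parts is [text before the first header, header1, body1, header2, body2, ...]
    return {name: body.strip() for name, body in zip(parts[1::2], parts[2::2])}

# Shortened version of the paragraph from the question.
text = (
    "MEDICATIONS: 1. Versed 2 mg IV. 2. Fentanyl 100 mcg IV. "
    "CONTRAST: 61 mL Visipaque RADIATION DOSE: 10.1 min; 318 mGy\n"
    "IMPRESSION: Status post left pterional craniotomy"
)

sections = split_sections(text)
for name, body in sections.items():
    print(name, "->", body)
```

To get one column per header in pandas, the function can be applied row by row, for example pd.DataFrame(df['paragraphs'].map(split_sections).tolist(), index=df.index). Headers missing from a row should then show up as NaN in that row.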