Splitting a long text dataframe column into multiple columns with matched phrases

Time:02-22

I have a dataframe with a column that holds a very long text per row. It looks like this:

ID text
id1 DIAGNOSTIC CEREBRAL ANGIOGRAM DATE: 8/26/2005 INDICATION: 78-year-old man with a history of shunted normal pressure hydrocephalus who more recently has been managed for a right-sided subdural hematoma. This was initially managed conservatively in the acute phase but progressed to an enlarging chronic subdural hematoma that was ultimately treated with burr hole drainage. Middle meningeal artery embolization was recommended to minimize the risk of future recurrence. COMPARISON: CT brain 8/24/2003 and CT brain MEDICATIONS: 1. Heparin 3500 units IV. 2. Nitroglycerin 200 mcg IA. 3. Verapamil 5 mg IA. 4. See anesthesia records for additional medications administered. CONTRAST: 150 mL Visipaque RADIATION DOSE: 16.3 min; 587.7 mGy IMPRESSION: Successful particle and coil embolization of the parietal branch of the right middle meningeal artery for treatment of a right-sided chronic subdural hematoma.

I would like to split this column into multiple columns. Phrases to split on:

  • Starts with “DATE:”
  • Starts with “MEDICATIONS:”
  • Starts with “IMPRESSION:”
  • Starts with “INDICATION:”
  • Starts with “COMPARISON:”

I need the final dataframe to look like this

id DATE INDICATION COMPARISON MEDICATIONS IMPRESSION
id1 8/26/2005 78-year-old man with a history of shunted normal pressure hydrocephalus who more recently has been managed for a right-sided subdural hematoma. This was initially managed conservatively in the acute phase but progressed to an enlarging chronic subdural hematoma that was ultimately treated with burr hole drainage. Middle meningeal artery embolization was recommended to minimize the risk of future recurrence. CT brain 8/24/2003 and CT brain 8/26/2003 1. Heparin 3500 units IV. 2. Nitroglycerin 200 mcg IA. 3. Verapamil 5 mg IA. 4. See anesthesia records for additional medications administered. CONTRAST: 150 mL Visipaque RADIATION DOSE: 16.3 min; 587.7 mGy Status post left pterional craniotomy for clipping of a left middle cerebral artery trifurcation aneurysm with no evidence of residual aneurysm

CodePudding user response:

You could use pandas Series.str.extract with a regular expression that uses named capture groups. Each group name becomes a column, so only the desired parts of the paragraph are extracted.

import pandas as pd
import re

paragraphs = """DIAGNOSTIC CEREBRAL ANGIOGRAM DATE: 8/26/2005 INDICATION: 78-year-old man with a history of shunted normal pressure hydrocephalus who more recently has been managed for a right-sided subdural hematoma. This was initially managed conservatively in the acute phase but progressed to an enlarging chronic subdural hematoma that was ultimately treated with burr hole drainage. Middle meningeal artery embolization was recommended to minimize the risk of future recurrence. COMPARISON: CT brain 8/24/2003 and CT brain MEDICATIONS: 1. Heparin 3500 units IV. 2. Nitroglycerin 200 mcg IA. 3. Verapamil 5 mg IA. 4. See anesthesia records for additional medications administered. CONTRAST: 150 mL Visipaque RADIATION DOSE: 16.3 min; 587.7 mGy IMPRESSION: Successful particle and coil embolization of the parietal branch of the right middle meningeal artery for treatment of a right-sided chronic subdural hematoma."""

df = pd.DataFrame({'paragraphs':paragraphs}, index=[0])
print(df)

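# Each named group lazily (.*?) captures the text between two consecutive headers;
# the group names become the column names, and \s* trims the surrounding spaces.
df1 = df['paragraphs'].str.extract(
    r'(?:DATE:)\s*(?P<DATE>.*?)\s*'
    r'(?:INDICATION:)\s*(?P<INDICATION>.*?)\s*'
    r'(?:COMPARISON:)\s*(?P<COMPARISON>.*?)\s*'
    r'(?:MEDICATIONS:)\s*(?P<MEDICATIONS>.*?)\s*'
    r'(?:IMPRESSION:)\s*(?P<IMPRESSION>.*?)\s*$', flags=re.M, expand=True)
print(df1)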

Output from df1

index DATE INDICATION COMPARISON MEDICATIONS IMPRESSION
0 8/26/2005 78-year-old man with a history of shunted normal pressure hydrocephalus who more recently has been managed for a right-sided subdural hematoma. This was initially managed conservatively in the acute phase but progressed to an enlarging chronic subdural hematoma that was ultimately treated with burr hole drainage. Middle meningeal artery embolization was recommended to minimize the risk of future recurrence. CT brain 8/24/2003 and CT brain 1. Heparin 3500 units IV. 2. Nitroglycerin 200 mcg IA. 3. Verapamil 5 mg IA. 4. See anesthesia records for additional medications administered. CONTRAST: 150 mL Visipaque RADIATION DOSE: 16.3 min; 587.7 mGy Successful particle and coil embolization of the parietal branch of the right middle meningeal artery for treatment of a right-sided chronic subdural hematoma.
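
The single-pattern approach above assumes every row contains all five headers in the same order; a row that lacks one of them would come back as all NaN. If that assumption might not hold for your data, a possible alternative is to split each text on the header names and collect whatever sections are present. The following is only a sketch: the helper name split_sections, the HEADERS list, and the two shortened sample rows (including id2) are made up for illustration and are not part of the original question.

```python
import re
import pandas as pd

# Headers to split on, taken from the question; extend the list as needed.
HEADERS = ["DATE", "INDICATION", "COMPARISON", "MEDICATIONS", "IMPRESSION"]

# Matches any one known header followed by a colon, e.g. "DATE:".
# The capturing group makes re.split keep the header names in its result.
HEADER_RE = re.compile(r"\b(" + "|".join(HEADERS) + r"):")

def split_sections(text):
    """Return {header: content} for every known header found in text."""
    parts = HEADER_RE.split(text)
    # parts looks like [preamble, header1, content1, header2, content2, ...]
    return {h: c.strip() for h, c in zip(parts[1::2], parts[2::2])}

# Hypothetical, shortened sample rows (not the real report text).
df = pd.DataFrame({
    "ID": ["id1", "id2"],
    "text": [
        "ANGIOGRAM DATE: 8/26/2005 INDICATION: Subdural hematoma. "
        "IMPRESSION: Successful embolization.",
        "IMPRESSION: No residual aneurysm. DATE: 1/2/2006",
    ],
})

# One dict per row -> one column per header; headers missing from a row become NaN.
sections = pd.DataFrame(df["text"].map(split_sections).tolist(), index=df.index)
result = df[["ID"]].join(sections.reindex(columns=HEADERS))
print(result)
```

Because only the listed headers are split on, other labels such as CONTRAST: or RADIATION DOSE: stay inside the section that precedes them, which matches the desired output in the question. If a header occurs more than once in a text, this sketch keeps only the last occurrence.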