I have a DataFrame with a column that holds a very long text per row. It looks like this:
ID | text |
---|---|
id1 | DIAGNOSTIC CEREBRAL ANGIOGRAM DATE: 8/26/2005 INDICATION: 78-year-old man with a history of shunted normal pressure hydrocephalus who more recently has been managed for a right-sided subdural hematoma. This was initially managed conservatively in the acute phase but progressed to an enlarging chronic subdural hematoma that was ultimately treated with burr hole drainage. Middle meningeal artery embolization was recommended to minimize the risk of future recurrence. COMPARISON: CT brain 8/24/2003 and CT brain MEDICATIONS: 1. Heparin 3500 units IV. 2. Nitroglycerin 200 mcg IA. 3. Verapamil 5 mg IA. 4. See anesthesia records for additional medications administered. CONTRAST: 150 mL Visipaque RADIATION DOSE: 16.3 min; 587.7 mGy IMPRESSION: Successful particle and coil embolization of the parietal branch of the right middle meningeal artery for treatment of a right-sided chronic subdural hematoma. |
I would like to split this column into multiple columns. The phrases to split on are:
- Starts with "DATE:"
- Starts with "MEDICATIONS:"
- Starts with "IMPRESSION:"
- Starts with "INDICATION:"
- Starts with "COMPARISON:"
I need the final DataFrame to look like this:
id | DATE | INDICATION | COMPARISON | MEDICATIONS | IMPRESSION |
---|---|---|---|---|---|
id1 | 8/26/2005 | 78-year-old man with a history of shunted normal pressure hydrocephalus who more recently has been managed for a right-sided subdural hematoma. This was initially managed conservatively in the acute phase but progressed to an enlarging chronic subdural hematoma that was ultimately treated with burr hole drainage. Middle meningeal artery embolization was recommended to minimize the risk of future recurrence. | CT brain 8/24/2003 and CT brain 8/26/2003 | 1. Heparin 3500 units IV. 2. Nitroglycerin 200 mcg IA. 3. Verapamil 5 mg IA. 4. See anesthesia records for additional medications administered. CONTRAST: 150 mL Visipaque RADIATION DOSE: 16.3 min; 587.7 mGy | Status post left pterional craniotomy for clipping of a left middle cerebral artery trifurcation aneurysm with no evidence of residual aneurysm |
CodePudding user response:
You could use pandas Series.str.extract together with Python regex named groups to extract only the desired parts of the paragraph. Each named group becomes a column in the result.
import pandas as pd
import re
paragraphs = """DIAGNOSTIC CEREBRAL ANGIOGRAM DATE: 8/26/2005 INDICATION: 78-year-old man with a history of shunted normal pressure hydrocephalus who more recently has been managed for a right-sided subdural hematoma. This was initially managed conservatively in the acute phase but progressed to an enlarging chronic subdural hematoma that was ultimately treated with burr hole drainage. Middle meningeal artery embolization was recommended to minimize the risk of future recurrence. COMPARISON: CT brain 8/24/2003 and CT brain MEDICATIONS: 1. Heparin 3500 units IV. 2. Nitroglycerin 200 mcg IA. 3. Verapamil 5 mg IA. 4. See anesthesia records for additional medications administered. CONTRAST: 150 mL Visipaque RADIATION DOSE: 16.3 min; 587.7 mGy IMPRESSION: Successful particle and coil embolization of the parietal branch of the right middle meningeal artery for treatment of a right-sided chronic subdural hematoma."""
df = pd.DataFrame({'paragraphs':paragraphs}, index=[0])
print(df)
# Each (?P<NAME>...) group becomes a column; .*? matches lazily up to the next header.
df1 = df['paragraphs'].str.extract(
    r'(?:DATE: )(?P<DATE>.*?)\s'
    r'(?:INDICATION:)(?P<INDICATION>.*?)'
    r'(?:COMPARISON:)(?P<COMPARISON>.*?)'
    r'(?:MEDICATIONS:)(?P<MEDICATIONS>.*?)'
    r'(?:IMPRESSION:)(?P<IMPRESSION>.*?)$', flags=re.M, expand=True)
Output from df1:
index | DATE | INDICATION | COMPARISON | MEDICATIONS | IMPRESSION |
---|---|---|---|---|---|
0 | 8/26/2005 | 78-year-old man with a history of shunted normal pressure hydrocephalus who more recently has been managed for a right-sided subdural hematoma. This was initially managed conservatively in the acute phase but progressed to an enlarging chronic subdural hematoma that was ultimately treated with burr hole drainage. Middle meningeal artery embolization was recommended to minimize the risk of future recurrence. | CT brain 8/24/2003 and CT brain | 1. Heparin 3500 units IV. 2. Nitroglycerin 200 mcg IA. 3. Verapamil 5 mg IA. 4. See anesthesia records for additional medications administered. CONTRAST: 150 mL Visipaque RADIATION DOSE: 16.3 min; 587.7 mGy | Successful particle and coil embolization of the parietal branch of the right middle meningeal artery for treatment of a right-sided chronic subdural hematoma. |
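The single-pattern approach above needs every header to be present, and in that exact order, or the row comes back as all NaN. If some reports skip a section, one alternative is to split on the header names instead. The following is only a sketch under assumptions: the helper name split_sections, the HEADERS list, and the shortened two-row sample (the second row is made up to show a missing section) are illustrative and not from the original question.

```python
import re
import pandas as pd

# Section headers to split on (assumed from the question's list).
HEADERS = ["DATE", "INDICATION", "COMPARISON", "MEDICATIONS", "IMPRESSION"]

# The capturing group makes re.split keep the matched header names in its output.
SPLIT_RE = re.compile(r"\b(" + "|".join(HEADERS) + r"):\s*")

def split_sections(text):
    """Return {header: body} for every known header found in text."""
    parts = SPLIT_RE.split(text)
    # parts looks like [preamble, header1, body1, header2, body2, ...]
    return {head: body.strip() for head, body in zip(parts[1::2], parts[2::2])}

# Shortened, illustrative sample; the second row deliberately lacks some sections.
df = pd.DataFrame({
    "ID": ["id1", "id2"],
    "text": [
        "DIAGNOSTIC CEREBRAL ANGIOGRAM DATE: 8/26/2005 INDICATION: Subdural hematoma. "
        "COMPARISON: CT brain MEDICATIONS: 1. Heparin 3500 units IV. "
        "IMPRESSION: Successful embolization.",
        "DATE: 1/2/2006 IMPRESSION: No change.",
    ],
})

# Build one column per header; sections missing from a row become NaN.
sections = pd.DataFrame(df["text"].map(split_sections).tolist(), index=df.index)
sections = sections.reindex(columns=HEADERS)
result = pd.concat([df[["ID"]], sections], axis=1)
print(result)
```

Headers that are not in HEADERS (for example CONTRAST: or RADIATION DOSE:) stay inside whichever section precedes them, which matches the desired output in the question. If a header appears twice in one row, the later occurrence overwrites the earlier one in this sketch.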