How to remove single alphabet and dot from sentence using regex

Time:10-20

 sentence = "Diagnosis: B. 
 Prostate, Left Lateral Mid, Core Biopsy: - Prostatic adenocarcinoma, Gleason's score 3 3=6/10 - Single focus of carcinoma measures 0.5 mm (involves 1 of 1 core fragment and up to 5% of individual core volume) - Prostatic intraepithelial neoplasia (PIN high grade C. 
 Prostate, Left Lateral Apex, Core Biopsy: - Prostatic "

Required solution: Diagnosis: Prostate, Left Lateral Mid, Core Biopsy: - Prostatic adenocarcinoma, Gleason's score 3 3=6/10 - Single focus of carcinoma measures 0.5 mm (involves 1 of 1 core fragment and up to 5% of individual core volume) - Prostatic intraepithelial neoplasia (PIN high grade 
     Prostate, Left Lateral Apex, Core Biopsy: - Prostatic

Is there a way to find a single letter followed by a dot (e.g. "B.") in a sentence and remove it? I'm confused by the regex. I tried a pattern like [^A-Za-z]{0,}c[,.;\s]{0,}, but it doesn't work.

CodePudding user response:

You are on the right track with a regex. I think you can use re.sub() with a simple pattern that expects a whitespace character, then a single alphabetic character, then a period.

import re

text = "Diagnosis: B. Prostate, Left Lateral Mid, Core Biopsy: - Prostatic adenocarcinoma, Gleason's score 3 3=6/10 - Single focus of carcinoma measures 0.5 mm (involves 1 of 1 core fragment and up to 5% of individual core volume) - Prostatic intraepithelial neoplasia (PIN high grade C. Prostate, Left Lateral Apex, Core Biopsy: - Prostatic"
pattern = r"\s[a-zA-Z]\."

print(re.sub(pattern, "", text))

That should give you:

Diagnosis: Prostate, Left Lateral Mid, Core Biopsy: - Prostatic adenocarcinoma, Gleason's score 3 3=6/10 - Single focus of carcinoma measures 0.5 mm (involves 1 of 1 core fragment and up to 5% of individual core volume) - Prostatic intraepithelial neoplasia (PIN high grade Prostate, Left Lateral Apex, Core Biopsy: - Prostatic

Note that both "B." and "C." were removed. I hope that is what you are looking for. If not, you can pass count=1 to re.sub() to remove only the "B."
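For reference, here is a minimal sketch of the count=1 variant. The shortened text string is only an illustrative stand-in for the full sentence; the pattern is the same one used above.

```python
import re

# Shortened, illustrative version of the sentence from the question
text = "Diagnosis: B. Prostate, Left Lateral Mid (PIN high grade C. Prostate, Left Lateral Apex"
pattern = r"\s[a-zA-Z]\."

# count=1 makes re.sub() stop after the first match, so only " B." is removed
first_only = re.sub(pattern, "", text, count=1)
print(first_only)
# Diagnosis: Prostate, Left Lateral Mid (PIN high grade C. Prostate, Left Lateral Apex
```

Because the pattern requires a whitespace character before the letter, a label at the very start of the string would not be matched by it.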
