sentence = "Diagnosis: B.
Prostate, Left Lateral Mid, Core Biopsy: - Prostatic adenocarcinoma, Gleason's score 3 3=6/10 - Single focus of carcinoma measures 0.5 mm (involves 1 of 1 core fragment and up to 5% of individual core volume) - Prostatic intraepithelial neoplasia (PIN high grade C.
Prostate, Left Lateral Apex, Core Biopsy: - Prostatic "
Required solution: Diagnosis: Prostate, Left Lateral Mid, Core Biopsy: - Prostatic adenocarcinoma, Gleason's score 3 3=6/10 - Single focus of carcinoma measures 0.5 mm (involves 1 of 1 core fragment and up to 5% of individual core volume) - Prostatic intraepithelial neoplasia (PIN high grade
Prostate, Left Lateral Apex, Core Biopsy: - Prostatic
Is there a way to find a single letter followed by a dot (e.g. "B.") in the sentence and remove it? I keep getting confused by the regex. I tried a pattern like [^A-Za-z]{0,}c[,.;\s]{0,}, but it doesn't work.
CodePudding user response:
You are on the right track with regex. You can use re.sub() with a simple pattern that matches a whitespace character, then a single letter, then a period.
import re
text = "Diagnosis: B. Prostate, Left Lateral Mid, Core Biopsy: - Prostatic adenocarcinoma, Gleason's score 3 3=6/10 - Single focus of carcinoma measures 0.5 mm (involves 1 of 1 core fragment and up to 5% of individual core volume) - Prostatic intraepithelial neoplasia (PIN high grade C. Prostate, Left Lateral Apex, Core Biopsy: - Prostatic"
pattern = r"\s[a-zA-Z]\."
print(re.sub(pattern, "", text))
That should give you:
Diagnosis: Prostate, Left Lateral Mid, Core Biopsy: - Prostatic adenocarcinoma, Gleason's score 3 3=6/10 - Single focus of carcinoma measures 0.5 mm (involves 1 of 1 core fragment and up to 5% of individual core volume) - Prostatic intraepithelial neoplasia (PIN high grade Prostate, Left Lateral Apex, Core Biopsy: - Prostatic
Note that both "B." and "C." were removed. I hope that is what you are looking for. If not, you can pass count=1 to re.sub() to remove only the first match, "B."
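As a rough sketch of the count=1 option, plus one possible way to tighten the pattern: the sample text here is a shortened, made-up string in the same shape as the report, and the stricter pattern assumes the section labels are always a single uppercase letter followed by whitespace or the end of the string. That assumption may not hold for your data. It is worth considering because the original pattern would also strip " e." out of something like "e.g.".

```python
import re

# Shortened, hypothetical sample in the same shape as the report text.
text = "Diagnosis: B. Prostate, Left Lateral Mid (PIN high grade C. Prostate, Left Lateral Apex, e.g. core 1"

# count=1 stops after the first match, so only " B." is removed.
first_only = re.sub(r"\s[a-zA-Z]\.", "", text, count=1)
print(first_only)

# Stricter variant (an assumption about the data): uppercase labels only,
# and the period must be followed by whitespace or the end of the string.
strict = re.sub(r"\s[A-Z]\.(?=\s|$)", "", text)
print(strict)
```

With the stricter pattern, "e.g." is left alone because "e" is lowercase and the period is followed by "g" rather than whitespace.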