I have a dataframe as follows:
import pandas as pd
df = pd.DataFrame({'text':['she is a. good 15. year old girl. she goes to school on time.', 'she is not an A. level student. This needs to be discussed.']})
To split on the dot (.) and explode the result, I have done the following:
df = df.assign(text=df['text'].str.split('.')).explode('text')
However, I do not want to split on every dot. I would like to split on a dot unless it follows a number (e.g. 22., 3.4) or a single character (e.g. a., a.b., b.d).
desired_output:
text
'she is a. good 15. year old girl'
'she goes to school on time'
'she is not an A. level student'
'This needs to be discussed.'
I also tried the following pattern, hoping to ignore single characters and numbers, but it removes the last letter from the final word of each sentence.
df.assign(text=df['text'].str.split(r'(?:(?<!\.|\s)[a-z]\.|(?<!\.|\s)[A-Z]\.) ')).explode('text')
I edited the pattern so that it now matches every dot that comes after a number or a single letter: r'(?:(?<=.|\s)[[a-zA-Z]].|(?<=.|\s)\d ) '. So I guess I only need to figure out how to split on a dot except where this last pattern matches.
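One possible way to express "split on a dot, except after a number or a single letter" is with negative lookbehinds placed before the dot. The sketch below uses only the standard `re` module. The helper name `split_on_sentence_dots` is hypothetical, and it assumes a "single character" means one ASCII letter preceded by whitespace, another dot, or the start of the string.

```python
import re

# Split on a dot unless it directly follows:
#   - a lone letter preceded by whitespace or a dot (" a.", ".b."),
#   - a lone letter at the very start of the string,
#   - a digit ("15.", "3.4").
# Each lookbehind has a fixed width, which Python's re module requires.
SENTENCE_DOT = re.compile(r'(?<![\s.][A-Za-z])(?<!^[A-Za-z])(?<!\d)\.')

def split_on_sentence_dots(text):
    """Return the non-empty, stripped pieces between sentence-ending dots."""
    pieces = (piece.strip() for piece in SENTENCE_DOT.split(text))
    return [piece for piece in pieces if piece]

rows = [
    'she is a. good 15. year old girl. she goes to school on time.',
    'she is not an A. level student. This needs to be discussed.',
]
for row in rows:
    print(split_on_sentence_dots(row))
```

Note that under these assumptions a sentence that genuinely ends in a number (for example "born in 2015.") is not split either, which matches the rule as stated in the question but may not always be what is wanted.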
CodePudding user response:
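#!/usr/bin/python3
# -*- coding: utf-8 -*-
import re

text_in = 'she is a. good 15. year old girl. she goes to school on time. she is not an A. level student. This needs to be discussed.'

# Split on every dot first, then glue back the pieces that should not have been split.
sentences = re.split(r'\.', text_in)
output = []
text = ''
for v in sentences:
    text = text + v
    # A piece ending in a single letter or a number (e.g. " a", " 15") is not a sentence end,
    # so put the dot back and keep accumulating.
    if re.search(r'\s([a-z]|[0-9]+)$', v, re.IGNORECASE):
        text = text + '.'
    else:
        # Otherwise the accumulated text is a complete sentence.
        text = text.strip()
        if text != '':
            output.append(text)
        text = ''
print(output)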
Output:
['she is a. good 15. year old girl', 'she goes to school on time', 'she is not an A. level student', 'This needs to be discussed']
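To get back to the DataFrame from the question, the same split-then-rejoin idea could be wrapped in a function and applied per row before exploding. This is only a sketch: the helper name `merge_split` is hypothetical, and the last piece of each row is handled separately so that a row without a trailing dot does not gain one.

```python
import re
import pandas as pd

# A piece ending in whitespace plus a lone letter or a number is not a sentence end.
NOT_SENTENCE_END = re.compile(r'\s([a-z]|[0-9]+)$', re.IGNORECASE)

def merge_split(text):
    """Split on every dot, then re-join pieces that end in a lone letter or number."""
    sentences, current = [], ''
    pieces = text.split('.')
    for i, piece in enumerate(pieces):
        current += piece
        is_last = i == len(pieces) - 1
        if not is_last and NOT_SENTENCE_END.search(piece):
            current += '.'  # restore the dot and keep accumulating
            continue
        if current.strip():
            sentences.append(current.strip())
        current = ''
    return sentences

df = pd.DataFrame({'text': [
    'she is a. good 15. year old girl. she goes to school on time.',
    'she is not an A. level student. This needs to be discussed.',
]})

# Each row becomes a list of sentences; explode gives one sentence per row.
result = df.assign(text=df['text'].apply(merge_split)).explode('text').reset_index(drop=True)
print(result)
```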