pandas: split on dot unless there is a number or a character before dot

Time:12-08

I have a dataframe as follows:

import pandas as pd
df = pd.DataFrame({'text':['she is a. good 15. year old girl. she goes to school on time.', 'she is not an A. level student. This needs to be discussed.']})

To split on the dot (.) and explode the result, I did the following:

df = df.assign(text=df['text'].str.split('.')).explode('text')

However, I do not want to split after every dot. I would like to split on a dot unless the dot is next to a number (e.g. 22., 3.4) or next to a single character (e.g. a., a.b., b.d).

desired_output:

   text
'she is a. good 15. year old girl'
'she goes to school on time'
'she is not an A. level student'
'This needs to be discussed.'

I also tried the following pattern, hoping to ignore the single characters and numbers, but it removes the last letter from the final word of a sentence.

df.assign(text=df['text'].str.split(r'(?:(?<!\.|\s)[a-z]\.|(?<!\.|\s)[A-Z]\.) ')).explode('text')
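The missing letter can be reproduced outside pandas. The following is a minimal sketch, assuming plain re.split behaves the same way as the str.split call above for this pattern: the letter before the dot is part of the match, so it is consumed as part of the delimiter. The variable names are only illustrative.

```python
import re

# The pattern from the attempt above: a letter (not preceded by a dot or
# whitespace), then a dot, then a space. The letter is inside the match.
pattern = r'(?:(?<!\.|\s)[a-z]\.|(?<!\.|\s)[A-Z]\.) '

sample = 'she is a. good 15. year old girl. she goes to school on time.'

# re.split drops everything the pattern matched, including the letter.
parts = re.split(pattern, sample)
print(parts)
# ['she is a. good 15. year old gir', 'she goes to school on time.']
```

Only "l. " after "girl" matches, so the split happens in the right place but takes the "l" with it. Moving the letter check into a lookbehind would keep it out of the delimiter.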

I edited the pattern so that it now matches every dot that comes after a number or a single letter: r'(?:(?<=.|\s)[[a-zA-Z]].|(?<=.|\s)\d ) ' So I guess I only need to figure out how to split on a dot except where this last pattern matches.

Answer:

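#!/usr/bin/python3
# -*- coding: utf-8 -*-

import re

raw_text = 'she is a. good 15. year old girl. she goes to school on time. she is not an A. level student. This needs to be discussed.'

# Split on every dot first, then glue back the pieces that should not have been split.
sentences = re.split(r'\.', raw_text)

output = []
text = ''
for v in sentences:
    text = text + v

    # The piece ends in a lone letter or a number, so the dot after it was not
    # a sentence boundary: restore the dot and keep accumulating.
    if re.search(r'\s([a-z]|[0-9]+)$', v, re.IGNORECASE):
        text = text + "."
    else:
        # A real sentence boundary: store the accumulated sentence and reset.
        text = text.strip()
        if text != '':
            output.append(text)
        text = ''

print(output)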

Output:

['she is a. good 15. year old girl', 'she goes to school on time', 'she is not an A. level student', 'This needs to be discussed']
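Since the question asks about a DataFrame, here is an alternative sketch that does the split with a single regex instead of the loop above. It is not the answer's method: the names SENTENCE_DOT, split_sentences and explode_sentences are hypothetical, and the rule it assumes is "do not split on a dot that follows a digit, or a single letter that itself follows whitespace or a dot".

```python
import re

# Hypothetical pattern name. Two fixed-width negative lookbehinds guard the dot:
#   (?<![\s.][A-Za-z])  the dot does not follow a lone letter (e.g. " a." or ".b.")
#   (?<!\d)             the dot does not follow a digit (e.g. "15." or "3.4")
# Because the checks are lookbehinds, no letter is consumed by the split.
SENTENCE_DOT = re.compile(r'(?<![\s.][A-Za-z])(?<!\d)\.\s*')


def split_sentences(text):
    """Split text on sentence-ending dots and drop empty pieces."""
    parts = SENTENCE_DOT.split(text)
    return [p.strip() for p in parts if p.strip()]


def explode_sentences(df, column='text'):
    """Apply split_sentences to one column of a DataFrame, one row per sentence."""
    return df.assign(**{column: df[column].apply(split_sentences)}).explode(column)


print(split_sentences('she is a. good 15. year old girl. she goes to school on time.'))
# ['she is a. good 15. year old girl', 'she goes to school on time']
print(split_sentences('she is not an A. level student. This needs to be discussed.'))
# ['she is not an A. level student', 'This needs to be discussed']
```

With the df from the question, explode_sentences(df) should give one sentence per row. Two limitations of this sketch: a single letter at the very start of a string is not protected, because the lookbehind needs a character before it, and a sentence that genuinely ends in a number is never split.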