I have the following dataframe:
import pandas as pd
df = pd.DataFrame({'Text': ['Hello, I have some text.</p> I would like to split it into sentences. </p> However, when it comes to splitting I want sentences to be indexed so that I can re-join them correctly.</p> I also need to convert lists in df which is tricky.',
'Hello, I have some text.</p> I would like to split it into sentences. </p> However, when it comes to splitting I want sentences to be indexed so that I can re-join them correctly.</p> I also need to convert lists in df which is tricky.',
'Hello, I have some text.</p> I would like to split it into sentences. </p> However, when it comes to splitting I want sentences to be indexed so that I can re-join them correctly.</p> I also need to convert lists in df which is tricky.']})
What I want to do is first split the text in each row into sentences, then convert the resulting lists into a dataframe with a column that keeps track of which text each sentence belongs to.
To split the text I do:
df.Text.str.split('</p>')
df.Text.str.split('</p>')[0]
As you can see, every element in the original dataframe contains the 4 sentences that I split. I now want to create a dataframe like the following one:
ID Text
1.1 Hello, I have some text.
1.2 I would like to split it into sentences.
1.3 However, when it comes to splitting I want sentences to be indexed so that I can re-join them correctly.
1.4 I also need to convert lists in df which is tricky.
2.1 Hello, I have some text.
2.2 I would like to split it into sentences.
2.3 However, when it comes to splitting I want sentences to be indexed so that I can re-join them correctly.
2.4 I also need to convert lists in df which is tricky.
3.1 Hello, I have some text.
3.2 I would like to split it into sentences.
3.3 However, when it comes to splitting I want sentences to be indexed so that I can re-join them correctly.
3.4 I also need to convert lists in df which is tricky.
Can anyone help me do it?
Thanks!
PS. In the real example, the sentences are not evenly split as above.
CodePudding user response:
You could use split to split the strings, then explode to create new rows, and finally rework the index:
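# split each text on the delimiter, then give every sentence its own row
df2 = (df.assign(Text=df['Text'].str.split('</p>'))
         .explode('Text')
       )
# text number (1-based), taken from the repeated original index
idx = df2.index.to_series().add(1).astype(str)
# sentence number within each text
idx2 = idx.groupby(idx).cumcount().add(1).astype(str)
df2.index = idx + '.' + idx2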
output:
Text
1.1 Hello, I have some text.
1.2 I would like to split it into sentences.
1.3 However, when it comes to splitting I want se...
1.4 I also need to convert lists in df which is t...
2.1 Hello, I have some text.
2.2 I would like to split it into sentences.
2.3 However, when it comes to splitting I want se...
2.4 I also need to convert lists in df which is t...
3.1 Hello, I have some text.
3.2 I would like to split it into sentences.
3.3 However, when it comes to splitting I want se...
3.4 I also need to convert lists in df which is t...
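If you would rather have the ID as a regular column (as in the table in the question) and later re-join the sentences, something along these lines might work. This is only a sketch: the sample data is made up to show uneven sentence counts, the helper column name row is my own choice, and joining with a single space is an assumption, since the original </p> delimiter is dropped by the split.

```python
import pandas as pd

# Hypothetical sample with a different number of sentences per text
df = pd.DataFrame({'Text': ['First.</p> Second. </p> Third.',
                            'Only one.',
                            'Alpha.</p> Beta.']})

# Split on the delimiter and give every sentence its own row
out = df.assign(Text=df['Text'].str.split('</p>')).explode('Text')

# Trim the whitespace that was left around the delimiter
out['Text'] = out['Text'].str.strip()

# Move the repeated original index into a helper column ("row" is an
# illustrative name) so the frame gets a unique index again
out = out.rename_axis('row').reset_index()

# Build "<text number>.<sentence number>", both 1-based
text_no = out['row'].add(1).astype(str)
sent_no = out.groupby('row').cumcount().add(1).astype(str)
out['ID'] = text_no + '.' + sent_no
out = out[['ID', 'Text']]

# Re-join: recover the text number from the ID and concatenate the
# sentences of each text in their original order
text_id = out['ID'].str.split('.').str[0]
rejoined = out.groupby(text_id, sort=False)['Text'].agg(' '.join)
```

Because the sentence counter comes from cumcount within each original row, it does not matter how many sentences each text contains. sort=False keeps the texts in order of appearance, which avoids '10' sorting before '2' when the IDs are strings.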