Split several sentences in pandas dataframe

I have a pandas dataframe with a column that looks like this.

sentences
['This is text.', 'This is another text.', 'This is also text.', 'Even more text.']
['This is the same in another row.', 'Another row another text.', 'Text in second row.', 'Last text in second row.']

Every row holds 10 sentences, each wrapped in ' ' or " " and separated by commas. The column type is "str", and I was not able to convert it to a list of strings.

I want to transform the values of this dataframe so that they look like this:

[['This', 'is', 'text'], ['This', 'is', 'another', 'text'], ['This', 'is', 'also', 'text'], ['Even', 'more', 'text']]

I tried something like this:

    new_splits = []
    for num in range(len(refs)):
      komma = refs[num].replace(" ", "\', \'")#regex=True)
      new_splits.append(komma)

and this:

    new_splits = []
    for num in range(len(refs)):
      splitted = refs[num].split("', '")
      new_splits.append(splitted)

Disclaimer: I need this for evaluating the BLEU score and haven't found a way to do this for this kind of dataset. Thanks in advance!

CodePudding user response：

You can use the apply method on your dataframe. If the sentences are stored one per row and every 10 consecutive rows belong together, you can group the rows in blocks of 10 like this.

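import pandas as pd

# df is the asker's dataframe; rows 0-9 get label 0, rows 10-19 get label 1, and so on
group_labels = [i // 10 for i in range(len(df))]

# collect the sentences of each block of 10 rows into one list
grouped = df.groupby(group_labels)
result = grouped['sentences'].apply(list)

print(result)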

CodePudding user response：

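You can use np.char.split in one line. It splits on whitespace only, so trailing punctuation stays attached to the last word:

import numpy as np
df['separated'] = np.char.split(df['sentences'].tolist()).tolist()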

@Kata if the sentences column really has type str, meaning each cell is a string instead of a list, e.g. "['This is text.', 'This is another text.', 'This is also text.', 'Even more text.']", then you need to convert the cells into lists first. One way is to use ast.literal_eval.

from ast import literal_eval
import numpy as np

# parse each string cell into a real Python list before splitting
df['sentences'] = df['sentences'].apply(literal_eval)
df['separated'] = np.char.split(df['sentences'].tolist()).tolist()
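
The target output in the question has no trailing periods, and rows may not all hold the same number of sentences. As a rough sketch under those assumptions, the parsing and splitting can also be done with a small per-cell function. The name to_token_lists and the \w+ tokenization rule are illustrative choices, not something taken from the question, and the plain list named cells stands in for the values of df['sentences'].

```python
import re
from ast import literal_eval


def to_token_lists(cell):
    """Turn one cell into a list of token lists, one per sentence."""
    # A string cell such as "['a b.', 'c d.']" is parsed into a real list first;
    # a cell that is already a list is used as-is.
    sentences = literal_eval(cell) if isinstance(cell, str) else cell
    # \w+ keeps runs of letters, digits and underscores, so punctuation is dropped.
    # This is an assumed rule: it would also split "don't" into "don" and "t".
    return [re.findall(r"\w+", sentence) for sentence in sentences]


# Stand-in for the column values (each element is a string that looks like a list)
cells = [
    "['This is text.', 'This is another text.', 'This is also text.', 'Even more text.']",
    "['This is the same in another row.', 'Another row another text.', "
    "'Text in second row.', 'Last text in second row.']",
]

separated = [to_token_lists(cell) for cell in cells]
print(separated[0])
# [['This', 'is', 'text'], ['This', 'is', 'another', 'text'], ['This', 'is', 'also', 'text'], ['Even', 'more', 'text']]
```

On the dataframe itself the same function could be applied per cell with df['separated'] = df['sentences'].apply(to_token_lists), which does not require every row to have the same number of sentences.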

NOTE on data: This is not a recommended way of storing data. If possible, fix the source the data comes from. Preferably each cell should hold a plain string rather than a list, or at least a real list, but not a string that represents a list.