How to split a column of string in a dataframe into multiple rows of subtexts?

Time:10-30

I have a pandas dataframe with two columns. One column contains spam email strings that exceed the maximum sequence length of the transformer models I plan to use on them, and the other contains the label for each string. I would like to split the long strings into multiple subtexts in separate rows while retaining their label correspondence.

Input Dataframe:

Text                                              Label
"This is a very long spam email"                  1
"This is a very long normal email"                0

Desired output:

Maximum Sequence Length = 4

Text                                              Label
"This is a very"                                  1
"long spam email"                                 1
"This is a very"                                  0
"long normal email"                               0

How could I do this?

CodePudding user response:

You can use the .split() method to convert the string into a list of words, then slice the list with [ ] and use the .join() method to turn the first four words back into a string. The code below splits the string only once; for longer strings, you can add a for loop:

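def convert(string):
    # Split on spaces to get a list of words.
    nlist = string.split(' ')
    # Take the first four words, then everything after them.
    nlist1 = nlist[:4]
    nlist2 = nlist[4:]
    # Join each list of words back into a single string.
    nstring1 = " ".join(nlist1)
    nstring2 = " ".join(nlist2)
    return nstring1, nstring2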
    
x = "This is a very long spam email"
print(convert(x))
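
The for loop mentioned above could look something like the following. This is a minimal sketch rather than part of the original answer: the name split_into_chunks and the max_words parameter are made up for illustration, and it counts whitespace-separated words, not model tokens.

```python
def split_into_chunks(text, max_words=4):
    """Split text into consecutive pieces of at most max_words words each."""
    words = text.split()
    # Step through the word list max_words at a time and re-join each slice.
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

x = "This is a very long spam email"
print(split_into_chunks(x))
# ['This is a very', 'long spam email']
```

Unlike convert, this returns a list with as many pieces as the string needs, so a 13-word string yields four pieces instead of two.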

CodePudding user response:

Data:

>>> import pandas as pd
>>> df = pd.DataFrame({"Text": ["This is a very very very very long spam spam span email email", "This is a a a a very long long long normal email"],
...                    "Label": [1, 0]})
>>> print(df.to_string())
                                                            Text  Label
0  This is a very very very very long spam spam span email email      1
1               This is a a a a very long long long normal email      0

Solution:

>>> import functools
>>> import operator
# Break the Text column into sublists, each containing at most 4 words.
>>> t = df.apply(lambda x: x.Text.split(), axis=1).apply(lambda x: [x[i * 4:(i + 1) * 4] for i in range((len(x) + 4 - 1) // 4)])
>>> df['t'] = t
# Repeat each label once per sublist so the labels stay aligned with the text.
>>> l = df.apply(lambda x: [x.Label] * len(x.t), axis=1)
# Flatten the lists and build a new dataframe from them.
>>> df = pd.DataFrame({"Text": functools.reduce(operator.iconcat, df.t.to_list(), []),
...                    "Label": functools.reduce(operator.iconcat, l.to_list(), [])})
>>> df['Text'] = df['Text'].apply(' '.join)
>>> df
    Text                    Label
0   This is a very          1
1   very very very long     1
2   spam spam span email    1
3   email                   1
4   This is a a             0
5   a a very long           0
6   long long normal email  0
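
A shorter route to the same result may be DataFrame.explode, which turns each element of a list-valued column into its own row and repeats the other columns alongside it. The sketch below is an alternative to the reduce-based flattening above, not the answer's own method. It assumes pandas 0.25 or later, and chunk_words and max_words are hypothetical names introduced here.

```python
import pandas as pd

def chunk_words(text, max_words=4):
    """Return the text as a list of pieces of at most max_words words each."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

df = pd.DataFrame({
    "Text": ["This is a very very very very long spam spam span email email",
             "This is a a a a very long long long normal email"],
    "Label": [1, 0],
})

# Replace each string with its list of pieces, then give every piece its own
# row; explode copies the Label value onto each new row.
out = (df.assign(Text=df["Text"].apply(chunk_words))
         .explode("Text")
         .reset_index(drop=True))
print(out)
```

One caveat: an empty string becomes an empty list, which explode turns into a row with NaN in Text, so such rows may need to be dropped beforehand.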