I have a pandas DataFrame with two columns. One column contains spam email strings that exceed the maximum sequence length of the transformer models I plan to use on them, and the other contains the label corresponding to each string. I would like to split the long strings into multiple subtexts in separate rows while retaining their label correspondence.
Input Dataframe:
Text Label
"This is a very long spam email" 1
"This is a very long normal email" 0
Desired output:
Maximum Sequence Length = 4 (counted in words)
Text Label
"This is a very" 1
"long spam email" 1
"This is a very" 0
"long normal email" 0
How could I do this?
CodePudding user response:
You can use the .split() method to convert the string into a list of words, slice that list with [:4] and [4:], and then use the .join() method to turn each slice back into a string. Here is my code. It only splits a string into two parts, so for longer strings you would need to add a for loop:
def convert(string):
    nlist = string.split(' ')
    nlist1 = nlist[:4]
    nlist2 = nlist[4:]
    nstring1 = " ".join(nlist1)
    nstring2 = " ".join(nlist2)
    return nstring1, nstring2

x = "This is a very long spam email"
print(convert(x))
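For strings longer than eight words, the loop mentioned above could look something like the sketch below. The name split_into_chunks and the max_words parameter are my own, not part of the answer's code, and "length" is assumed to mean whitespace-separated words, as in the question's example.

```python
def split_into_chunks(text, max_words=4):
    """Split text into pieces of at most max_words whitespace-separated words."""
    words = text.split()
    # Step through the word list in strides of max_words and rejoin each slice.
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


chunks = split_into_chunks("This is a very long spam email")
print(chunks)  # ['This is a very', 'long spam email']
```

Unlike convert(), this returns a list with as many pieces as the string needs, and the last piece may be shorter than max_words.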
CodePudding user response:
Data:
>>> import functools
>>> import operator
>>> import pandas as pd
>>> df = pd.DataFrame({"Text" : ["This is a very very very very long spam spam span email email", "This is a a a a very long long long normal email"],
...                    "Label" : [1,0]})
>>> print(df.to_string())
                                                            Text  Label
0  This is a very very very very long spam spam span email email      1
1               This is a a a a very long long long normal email      0
Solution:
# Break the Text column into sublists, each containing at most 4 words.
>>> t = df.apply(lambda x: x.Text.split(), axis=1).apply(lambda x: [x[i * 4:(i + 1) * 4] for i in range((len(x) + 4 - 1) // 4)])
>>> df['t'] = t
# Repeat each label once per sublist so the labels stay aligned with their chunks.
>>> l = df.apply(lambda x: [x.Label] * len(x.t), axis=1)
# Flatten both lists of lists and build a new DataFrame from them.
>>> df = pd.DataFrame({"Text" : functools.reduce(operator.iconcat, df.t.to_list(), []),
...                    "Label" : functools.reduce(operator.iconcat, l.to_list(), [])})
>>> df['Text'] = df['Text'].apply(' '.join)
>>> df
                     Text  Label
0          This is a very      1
1     very very very long      1
2    spam spam span email      1
3                   email      1
4             This is a a      0
5           a a very long      0
6  long long normal email      0
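As an alternative to flattening the lists by hand, the same result can probably be reached with DataFrame.explode (available since pandas 0.25), which repeats the other columns for every element of a list-valued column. This is only a sketch: split_into_chunks and max_words are hypothetical names, the data is the example from the question, and the limit is counted in whitespace-separated words, which is only a rough proxy for a transformer's token count.

```python
import pandas as pd


def split_into_chunks(text, max_words=4):
    """Split text into pieces of at most max_words whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


df = pd.DataFrame({
    "Text": ["This is a very long spam email", "This is a very long normal email"],
    "Label": [1, 0],
})

out = (
    # Replace each string with its list of chunks ...
    df.assign(Text=df["Text"].apply(split_into_chunks))
    # ... then give every chunk its own row; Label is repeated automatically.
    .explode("Text")
    .reset_index(drop=True)
)
print(out)
```

Here out has four rows, with Text values "This is a very", "long spam email", "This is a very", "long normal email" and Label values 1, 1, 0, 0. Note that a row whose text is empty would produce an empty list, which explode turns into a NaN row, so such rows may need to be dropped afterwards.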