How to split a pandas Series into two-word chunks in Python

If I have a pandas DataFrame with a column describing an issue, how can I split each value into chunks of two words at a time?

For example:

Subject Number	Issue
30493	"This subject was unable to keep his head straight in the MRI"
43253	"This subject fell asleep thus ended up with poor data"

and I want it to look like this:

Subject Number	Issue
30493	"This subject", "was unable", "to keep", "his head", "straight in", "the MRI"
43253	"This subject", "fell asleep", "thus ended", "up with", "poor data"

The pandas Series here is df["Issue"].

CodePudding user response：

Here is one possible way to do it.

import re
df['Issue'] = df['Issue'].map(lambda string: list(filter(None, re.split(r"(\w+\s\w+)\s", string))))
print(df)

   Subject Number                                              Issue
0          30493  [This subject, was unable, to keep, his head, ...
1          43253  [This subject, fell asleep, thus ended, up wit...
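To see why the filter(None, ...) step is there, here is a minimal sketch on a single string, using only the standard library and the first sentence from the question. Because the pattern contains a capturing group, re.split keeps the captured pairs in its result and puts empty strings between adjacent matches. Note that \w+ only matches word characters, so text with punctuation inside words may split differently.

```python
import re

# Sample sentence taken from the question.
text = "This subject was unable to keep his head straight in the MRI"

# With a capturing group, re.split returns the captured text as well,
# with empty strings where two matches sit next to each other.
raw_parts = re.split(r"(\w+\s\w+)\s", text)

# filter(None, ...) drops those empty strings and keeps the pairs.
pairs = list(filter(None, raw_parts))

print(raw_parts)
print(pairs)
# ['This subject', 'was unable', 'to keep', 'his head', 'straight in', 'the MRI']
```

The last pair, "the MRI", is not followed by whitespace, so it is never matched by the pattern; it survives as the leftover tail of the split.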

CodePudding user response：

Using a single regex:

df['Issue'] = df['Issue'].str.findall(r'((?:\S+\s*?){2})\s*')

Output:

   Subject Number                                                                Issue
0           30493  [This subject, was unable, to keep, his head, straight in, the MRI]
1           43253          [This subject, fell asleep, thus ended, up with, poor data]
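For anyone unsure how that pattern works, here is a rough breakdown using the standard re module on one string from the question. This is only a sketch: re.VERBOSE is used so the pieces can be commented, and the variable names are my own.

```python
import re

# The same pattern as above, spread out with re.VERBOSE so each part can be commented.
pattern = re.compile(
    r"""
    (                 # group 1: one two-word chunk, this is what findall returns
      (?:\S+\s*?){2}  # a run of non-whitespace plus lazy whitespace, repeated twice
    )
    \s*               # whitespace after the chunk, kept outside the group
    """,
    re.VERBOSE,
)

text = "This subject fell asleep thus ended up with poor data"
chunks = pattern.findall(text)
print(chunks)
# ['This subject', 'fell asleep', 'thus ended', 'up with', 'poor data']
```

Because \S+ matches any non-whitespace, this version also copes with punctuation attached to words.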

CodePudding user response：

Write a function split_str_to_2grams to do it for a single string, then do df['Issue'].apply(split_str_to_2grams). Untested but try this:

def split_str_to_2grams(string: str) -> list[str]:
    words = string.split()
    # Step through the words two at a time; an odd final word forms its own chunk.
    return [" ".join(words[i:i + 2]) for i in range(0, len(words), 2)]

CodePudding user response：

This should do the job:

span = 2  # number of words per chunk

df['Issue'].apply(lambda x: [" ".join(x.split()[i:i + span]) for i in range(0, len(x.split()), span)])

Output:

0    [This subject, was unable, to keep, his head, ...
1    [This subject, fell asleep, thus ended, up wit...

Note: I am assuming that by the output you mean a list of two-word chunks, not the exact format with quotation marks and commas.

CodePudding user response：

Not the most elegant solution, but it will get your desired string:

df['Issue'] = df['Issue'].str.split().apply(lambda x: ', '.join('"' + i + ' ' + j + '"' for i, j in zip(x[::2], x[1::2])))

Output:

   Subject Number                                              Issue
0           30493  "This subject", "was unable", "to keep", "his ...
1           43253  "This subject", "fell asleep", "thus ended", "...

Or if you actually want a list:

df.Issue.str.split().apply(lambda x: [' '.join([i,j]) for i, j in zip(x[::2], x[1::2])])

...

0    [This subject, was unable, to keep, his head, ...
1    [This subject, fell asleep, thus ended, up wit...
Name: Issue, dtype: object
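One thing to watch with the zip approach: zip stops at the shorter input, so a description with an odd number of words loses its last word. A small sketch of the difference, assuming plain strings; the function names and the five-word sample sentence are made up for illustration.

```python
def pair_with_zip(text: str) -> list[str]:
    """Pair words with zip; a trailing odd word is silently dropped."""
    words = text.split()
    return [" ".join(pair) for pair in zip(words[::2], words[1::2])]


def pair_with_slices(text: str) -> list[str]:
    """Pair words by slicing; a trailing odd word becomes its own chunk."""
    words = text.split()
    return [" ".join(words[i:i + 2]) for i in range(0, len(words), 2)]


sample = "Subject moved during the scan"  # hypothetical five-word description

print(pair_with_zip(sample))     # ['Subject moved', 'during the']
print(pair_with_slices(sample))  # ['Subject moved', 'during the', 'scan']
```

If keeping the last word matters, something like df['Issue'].apply(pair_with_slices) would be the safer choice; with an even word count both give the same result.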