If I have a pandas DataFrame with a column describing an issue, how can I split each value into chunks of two words at a time?
e.g.
Subject Number | Issue |
---|---|
30493 | "This subject was unable to keep his head straight in the MRI" |
43253 | "This subject fell asleep thus ended up with poor data" |
and I want it to look like this:
Subject Number | Issue |
---|---|
30493 | "This subject", "was unable", "to keep", "his head", "straight in", "the MRI" |
43253 | "This subject", "fell asleep", "thus ended", "up with", "poor data" |
The pandas Series here would be df["Issue"].
CodePudding user response:
Here is one possible way to do it.
import re
df['Issue'] = df['Issue'].map(lambda string: list(filter(None, re.split(r"(\w+\s\w+)\s", string))))
print(df)
Subject Number Issue
0 30493 [This subject, was unable, to keep, his head, ...
1 43253 [This subject, fell asleep, thus ended, up wit...
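To see why the `filter(None, ...)` step is needed, here is a small sketch on a plain string (no DataFrame, the sample text is taken from the question). Because the pattern has a capture group, `re.split` returns the captured pairs along with the empty strings that sit between adjacent matches.

```python
import re

text = "This subject was unable to keep his head straight in the MRI"

# The capture group makes re.split keep each two-word match in the result;
# the pieces between adjacent matches are empty strings.
parts = re.split(r"(\w+\s\w+)\s", text)
print(parts)

# filter(None, ...) drops the empty strings and leaves only the pairs.
pairs = list(filter(None, parts))
print(pairs)
```

Note that `\w` does not match punctuation, so text containing commas or full stops may split differently from this clean example.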
CodePudding user response:
Using a single regex:
df['Issue'] = df['Issue'].str.findall(r'((?:\S+\s*?){2})\s*')
Output:
Subject Number Issue
0 30493 [This subject, was unable, to keep, his head, straight in, the MRI]
1 43253 [This subject, fell asleep, thus ended, up with, poor data]
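A quick way to check how this pattern behaves is to run it with plain `re.findall` on sample strings, which is roughly what `str.findall` does per row. This is only a sketch; the name `PAIR` and the second sample string are made up for illustration. In this sample the unpaired last word still comes back as its own item, since `\s*?` can match nothing and the two `\S+` repetitions can share one word.

```python
import re

# Same pattern as above, compiled once for reuse.
PAIR = re.compile(r"((?:\S+\s*?){2})\s*")

# Even number of words: every item is a two-word pair.
even = PAIR.findall("This subject fell asleep")
print(even)

# Odd number of words: the last word is returned without a partner.
odd = PAIR.findall("one two three")
print(odd)
```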
CodePudding user response:
Write a function split_str_to_2grams that does it for a single string, then call df['Issue'].apply(split_str_to_2grams). Untested, but try this:
def split_str_to_2grams(string: str) -> list[str]:
    words = string.split()
    # Step through the words two at a time; a trailing odd word is kept on its own.
    return [" ".join(words[i:i + 2]) for i in range(0, len(words), 2)]
CodePudding user response:
This should do the job:
span = 2  # number of words per chunk
df['Issue'].apply(lambda x: [" ".join(x.split()[i:i + span]) for i in range(0, len(x.split()), span)])
Output:
0 [This subject, was unable, to keep, his head, ...
1 [This subject, fell asleep, thus ended, up wit...
Note: I am assuming that by the output you mean a list of every two words, not the exact formatting with quotation marks and commas.
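If the lambda gets hard to read, the same idea can be pulled into a small named helper that splits the string only once. This is just a sketch; `chunk_words` is a hypothetical name, not something from pandas.

```python
def chunk_words(text: str, span: int = 2) -> list[str]:
    """Group the whitespace-separated words of `text` into chunks of `span` words."""
    words = text.split()  # split once instead of once per chunk
    # The last chunk is shorter when the word count is not a multiple of span.
    return [" ".join(words[i:i + span]) for i in range(0, len(words), span)]


print(chunk_words("This subject fell asleep thus ended up with poor data"))
print(chunk_words("one two three four five", span=3))
```

Assuming the DataFrame from the question, it could then be applied with `df['Issue'].apply(chunk_words)`.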
CodePudding user response:
Not the most elegant solution, but it will get your desired string:
df.Issue = df.Issue.str.split().apply(lambda x: ', '.join('"' + i + ' ' + j + '"' for i, j in zip(x[::2], x[1::2])))
Output:
Subject Number Issue
0 30493 "This subject", "was unable", "to keep", "his ...
1 43253 "This subject", "fell asleep", "thus ended", "...
Or if you actually want a list:
df.Issue.str.split().apply(lambda x: [' '.join([i,j]) for i, j in zip(x[::2], x[1::2])])
...
0 [This subject, was unable, to keep, his head, ...
1 [This subject, fell asleep, thus ended, up wit...
Name: Issue, dtype: object
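One thing to watch with the `zip` approach: `zip` stops at the shorter slice, so a description with an odd number of words loses its last word. A possible workaround, sketched here on a plain list with a made-up seven-word sentence, is `itertools.zip_longest` with an empty fill value.

```python
from itertools import zip_longest

words = "This subject fell asleep during the scan".split()  # 7 words

# zip stops at the shorter slice, so the unpaired last word is dropped.
paired = [" ".join(pair) for pair in zip(words[::2], words[1::2])]
print(paired)

# zip_longest pads the shorter slice with "", and strip() removes
# the trailing space that the padding leaves behind.
padded = [
    " ".join(pair).strip()
    for pair in zip_longest(words[::2], words[1::2], fillvalue="")
]
print(padded)
```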