I have a dataframe as below
import pandas as pd
df = pd.DataFrame({"order_id":[1,3,7],"order_date":["20/5/2018","22/5/2018","23/5/2018"], "package":["p1","p4","p5,p6"],"package_code":["As he crossed toward the pharmacy at the","he was dancing in the","they were playing football"]})
df
order_id order_date package package_code
0 1 20/5/2018 p1 As he crossed toward the pharmacy at the
1 3 22/5/2018 p4 he was dancing in the
2 7 23/5/2018 p5,p6 they were playing football
I have written a function as below which splits a string into group of 5 words
s = 'As he crossed toward the pharmacy at the corner '
n = 5
def group_words(s, n):
words = s.split()
for i in range(0, len(words), n):
yield ' '.join(words[i:i n])
list(group_words(s,n))
['As he crossed toward the', 'pharmacy at the corner']
I would like to take the dataframe and split the column 'package_code' into multiple rows of 5 words each while keeping rest of the column same (per row).
How could I do that
For example the first row should be:
order_id order_date package package_code
0 1 20/5/2018 p1 As he crossed toward the
0 1 20/5/2018 p1 pharmacy at the
I tried below but it doesn't provide what I am looking for
(df.set_index(['order_id', 'order_date'])
.apply(lambda x: group_words(x, 3))
.reset_index())
index 0
0 package <generator object group_words at 0x7fa263e98570>
1 package_code <generator object group_words at 0x7fa263e98678>
CodePudding user response:
You can use a list to unpack the generator and use explode with series.map
:
col = 'package_code'
s = df['package_code'].map(lambda x: list(group_words(x,n))).explode()
out = s.to_frame().join(df.drop(col,1)).loc[:,[*df]]
print(out)
order_id order_date package package_code
0 1 20/5/2018 p1 As he crossed toward the
0 1 20/5/2018 p1 pharmacy at the
1 3 22/5/2018 p4 he was dancing in the
2 7 23/5/2018 p5,p6 they were playing football
CodePudding user response:
You can use extractall
and a small regex ((?:\w \s ?){1,5}
= 5 words), thus no need for an external function:
(df.drop('package_code', axis=1) # remove existing column as we replace after
.join(df['package_code'].str.extractall('(?P<package_code>(?:\w \s ?){1,5})').droplevel(1))
)
output:
order_id order_date package package_code
0 1 20/5/2018 p1 As he crossed toward the
0 1 20/5/2018 p1 pharmacy at
1 3 22/5/2018 p4 he was dancing in
2 7 23/5/2018 p5,p6 they were playing