split row into multiple while keeping other columns same pandas python-CodePudding

I have a dataframe as below

import pandas as pd
df = pd.DataFrame({"order_id":[1,3,7],"order_date":["20/5/2018","22/5/2018","23/5/2018"], "package":["p1","p4","p5,p6"],"package_code":["As he crossed toward the pharmacy at the","he was dancing in the","they were playing football"]})
df

    order_id    order_date  package package_code
0   1   20/5/2018   p1  As he crossed toward the pharmacy at the
1   3   22/5/2018   p4  he was dancing in the
2   7   23/5/2018   p5,p6   they were playing football

I have written a function as below which splits a string into group of 5 words

s = 'As he crossed toward the pharmacy at the corner '
n = 5

def group_words(s, n):
    words = s.split()
    for i in range(0, len(words), n):
        yield ' '.join(words[i:i n])

list(group_words(s,n))


['As he crossed toward the', 'pharmacy at the corner']

I would like to take the dataframe and split the column 'package_code' into multiple rows of 5 words each while keeping rest of the column same (per row).

How could I do that

For example the first row should be:

order_id    order_date  package package_code
0   1   20/5/2018   p1  As he crossed toward the 
0   1   20/5/2018   p1  pharmacy at the

I tried below but it doesn't provide what I am looking for

(df.set_index(['order_id', 'order_date'])
   .apply(lambda x: group_words(x, 3))
   .reset_index()) 


index   0
0   package <generator object group_words at 0x7fa263e98570>
1   package_code    <generator object group_words at 0x7fa263e98678>

CodePudding user response：

You can use a list to unpack the generator and use explode with series.map:

col = 'package_code'
s = df['package_code'].map(lambda x: list(group_words(x,n))).explode()
out = s.to_frame().join(df.drop(col,1)).loc[:,[*df]]

print(out)

   order_id order_date package                package_code
0         1  20/5/2018      p1    As he crossed toward the
0         1  20/5/2018      p1             pharmacy at the
1         3  22/5/2018      p4       he was dancing in the
2         7  23/5/2018   p5,p6  they were playing football

CodePudding user response：

You can use extractall and a small regex ((?:\w \s ?){1,5} = 5 words), thus no need for an external function:

(df.drop('package_code', axis=1) # remove existing column as we replace after
   .join(df['package_code'].str.extractall('(?P<package_code>(?:\w \s ?){1,5})').droplevel(1))
)

output:

   order_id order_date package               package_code
0         1  20/5/2018      p1  As he crossed toward the 
0         1  20/5/2018      p1               pharmacy at 
1         3  22/5/2018      p4         he was dancing in 
2         7  23/5/2018   p5,p6         they were playing