Home > Mobile >  Merge consecutive rows in pandas and leave some rows untouched
Merge consecutive rows in pandas and leave some rows untouched

Time:11-26

I have tried looking at other merge rows in pandas solutions here and here and especially the solution here.

I want to combine the individual sentences scraped from bullet points into one paragraph between the empty blank rows. BUT keep the blank rows as they are. I want to keep the first sentence's paragraph id as the new id. (Paragraph ids are not necessarily continuous as there was some pre-cleaning done.)

df = pd.DataFrame(data = [[1, "A Heading"],
                          [2, "I need to be with above."],
                          [3, ""],
                          [8, "I stand alone."],
                          [9, ""]],columns=['para_id','text'])

df   
# The data
#    para_id                      text
# 0        1                 A Heading
# 1        2  I need to be with above.
# 2        3                          
# 3        8            I stand alone.
# 4        9                         

My required output is:

#    para_id                                    text
# 0        1  A Heading. I need to be with above
# 1        3  
# 2        8  I stand alone.
# 3        9   

With the help of the answers I am close and just need a little more guidance.

Attempted solution

df['t'] =  df['text'].str.len().values
s = df['t'].eq(0).cumsum()
out = df.groupby(s).agg({'para_id': 'first',
                         'text': lambda x: '. '.join(x),
                         't': 'last'})
out.drop('t', inplace=True, axis=1)
   
out
# Incorrect output
#    para_id                                 text
# t                                              
# 0        1  A Heading. I need to be with above.
# 1        3                     . I stand alone.
# 2        9

I almost have it working but my blank lines are getting glued to some text. And so I am missing my first blank row.

  1. Please can someone help me formulate s better to get the desired output.

  2. I also need the joining . to only occur if there is not a fullstop at the end of last sentence. (This is not critical. I guess I could first search for missing fullstops at end of non-empty text sentences, and then join sentences but I was wondering if there was a mutate if kind of structure in pandas.)

CodePudding user response:

You're almost there, just groupby on both the non-zero lengths and cumsum:

s = df['text'].eq('')

(df.groupby([s.cumsum(),s], sort=False)
   .agg({'para_id':'first', 'text': '. '.join})
   .reset_index(drop=True)
)

Output:

   para_id                                 text
0        1  A Heading. I need to be with above.
1        3                                     
2        8                       I stand alone.
3        9                                     
  • Related