I have tried looking at other merge rows in pandas solutions here and here and especially the solution here.
I want to combine the individual sentences scraped from bullet points into one paragraph between the empty blank rows. BUT keep the blank rows as they are. I want to keep the first sentence's paragraph id as the new id. (Paragraph ids are not necessarily continuous as there was some pre-cleaning done.)
df = pd.DataFrame(data = [[1, "A Heading"],
[2, "I need to be with above."],
[3, ""],
[8, "I stand alone."],
[9, ""]],columns=['para_id','text'])
df
# The data
# para_id text
# 0 1 A Heading
# 1 2 I need to be with above.
# 2 3
# 3 8 I stand alone.
# 4 9
My required output is:
# para_id text
# 0 1 A Heading. I need to be with above
# 1 3
# 2 8 I stand alone.
# 3 9
With the help of the answers I am close and just need a little more guidance.
Attempted solution
df['t'] = df['text'].str.len().values
s = df['t'].eq(0).cumsum()
out = df.groupby(s).agg({'para_id': 'first',
'text': lambda x: '. '.join(x),
't': 'last'})
out.drop('t', inplace=True, axis=1)
out
# Incorrect output
# para_id text
# t
# 0 1 A Heading. I need to be with above.
# 1 3 . I stand alone.
# 2 9
I almost have it working but my blank lines are getting glued to some text. And so I am missing my first blank row.
Please can someone help me formulate
s
better to get the desired output.I also need the joining . to only occur if there is not a fullstop at the end of last sentence. (This is not critical. I guess I could first search for missing fullstops at end of non-empty text sentences, and then join sentences but I was wondering if there was a mutate if kind of structure in pandas.)
CodePudding user response:
You're almost there, just groupby on both the non-zero lengths and cumsum:
s = df['text'].eq('')
(df.groupby([s.cumsum(),s], sort=False)
.agg({'para_id':'first', 'text': '. '.join})
.reset_index(drop=True)
)
Output:
para_id text
0 1 A Heading. I need to be with above.
1 3
2 8 I stand alone.
3 9