I have this data frame
df=
   ID  join  Chapter  ParaIndex  text
    0   NaN        1          0  I am test
    1   NaN        2          1  it is easy
    2     1        3          2  but not so
    3     1        3          3  much easy
I want to get this
(merge the "text" values of rows that share the same value in column "join", then renumber "ID" and "ParaIndex"; leave everything else unchanged)
dfEdited=
   ID  join  Chapter  ParaIndex  text
    0   NaN        1          0  I am test
    1   NaN        2          1  it is easy
    2     1        3          2  but not so much easy
I used this command
dfedited=df.groupby(['join'])['text'].apply(lambda x: ' '.join(x.astype(str))).reset_index()
It only merges the rows that have a numeric value in column join and drops the rows where join is NaN,
so I changed it to this
dfedited=df.groupby(['join'],dropna=False)['text'].apply(lambda x: ' '.join(x.astype(str))).reset_index()
This merges rows by their join value, but it treats all rows where join is NaN as one group, so it joins them together as well. I do not want those rows merged. Any ideas? Many thanks.
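To make the problem concrete, here is a small sketch that rebuilds the example frame (assuming ID is an ordinary column) and runs the dropna=False version. With dropna=False, pandas uses NaN as a single group key, so the two unrelated NaN rows end up in one group.

```python
import numpy as np
import pandas as pd

# Example frame from the question; treating ID as a regular column is an assumption.
df = pd.DataFrame({
    "ID": [0, 1, 2, 3],
    "join": [np.nan, np.nan, 1, 1],
    "Chapter": [1, 2, 3, 3],
    "ParaIndex": [0, 1, 2, 3],
    "text": ["I am test", "it is easy", "but not so", "much easy"],
})

# dropna=False keeps the NaN rows, but as ONE group keyed by NaN.
merged = (
    df.groupby("join", dropna=False)["text"]
      .agg(" ".join)
      .reset_index()
)
print(merged)
```

The result has only two rows: one for join == 1 and one that glues both NaN rows together, which is the unwanted behaviour described above.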
I also used this
dfedited=df.groupby(['join', "ParaIndex", "Chapter"],dropna=False )['text'].apply(lambda x: ' '.join(x.astype(str) )).reset_index()
The result looks better because it keeps all the columns, but nothing is merged.
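A likely reason nothing is merged: ParaIndex is different on every row, so adding it to the group keys puts each row in a group of its own. A quick check, again on a rebuilt copy of the example frame (ID as a regular column is an assumption):

```python
import numpy as np
import pandas as pd

# Example frame from the question; treating ID as a regular column is an assumption.
df = pd.DataFrame({
    "ID": [0, 1, 2, 3],
    "join": [np.nan, np.nan, 1, 1],
    "Chapter": [1, 2, 3, 3],
    "ParaIndex": [0, 1, 2, 3],
    "text": ["I am test", "it is easy", "but not so", "much easy"],
})

# Count the rows in each (join, ParaIndex, Chapter) group.
sizes = df.groupby(["join", "ParaIndex", "Chapter"], dropna=False).size()
print(sizes)

# ParaIndex is unique, so every group holds exactly one row
# and ' '.join has nothing to combine.
print(df["ParaIndex"].is_unique)
```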
CodePudding user response:
I hope you can give an example of data and code. And do it step by step rather than just code it in one line without testing. It's hard to help you with this one-line code.
But the main idea is to use merge(..., on='join')
CodePudding user response:
I solved it like this:
dfEdited = (
    df.assign(key=df['join'].ne(df['join'].shift()).cumsum())
      .groupby('key')
      .agg({"ParaIndex": 'first', "Chapter": 'first', 'text': ' '.join})
      .reset_index()
)
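For anyone landing here later, below is a self-contained sketch of the same idea, extended to keep the join column and to renumber ID and ParaIndex as asked in the question. It works because NaN never compares equal to NaN, so every NaN row starts a new run, while adjacent rows with the same join value share one. Assumptions: ID is an ordinary column rather than the index, rows to be merged are always adjacent, and the helper name key is arbitrary.

```python
import numpy as np
import pandas as pd

# Example frame from the question; treating ID as a regular column is an assumption.
df = pd.DataFrame({
    "ID": [0, 1, 2, 3],
    "join": [np.nan, np.nan, 1, 1],
    "Chapter": [1, 2, 3, 3],
    "ParaIndex": [0, 1, 2, 3],
    "text": ["I am test", "it is easy", "but not so", "much easy"],
})

# A new run starts wherever join differs from the previous row.
# NaN != NaN, so each NaN row gets its own run id: here 1, 2, 3, 3.
key = df["join"].ne(df["join"].shift()).cumsum().rename("key")

dfEdited = (
    df.groupby(key)
      .agg({"join": "first", "Chapter": "first", "text": " ".join})
      .reset_index(drop=True)
)

# Renumber ID and ParaIndex, then restore the original column order.
dfEdited["ID"] = np.arange(len(dfEdited))
dfEdited["ParaIndex"] = np.arange(len(dfEdited))
dfEdited = dfEdited[df.columns]
print(dfEdited)
```

Note that this only merges consecutive rows. If rows with the same join value can be separated by other rows, the run-based key will not bring them together and a different grouping key would be needed.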