Pandas groupby on text: get sentence numbering for multiple sentences per group

Time:05-20

My dataframe looks like this:

    id      sentence                                            ind
    747     A simple and convenient colorimetric method is...   NaN
    747     A simple and convenient colorimetric method is...   NaN
    747     A simple and convenient colorimetric method is...   ulcerative 
    749     Of special significance was the increased acti...   NaN
    749     Of special significance was the increased acti...   NaN
    749     Of special significance was the increased acti...   head injuries
    749     Of special significance was the increased acti...   NaN
    858     Some patients with acute viral hepatitis or pr...   acute viral 
    858     Some patients with acute viral hepatitis or pr...   NaN
    858     Some patients with acute viral hepatitis or pr...   NaN
    948     The other ALP isozyme of FL cells had properti...   NaN
    948     The other ALP isozyme of FL cells had properti...   NaN
    948     The other ALP isozyme of FL cells had properti...   NaN
    948     It was found that a human hepatoma-associated ...   NaN
    948     It was found that a human hepatoma-associated ...   hepatoma
    948     It was found that a human hepatoma-associated ...   NaN
    948     It was more heat stable and more sensitive to ...   virus
    948     It was more heat stable and more sensitive to ...   NaN
    948     It was more heat stable and more sensitive to ...   NaN

I'm using df.groupby(['id', 'sentence']).first().head(20) and I get this:

id      sentence                                            ind
747     A simple and convenient colorimetric method is...   NaN
749     Of special significance was the increased acti...   NaN
858     Some patients with acute viral hepatitis or pr...   acute viral 
948      It was found that a human hepatoma-associated...   hepatoma
         It was more heat stable and more sensitive to...   virus

As we can see, for id=948 there is more than one (id, sentence) pair.

My question: is there a way to number the sentences within each id in my dataframe, given that a single id can have more than one (id, sentence) pair?

For example, I would like something like:

id   sentence_nr   sentence                                           ind
747  01            A simple and convenient colorimetric method is...  NaN
749  01            Of special significance was the increased acti...  NaN
858  01            Some patients with acute viral hepatitis or pr...  acute viral 
948  01            It was found that a human hepatoma-associated ...  hepatoma 
948  02            It was more heat stable and more sensitive to ...  virus

CodePudding user response:

You could use GroupBy.cumcount:

df_grouped = df.groupby(['id', 'sentence'], as_index=False).first()
df_grouped['sentence_nr'] = df_grouped.groupby(df_grouped['id']).cumcount() + 1

print(df_grouped)
    id                                           sentence            ind  sentence_nr
0  747  A simple and convenient colorimetric method is...     ulcerative            1
1  749  Of special significance was the increased acti...  head injuries            1
2  858  Some patients with acute viral hepatitis or pr...    acute viral            1
3  948  It was found that a human hepatoma-associated ...       hepatoma            1
4  948  It was more heat stable and more sensitive to ...          virus            2
5  948  The other ALP isozyme of FL cells had properti...           None            3
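
Two details in the output above may differ from what the question asked for: groupby sorts the sentences alphabetically within each id (so "The other ALP isozyme..." ends up as number 3 even though it appears first in the data), and the numbers are plain integers rather than zero-padded strings like 01. A minimal sketch of one way to handle both, assuming order of first appearance is what is wanted; the small dataframe below is a made-up stand-in with shortened sentences, not the original data:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the question's data (sentences shortened).
df = pd.DataFrame({
    "id": [747, 747, 948, 948, 948, 948, 948, 948],
    "sentence": [
        "A simple method", "A simple method",
        "The other ALP isozyme", "The other ALP isozyme",
        "It was found", "It was found",
        "It was more heat stable", "It was more heat stable",
    ],
    "ind": [np.nan, "ulcerative", np.nan, np.nan,
            np.nan, "hepatoma", "virus", np.nan],
})

# sort=False keeps the groups in order of first appearance instead of
# sorting the (id, sentence) keys; first() takes the first non-null
# value of each remaining column per group.
df_grouped = df.groupby(["id", "sentence"], as_index=False, sort=False).first()

# cumcount() numbers rows within each id starting at 0, so add 1,
# then convert to a zero-padded two-character string.
df_grouped["sentence_nr"] = (
    (df_grouped.groupby("id").cumcount() + 1).astype(str).str.zfill(2)
)

print(df_grouped)
```

Note that first() skips missing values, which is why id 747 shows "ulcerative" rather than NaN, and why a sentence with no ind value at all still gets a row (with a null ind) and a number.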