My dataframe looks like this:
id sentence ind
747 A simple and convenient colorimetric method is... NaN
747 A simple and convenient colorimetric method is... NaN
747 A simple and convenient colorimetric method is... ulcerative
749 Of special significance was the increased acti... NaN
749 Of special significance was the increased acti... NaN
749 Of special significance was the increased acti... head injuries
749 Of special significance was the increased acti... NaN
858 Some patients with acute viral hepatitis or pr... acute viral
858 Some patients with acute viral hepatitis or pr... NaN
858 Some patients with acute viral hepatitis or pr... NaN
948 The other ALP isozyme of FL cells had properti... NaN
948 The other ALP isozyme of FL cells had properti... NaN
948 The other ALP isozyme of FL cells had properti... NaN
948 It was found that a human hepatoma-associated ... NaN
948 It was found that a human hepatoma-associated ... hepatoma
948 It was found that a human hepatoma-associated ... NaN
948 It was more heat stable and more sensitive to ... virus
948 It was more heat stable and more sensitive to ... NaN
948 It was more heat stable and more sensitive to ... NaN
I'm using df.groupby(['id', 'sentence']).first().head(20)
and I get this:
pmid sentence ind
747 A simple and convenient colorimetric method is... NaN
749 Of special significance was the increased acti... NaN
858 Some patients with acute viral hepatitis or pr... acute viral
948 It was found that a human hepatoma-associated... hepatoma
It was more heat stable and more sensitive to... virus
As we can see, for id=948 there is more than one (id, sentence) pair.
My question: is there a way to get a sentence number within each id in my dataframe, given that one id can have more than one (id, sentence) pair?
For example, to have something like:
id sentence_nr sentence ind
747 01 A simple and convenient colorimetric method is... NaN
749 01 Of special significance was the increased acti... NaN
858 01 Some patients with acute viral hepatitis or pr... acute viral
948 01 It was found that a human hepatoma-associated ... hepatoma
948 02 It was more heat stable and more sensitive to ... virus
CodePudding user response:
You could use GroupBy.cumcount:
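# One row per (id, sentence); first() keeps the first non-null value in each column
df_grouped = df.groupby(['id', 'sentence'], as_index=False).first()
# cumcount() is zero-based, so add 1 to number the sentences within each id
df_grouped['sentence_nr'] = df_grouped.groupby('id').cumcount() + 1
print(df_grouped)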
id sentence ind sentence_nr
0 747 A simple and convenient colorimetric method is... ulcerative 1
1 749 Of special significance was the increased acti... head injuries 1
2 858 Some patients with acute viral hepatitis or pr... acute viral 1
3 948 It was found that a human hepatoma-associated ... hepatoma 1
4 948 It was more heat stable and more sensitive to ... virus 2
5 948 The other ALP isozyme of FL cells had properti... None 3
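Note that groupby sorts the keys by default, so the numbering above follows the alphabetical order of the sentences ("It was found..." gets 1 although "The other ALP..." appears first in the data). If you would rather number sentences in order of first appearance, and get zero-padded strings like 01 and 02 as in the question, something along these lines might work. This is only a sketch: the small frame below is made up to mirror the question's layout, with shortened sentences.

```python
import pandas as pd

# Made-up sample mirroring the question's columns (sentences shortened).
df = pd.DataFrame({
    "id": [747, 747, 948, 948, 948, 948],
    "sentence": [
        "A simple method", "A simple method",
        "The other ALP isozyme", "The other ALP isozyme",
        "It was found", "It was found",
    ],
    "ind": [None, "ulcerative", None, None, None, "hepatoma"],
})

# sort=False keeps the groups in order of first appearance
# instead of sorting them by key.
out = df.groupby(["id", "sentence"], as_index=False, sort=False).first()

# cumcount() starts at 0, so add 1; zfill(2) pads to "01", "02", ...
out["sentence_nr"] = (
    (out.groupby("id").cumcount() + 1).astype(str).str.zfill(2)
)

print(out)
```

Keep in mind that after zfill the sentence_nr column holds strings, so leave it as an integer if you need to do arithmetic on it later.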