How to check if there are multiple versions using groupby

I want to check whether a group contains documents with different versions. If so, those documents should be written into a new dataframe.

My initial dataframe looks as follows:

document	version	group
abc	1	1
abc	1	1
abc	2	1
testtest	4	1
xyz	3	2
xyz	77	2
abc	3	3
qwertz	10	4
qwertz	9	4
x	1	5
x	1	5

import pandas as pd

d = {'document': ['abc', 'abc', 'abc', 'testtest', 'xyz', 'xyz', 'abc', 'qwertz', 'qwertz', 'x', 'x'], 
    'version': [1,1,2,4,3,77,3,10,9,1,1], 
    'group': [1,1,1,1,2,2,3,4,4,5,5]}
df = pd.DataFrame(data=d)

The dataframe has a relatively large number of entries. How can I do this efficiently?

Output should be the following:

group	document	version
1	abc	1
1	abc	2
2	xyz	3
2	xyz	77
4	qwertz	10
4	qwertz	9

This means, for example, that group "1" contains the document "abc" in two different versions, namely versions "1" and "2". A document that occurs several times in a group but always with the same version (document "x") should not be listed.

CodePudding user response:

You can use masks for boolean indexing:

# is the full row not duplicated?
m1 = ~df.duplicated()
# is there more than one version per document within a group?
m2 = df.groupby(['document', 'group'])['version'].transform('nunique').gt(1)

out = df[m1&m2] # keep if both conditions are met

output:

  document  version  group
0      abc        1      1
2      abc        2      1
4      xyz        3      2
5      xyz       77      2
7   qwertz       10      4
8   qwertz        9      4
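
Since the question mentions a large dataframe, here is a possible alternative that avoids groupby entirely. It is only a sketch and has not been benchmarked: it assumes that once exact (document, version, group) duplicates are dropped, any (document, group) pair that still appears more than once must have differing versions. The names unique_rows and out are chosen for illustration, and the final column reordering is only there to match the layout asked for in the question.

```python
import pandas as pd

d = {'document': ['abc', 'abc', 'abc', 'testtest', 'xyz', 'xyz', 'abc', 'qwertz', 'qwertz', 'x', 'x'],
     'version': [1, 1, 2, 4, 3, 77, 3, 10, 9, 1, 1],
     'group': [1, 1, 1, 1, 2, 2, 3, 4, 4, 5, 5]}
df = pd.DataFrame(data=d)

# step 1: drop rows that repeat the same document, version and group
unique_rows = df.drop_duplicates(subset=['document', 'version', 'group'])

# step 2: a (document, group) pair that still occurs more than once
# must now carry at least two different versions; keep all of its rows
multi = unique_rows.duplicated(subset=['document', 'group'], keep=False)

# step 3: filter and reorder the columns as in the desired output
out = unique_rows.loc[multi, ['group', 'document', 'version']]
print(out)
```

On the sample data this keeps the same rows (index 0, 2, 4, 5, 7, 8) as the mask-based approach above. Whether it is actually faster on your data is something to measure, for example with %timeit.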