How to check if there are multiple versions using groupby

Time:06-25

I want to check whether any group contains the same document in different versions. If so, those rows should be written to a new dataframe.

My initial dataframe looks as follows:

document  version  group
abc             1      1
abc             1      1
abc             2      1
testtest        4      1
xyz             3      2
xyz            77      2
abc             3      3
qwertz         10      4
qwertz          9      4
x               1      5
x               1      5

import pandas as pd

d = {'document': ['abc', 'abc', 'abc', 'testtest', 'xyz', 'xyz', 'abc', 'qwertz', 'qwertz', 'x', 'x'], 
    'version': [1,1,2,4,3,77,3,10,9,1,1], 
    'group': [1,1,1,1,2,2,3,4,4,5,5]}
df = pd.DataFrame(data=d)

The dataframe has a relatively large number of rows. How can I do this efficiently?

Output should be the following:

group  document  version
1      abc             1
1      abc             2
2      xyz             3
2      xyz            77
4      qwertz         10
4      qwertz          9

This means, for example, that group "1" contains the document "abc" in two different versions, namely "1" and "2". A document that occurs several times in a group but always with the same version should not be listed (document "x").
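To make the rule concrete, here is a minimal sketch that counts the distinct versions of each (group, document) pair in the sample data. The names version_counts and multi_version_pairs are only illustrative. Pairs with a count above 1 are the ones that should appear in the result; for the sample data these are (1, abc), (2, xyz) and (4, qwertz).

```python
import pandas as pd

df = pd.DataFrame({
    'document': ['abc', 'abc', 'abc', 'testtest', 'xyz', 'xyz',
                 'abc', 'qwertz', 'qwertz', 'x', 'x'],
    'version': [1, 1, 2, 4, 3, 77, 3, 10, 9, 1, 1],
    'group': [1, 1, 1, 1, 2, 2, 3, 4, 4, 5, 5],
})

# Number of distinct versions for each (group, document) pair.
version_counts = df.groupby(['group', 'document'])['version'].nunique()

# Pairs with more than one distinct version are the ones to report.
multi_version_pairs = version_counts[version_counts > 1].index.tolist()
print(multi_version_pairs)
```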

CodePudding user response:

You can use masks for boolean indexing:

# is the full row not duplicated?
m1 = ~df.duplicated()
# is there more than one version per document and group?
m2 = df.groupby(['document', 'group'])['version'].transform('nunique').gt(1)

out = df[m1&m2] # keep if both conditions are met

output:

  document  version  group
0      abc        1      1
2      abc        2      1
4      xyz        3      2
5      xyz       77      2
7   qwertz       10      4
8   qwertz        9      4
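
Since the question asks about performance, here is a possible alternative sketch that avoids groupby: drop the exact duplicate rows first, then keep every row whose (document, group) pair still occurs more than once. Whether it is actually faster than the transform('nunique') approach depends on the data, so it is worth timing both on the real dataframe. The names unique_rows and has_multiple_versions are only illustrative, and the column reordering is an assumption based on the layout requested in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'document': ['abc', 'abc', 'abc', 'testtest', 'xyz', 'xyz',
                 'abc', 'qwertz', 'qwertz', 'x', 'x'],
    'version': [1, 1, 2, 4, 3, 77, 3, 10, 9, 1, 1],
    'group': [1, 1, 1, 1, 2, 2, 3, 4, 4, 5, 5],
})

# Step 1: drop rows that are exact duplicates
# (same document, same version, same group).
unique_rows = df.drop_duplicates()

# Step 2: after step 1, a (document, group) pair that still appears
# more than once must carry at least two different versions.
has_multiple_versions = unique_rows.duplicated(
    subset=['document', 'group'], keep=False
)

# Reorder the columns to match the layout requested in the question.
out = unique_rows.loc[has_multiple_versions, ['group', 'document', 'version']]
print(out)
```

On the sample data this selects the same rows (index 0, 2, 4, 5, 7, 8) as the mask-based answer.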