I want to check whether a group contains the same document in different versions. If so, those rows should be written to a new dataframe.
My initial dataframe looks as follows:
document | version | group |
---|---|---|
abc | 1 | 1 |
abc | 1 | 1 |
abc | 2 | 1 |
testtest | 4 | 1 |
xyz | 3 | 2 |
xyz | 77 | 2 |
abc | 3 | 3 |
qwertz | 10 | 4 |
qwertz | 9 | 4 |
x | 1 | 5 |
x | 1 | 5 |
import pandas as pd
d = {'document': ['abc', 'abc', 'abc', 'testtest', 'xyz', 'xyz', 'abc', 'qwertz', 'qwertz', 'x', 'x'],
     'version': [1, 1, 2, 4, 3, 77, 3, 10, 9, 1, 1],
     'group': [1, 1, 1, 1, 2, 2, 3, 4, 4, 5, 5]}
df = pd.DataFrame(data=d)
The dataframe has a relatively large number of rows. How can I do this with reasonably good performance?
Output should be the following:
group | document | version |
---|---|---|
1 | abc | 1 |
1 | abc | 2 |
2 | xyz | 3 |
2 | xyz | 77 |
4 | qwertz | 10 |
4 | qwertz | 9 |
This means, for example, that group "1" contains the document "abc" in two different versions, namely "1" and "2". A document that occurs several times in a group but always with the same version should not be listed (document "x").
CodePudding user response:
You can use masks for boolean indexing:
# is the full row not duplicated?
m1 = ~df.duplicated()
# is there more than one version per (document, group) pair?
m2 = df.groupby(['document', 'group'])['version'].transform('nunique').gt(1)
out = df[m1&m2] # keep if both conditions are met
output:
document version group
0 abc 1 1
2 abc 2 1
4 xyz 3 2
5 xyz 77 2
7 qwertz 10 4
8 qwertz 9 4
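
Since performance was part of the question, here is a possible variant that avoids the `groupby`. It is only a sketch, and it assumes there are no missing values in `version` (NaN is treated differently by `nunique` and by `duplicated`, so the two approaches could disagree there). The names `unique_rows` and `has_several_versions` are made up for this example. Whether it is actually faster depends on your data, so it is worth timing both on a real sample.

```python
import pandas as pd

d = {'document': ['abc', 'abc', 'abc', 'testtest', 'xyz', 'xyz', 'abc', 'qwertz', 'qwertz', 'x', 'x'],
     'version': [1, 1, 2, 4, 3, 77, 3, 10, 9, 1, 1],
     'group': [1, 1, 1, 1, 2, 2, 3, 4, 4, 5, 5]}
df = pd.DataFrame(data=d)

# Step 1: collapse exact repeats so each (document, version, group) row appears once.
unique_rows = df.drop_duplicates()

# Step 2: after step 1, a (document, group) pair that still occurs more than once
# must have at least two distinct versions. keep=False flags every such row.
has_several_versions = unique_rows.duplicated(subset=['document', 'group'], keep=False)

# Keep the flagged rows and order the columns as in the requested output.
out = unique_rows.loc[has_several_versions, ['group', 'document', 'version']]
print(out)
```

On the sample data this selects the same rows as the mask approach above (original index 0, 2, 4, 5, 7, 8).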