I have a csv file that looks something like this:
mark | time | value1 | value2 |
---|---|---|---|
1 | 14:22:02 | 5 | 2 |
1 | 14:22:05 | 8 | 4 |
2 | 14:25:02 | 1 | 1 |
2 | 14:26:05 | 4 | 7 |
3 | 15:12:08 | 5 | 2 |
3 | 15:12:11 | 5 | 4 |
3 | 15:12:15 | 5 | 2 |
3 | 15:12:17 | 8 | 4 |
I would like to output all the matches by label 1 and 3
Expected result:
Number of matches is the number of intersections with the same symbols of the label 1 and 3 That is, if there are 5 in mark 1 and Value 1 column, then it counts the entire number of intersections with mark3 in Value 1
By two columns of value
mark | value1 | value2 | Number of matches |
---|---|---|---|
1-3 | 5 | 2 | 2 |
1-3 | 8 | 4 | 1 |
For value 1
mark | value1 | Number of matches |
---|---|---|
1-3 | 5 | 3 |
1-3 | 8 | 1 |
For value 2
mark | value2 | Number of matches |
---|---|---|
1-3 | 2 | 2 |
1-3 | 4 | 2 |
CodePudding user response:
You can use a groupby
on the filtered DataFrame, then filter again to have a count > 1:
target = ['value1', 'value2']
(df.loc[df['mark'].isin([1,3])]
.astype({'mark': 'str'})
.groupby(target, as_index=False)
.agg(**{'mark': ('mark', lambda g: '-'.join(dict.fromkeys(g))),
'Num matches': ('mark', 'count')
})
.loc[lambda d: d['Num matches'].gt(1)]
)
Output:
value1 value2 mark Num matches
0 5 2 1-3 3
2 8 4 1-3 2