Say i have something like this in a pandas dataframe :
Entity | Type | Doc | Proj |
---|---|---|---|
Daniel | PER | 1 | 1 |
Daniel | PER | 4 | 2 |
Daniel | PER | 5 | 3 |
Daniel | PER | 9 | 6 |
Daniel | LOC | 7 | 4 |
905-888-8988 | ID | 3 | 1 |
905-888-8988 | ID | 4 | 2 |
905-888-8988 | ID | 14 | 8 |
For each combo of entity and type that is recurring, i'd like to add two new columns for the doc and proj corresponding to the match. I'd like to do this to all possible match between combo of entity.
Edit 1 More detailed explanation to get to the expected outcome
Step 1 - Identify if an "Entity" and "Type" combo has more than 1 occurence in the dataframe.
Step 2 - For each combo that have more than 1 occurence, i would need to represent all possible combination of "Doc" and "Proj" of the combos.
Step 3 - All these possible combination should be represented in pairs of "doc" and "proj"
So the result would look like this in the pandas dataframe :
Entity | Type | Doc1 | Proj1 | Doc2 | Proj2 |
---|---|---|---|---|---|
Daniel | PER | 1 | 1 | 4 | 2 |
Daniel | PER | 1 | 1 | 5 | 3 |
Daniel | PER | 1 | 1 | 9 | 6 |
Daniel | PER | 4 | 2 | 5 | 3 |
Daniel | PER | 4 | 2 | 9 | 6 |
Daniel | PER | 5 | 3 | 9 | 6 |
905-888-8988 | ID | 3 | 1 | 4 | 2 |
905-888-8988 | ID | 3 | 1 | 14 | 8 |
905-888-8988 | ID | 4 | 2 | 14 | 8 |
Thanks all for the help
CodePudding user response:
here is one way:
- reset_index to copy the index as a column
- use merge to join df with itself on entity and type columns
- remove the duplicate pairs by only keeping the smaller index of original right side of merge.
df = df.reset_index()
res = pd.merge(df, df, on=['Entity', 'Type'],suffixes=['1', '2'])
res = res.loc[res.index1 < res.index2].drop(columns=['index1', 'index2']).reset_index(drop=True)
output:
>>
Entity Type Doc1 Proj1 Doc2 Proj2
0 Daniel PER 1 1 4 2
1 Daniel PER 1 1 5 3
2 Daniel PER 1 1 9 6
3 Daniel PER 4 2 5 3
4 Daniel PER 4 2 9 6
5 Daniel PER 5 3 9 6
6 905-888-8988 ID 3 1 4 2
7 905-888-8988 ID 3 1 14 8
8 905-888-8988 ID 4 2 14 8