Home > other >  Merge two dfs with duplicate key values in second df
Merge two dfs with duplicate key values in second df

Time:11-30

How can I merge two dataframes on a key that may contain multiple events with the same key? Ideally, I would like to create records with a subset of the data, and insert them as a list of dictionaries in the column but I am open to any possible solutions.

I appreciate any and all feedback. Thank you for your time and assistance.

First Dataframe:

              Timestamp,  type,   id, Address 
2022-11-28 21:28:20.793, Owner, 7249, "8675 Jenny Lane, City, State, Zip"
2022-11-28 21:28:20.793, Owner, 7250, "309 Jenny Court, City, State, Zip"

Second Dataframe:

              Timestamp,   type,   id, inc, string_1, number1
2022-11-28 21:28:20.793, Animal, 7249,   0,     cat1, 3121435
2022-11-28 21:28:20.793, Animal, 7249,   1,     cat2, 3121435
2022-11-28 21:28:20.793, Animal, 7249,   2,     dog1, 3121558
2022-11-28 21:28:20.793, Animal, 7249,   3,   mouse1, 3121559
2022-11-28 21:28:20.793, Animal, 7250,   0,   mouse2, 3121435

Hopeful Resulting dataframe after some voodoo:

              Timestamp,  type,   id,                             Address,  Animals
2022-11-28 21:28:20.793, Owner, 7249, "8675 Jenny Lane, City, State, Zip", [{0, cat1, 3121435}, {1, cat2, 3121435}, {2, dog1, 3121558}, {3, mouse1, 3121558}]
2022-11-28 21:28:20.793, Owner, 7250, "309 Jenny Court, City, State, Zip", [{0, mouse2, 3121435}]

I dont currently care about the Timestamp field being added to the final result for each individual animal record and it need not match the first dataframe or be correlated at all.

CodePudding user response:

This could be it

import pandas as pd
import numpy as np

df1 = pd.DataFrame({
    "Timestamp": ["2022-11-28 21:28:20.793", "2022-11-28 21:28:20.793"],
    "type": ["Owner", "Owner"],
    "id": [7249, 7250],
    "Address": ["8675 Jenny Lane, City, State, Zip", "309 Jenny Court, City, State, Zip"]
})

df2 = pd.DataFrame({
    "Timestamp": ["2022-11-28 21:28:20.793", "2022-11-28 21:28:20.793", "2022-11-28 21:28:20.793", "2022-11-28 21:28:20.793", "2022-11-28 21:28:20.793"],
    "type": ["Animal", "Animal", "Animal", "Animal", "Animal"],
    "id": [7249, 7249, 7249, 7249, 7250],
    "inc": [0, 1, 2, 3, 0],
    "string_1": ["cat1", "cat2", "dog1", "mouse1", "mouse2"],
    "number1": [3121435, 3121435, 3121558, 3121559, 3121435]
})

df1 = df1.set_index("id")
df2 = df2.set_index("id")
df2 = df2.groupby("id").apply(lambda x: x.to_dict(orient="records"))
df1["Animals"] = df2

print(df1)

# Result:
# Timestamp,  type,   id,                             Address,  Animals
# 2022-11-28 21:28:20.793, Owner, 7249, "8675 Jenny Lane, City, State, Zip", [{0, cat1, 3121435}, {1, cat2, 3121435}, {2, dog1, 3121558}, {3, mouse1, 3121558}]
# 2022-11-28 21:28:20.793, Owner, 7250, "309 Jenny Court, City, State, Zip", [{0, mouse2, 3121435}]
  • Related