I plan to calculate distance between categorical data (Distance From First Dataset to Second Dataset)
Here's my dataset
First Dataset
Id City Phone
A1 Jakarta Samsung
A2 Surabaya Apple
A3 Singapore Sony
Second Dataset (as Input)
Id City Phone
1 Jakarta Samsung
2 Singapore Xiaomi
3 Surabaya Sony
Reference table for City
Jakarta Surabaya Singapore
Jakarta 0 1 2
Surabaya 1 0 1
Singapore 2 1 0
Reference table for Phone
Apple Samsung Sony Xiaomi
Apple 0 1 2 3
Samsung 1 0 1 2
Sony 2 1 0 1
Xiaomi 3 2 1 0
Second Dataset (as Output) Calculation
Id City Phone A1 A2 A3
1 Jakarta Samsung 0 0 1 1 2 1
2 Singapore Xiaomi 1 2 2 3 0 1
3 Surabaya Sony 1 1 0 2 2 0
As I try to calculate distance between every single rows on first dataset and second dataset Final Output
Id City Phone A1 A2 A3
1 Jakarta Samsung 0 2 3
2 Singapore Xiaomi 3 5 1
3 Surabaya Sony 2 2 2
Notes:
A1, A2, A3
is rows in first dataset
What I did is using a lot of joins, but the amount of code is too much
CodePudding user response:
Here is one way to do it. You can cross join the first and second datasets and use .loc lookups by city and phone.
PS - the count you have given in the expected output is incorrect as mentioned in the comments.
df1 = pd.DataFrame(data=[ ["A1","Jakarta","Samsung"], ["A2","Surabaya","Apple"], ["A3","Singapore","Sony"] ], columns=["Id","City","Phone"])
df2 = pd.DataFrame(data=[ [1,"Jakarta","Samsung"], [2,"Singapore","Xiaomi"], [3,"Surabaya","Sony"] ], columns=["Id","City","Phone"])
city_df = pd.DataFrame(data=[ ["Jakarta",0,1,2], ["Surabaya",1,0,1], ["Singapore",2,1,0] ], columns=["city","Jakarta","Surabaya","Singapore"]).set_index("city")
phone_df = pd.DataFrame(data=[ ["Apple",0,1,2,3], ["Samsung",1,0,1,2], ["Sony",2,1,0,1], ["Xiaomi",3,2,1,0] ], columns=["phone","Apple","Samsung","Sony","Xiaomi"]).set_index("phone")
df1["dummy_key"] = 0
df2["dummy_key"] = 0
final_df = df2.merge(df1, on="dummy_key", how="outer")
final_df["dist"] = final_df.apply(lambda row: city_df.loc[row["City_x"], row["City_y"]] phone_df.loc[row["Phone_x"], row["Phone_y"]], axis=1)
final_df = final_df.drop(["City_y", "Phone_y", "dummy_key"], axis=1)
final_df = final_df.rename(columns={"Id_x": "Id", "City_x": "City", "Phone_x": "Phone"})
final_df = final_df.pivot(index=["Id", "City", "Phone"], columns="Id_y", values="dist").reset_index()
Output:
Id City Phone A1 A2 A3
1 Jakarta Samsung 0 2 3
2 Singapore Xiaomi 4 4 1
3 Surabaya Sony 2 2 1
CodePudding user response:
You can use cross join for all combinations of rows between both DataFrames, then add new columns by DataFrame.join
, sum both columns and last reshape by DataFrame.pivot
:
df = df2.merge(df1, how='cross', suffixes=('','_'))
df = df.join(City.stack().rename('C'), on=['City','City_'])
df = df.join(Phone.stack().rename('P'), on=['Phone','Phone_'])
df['Both'] = df['C'].add(df['P'])
print (df)
Id City Phone Id_ City_ Phone_ C P Both
0 1 Jakarta Samsung A1 Jakarta Samsung 0 0 0
1 1 Jakarta Samsung A2 Surabaya Apple 1 1 2
2 1 Jakarta Samsung A3 Singapore Sony 2 1 3
3 2 Singapore Xiaomi A1 Jakarta Samsung 2 2 4
4 2 Singapore Xiaomi A2 Surabaya Apple 1 3 4
5 2 Singapore Xiaomi A3 Singapore Sony 0 1 1
6 3 Surabaya Sony A1 Jakarta Samsung 1 1 2
7 3 Surabaya Sony A2 Surabaya Apple 0 2 2
8 3 Surabaya Sony A3 Singapore Sony 1 0 1
df = df.pivot(['Id','City','Phone'],'Id_','Both').rename_axis(None, axis=1).reset_index()
print (df)
Id City Phone A1 A2 A3
0 1 Jakarta Samsung 0 2 3
1 2 Singapore Xiaomi 4 4 1
2 3 Surabaya Sony 2 2 1