I have three different dataframes with basketball players' data, and all three contain the players' names.
I want to combine all three dataframes into one EntitySet so that I can use automated feature generation with featuretools.
As I understand it, I need to create an integer key in each of the three dataframes and use it to join them, and the same player must have the same id in all three.
How can I create unique integer keys for 3 different datasets, ensuring that the same players have the same ids?
CodePudding user response:
You do not need to create an integer key to create the relationships. If your names are unique, you can use them directly when defining the relationships.
import pandas as pd
import featuretools as ft

# Parent dataframe: one row per player, indexed by the (unique) name.
players = pd.DataFrame({
    "name": ["John", "Jane", "Bill"],
    "date": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01"]),
    "other_data": [100, 200, 300],
})

# Child dataframe: one row per game, referring to players by name.
scores = pd.DataFrame({
    "game_id": [0, 1, 2],
    "player": ["John", "John", "Jane"],
    "score": [24, 17, 29],
})

es = ft.EntitySet()
es.add_dataframe(dataframe_name="players", dataframe=players, index="name")
es.add_dataframe(dataframe_name="scores", dataframe=scores, index="game_id")

# Arguments: parent dataframe, parent column, child dataframe, child column.
es.add_relationship("players", "name", "scores", "player")
If your player names are not unique, then the names alone cannot give you a unique id, integer or otherwise. You would have to combine the name with some other piece of information (something like team) to create a new column that uniquely identifies each player in all of your dataframes.
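As a rough sketch of that idea using only pandas: the example below assumes every dataframe has a team column alongside the name, which the question does not confirm. The data, the `team` column, the `make_player_key` helper, and the `player_key` and `player_id` column names are all hypothetical. It builds a combined key in each dataframe and then, if integer ids are still wanted, derives them from one shared mapping so that the same player gets the same id everywhere.

```python
import pandas as pd

# Hypothetical data: two different players share the name "John".
players = pd.DataFrame({
    "name": ["John", "John", "Jane"],
    "team": ["Hawks", "Bulls", "Hawks"],
    "height_cm": [198, 205, 185],
})
scores = pd.DataFrame({
    "game_id": [0, 1, 2, 3],
    "player": ["John", "John", "Jane", "John"],
    "team": ["Hawks", "Bulls", "Hawks", "Hawks"],
    "score": [24, 17, 29, 31],
})


def make_player_key(names: pd.Series, teams: pd.Series) -> pd.Series:
    """Combine name and team into one string key such as 'John|Hawks'."""
    return names.str.strip() + "|" + teams.str.strip()


# Build the same key, the same way, in every dataframe.
players["player_key"] = make_player_key(players["name"], players["team"])
scores["player_key"] = make_player_key(scores["player"], scores["team"])

# The key has to be unique in the parent dataframe to serve as its index.
assert players["player_key"].is_unique

# Optional: derive integer ids from a single mapping built on the parent,
# then apply that same mapping to every other dataframe.
key_to_id = {key: i for i, key in enumerate(players["player_key"])}
players["player_id"] = players["player_key"].map(key_to_id)
scores["player_id"] = scores["player_key"].map(key_to_id)
```

Either the combined key or the integer id could then take the place of the name as the index of the players dataframe and as the child column in the relationship. A third dataframe would get its key from the same helper and its ids from the same mapping.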