I am trying to figure out how Featuretools works and I am testing it on the Housing Prices dataset on Kaggle. Because the dataset is huge, I'll work here with only a subset of it.
The dataframe is:
train={'Id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}, 'MSSubClass': {0: 60, 1: 20, 2: 60, 3: 70, 4: 60}, 'MSZoning': {0: 'RL', 1: 'RL', 2: 'RL', 3: 'RL', 4: 'RL'}, 'LotFrontage': {0: 65.0, 1: 80.0, 2: 68.0, 3: 60.0, 4: 84.0}, 'LotArea': {0: 8450, 1: 9600, 2: 11250, 3: 9550, 4: 14260}}
I create an EntitySet for this dataframe:
es_train = ft.EntitySet()
I add the dataframe to the created EntitySet:
es_train.add_dataframe(dataframe_name='train', dataframe=train, index='Id')
Then I call the function:
ap, tp = ft.get_valid_primitives(entityset=es_train, target_dataframe_name='train')
And here it all breaks down, because I get the following error message:
KeyError: 'DataFrame train does not exist in entity set'
I tried to study the tutorials on the Featuretools site, but all I could find are tutorials with multiple dataframes, so it didn't help me at all.
Where am I going wrong, and how can I fix it?
Thanks!
Later edit: I am using PyCharm. When I work in script mode, I get the error above. However, when I use the command line, everything works perfectly.
CodePudding user response:
The only issue I see with your code is that you're not wrapping your train object in pd.DataFrame, so train is a plain dict rather than a DataFrame.
This code works well for me:
import featuretools as ft
import pandas as pd
train = pd.DataFrame({
    'Id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
    'MSSubClass': {0: 60, 1: 20, 2: 60, 3: 70, 4: 60},
    'MSZoning': {0: 'RL', 1: 'RL', 2: 'RL', 3: 'RL', 4: 'RL'},
    'LotFrontage': {0: 65.0, 1: 80.0, 2: 68.0, 3: 60.0, 4: 84.0},
    'LotArea': {0: 8450, 1: 9600, 2: 11250, 3: 9550, 4: 14260}
})
es_train = ft.EntitySet()
es_train.add_dataframe(dataframe_name='train', dataframe=train, index='Id')
_, tp = ft.get_valid_primitives(entityset=es_train, target_dataframe_name='train')
for p in tp:
    print(p.name)
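If you want a quick sanity check before calling add_dataframe, something like the sketch below may help. It uses only pandas, and the helper name as_dataframe is hypothetical (it is not part of Featuretools). It confirms that train really is a DataFrame and that the Id column has unique values, which is what you would want from an index column.

```python
import pandas as pd

# Same nested-dict sample as in the question (the shape produced by DataFrame.to_dict()).
raw = {
    'Id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
    'MSSubClass': {0: 60, 1: 20, 2: 60, 3: 70, 4: 60},
    'MSZoning': {0: 'RL', 1: 'RL', 2: 'RL', 3: 'RL', 4: 'RL'},
    'LotFrontage': {0: 65.0, 1: 80.0, 2: 68.0, 3: 60.0, 4: 84.0},
    'LotArea': {0: 8450, 1: 9600, 2: 11250, 3: 9550, 4: 14260},
}


def as_dataframe(obj):
    """Hypothetical helper: return obj unchanged if it is already a DataFrame,
    otherwise build a DataFrame from it (e.g. from a dict of columns)."""
    if isinstance(obj, pd.DataFrame):
        return obj
    return pd.DataFrame(obj)


train = as_dataframe(raw)

# Checks worth running before handing the object to an EntitySet.
print(type(train).__name__)    # class name of the converted object
print(train['Id'].is_unique)   # whether 'Id' can serve as a unique index
print(train.dtypes)            # column types pandas inferred
```

Running the same check in both the PyCharm script and the console should show whether train is the same kind of object in both places.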