How can I use get_valid_primitives when I have only one dataframe in Featuretools?

I am trying to figure out how Featuretools works, and I am testing it on the Housing Prices dataset from Kaggle. Because the dataset is huge, I'll work here with only a subset of it.

The dataframe is:

train={'Id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}, 'MSSubClass': {0: 60, 1: 20, 2: 60, 3: 70, 4: 60}, 'MSZoning': {0: 'RL', 1: 'RL', 2: 'RL', 3: 'RL', 4: 'RL'}, 'LotFrontage': {0: 65.0, 1: 80.0, 2: 68.0, 3: 60.0, 4: 84.0}, 'LotArea': {0: 8450, 1: 9600, 2: 11250, 3: 9550, 4: 14260}}

I create an EntitySet for this dataframe:

es_train = ft.EntitySet()

I add the dataframe to the created EntitySet:

es_train.add_dataframe(dataframe_name='train', dataframe=train, index='Id')

Then I call the function:

ap, tp = ft.get_valid_primitives(entityset=es_train, target_dataframe_name='train')

And here it all breaks down, because I get the following error message:

KeyError: 'DataFrame train does not exist in entity set'

I tried to study the tutorials on the Featuretools site, but all I could find were tutorials with multiple dataframes, so they didn't help me at all.

Where am I going wrong? How can I correct the mistake(s)?

Thanks!

Later edit: I am using PyCharm. When I work in script mode, I get the error above. However, when I use the command line, everything works perfectly.

CodePudding user response:

The only issue I see with your code is that you're not wrapping your train object in pd.DataFrame, so train is a plain dict rather than a DataFrame.

This code works well for me:

import featuretools as ft
import pandas as pd

train = pd.DataFrame({
    'Id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
    'MSSubClass': {0: 60, 1: 20, 2: 60, 3: 70, 4: 60},
    'MSZoning': {0: 'RL', 1: 'RL', 2: 'RL', 3: 'RL', 4: 'RL'},
    'LotFrontage': {0: 65.0, 1: 80.0, 2: 68.0, 3: 60.0, 4: 84.0},
    'LotArea': {0: 8450, 1: 9600, 2: 11250, 3: 9550, 4: 14260}
})

es_train = ft.EntitySet()
es_train.add_dataframe(dataframe_name='train', dataframe=train, index='Id')

_, tp = ft.get_valid_primitives(entityset=es_train, target_dataframe_name='train')

# Print the name of each primitive in tp
for p in tp:
    print(p.name)
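
To see the difference between the raw dict from the question and the wrapped DataFrame, here is a minimal sketch that uses only pandas. The names raw, train_df, is_df_before and is_df_after are my own, picked for illustration; they do not come from the question or from Featuretools.

```python
import pandas as pd

# Same nested-dict layout as in the question: {column: {row_label: value}}
raw = {
    'Id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
    'MSSubClass': {0: 60, 1: 20, 2: 60, 3: 70, 4: 60},
    'MSZoning': {0: 'RL', 1: 'RL', 2: 'RL', 3: 'RL', 4: 'RL'},
    'LotFrontage': {0: 65.0, 1: 80.0, 2: 68.0, 3: 60.0, 4: 84.0},
    'LotArea': {0: 8450, 1: 9600, 2: 11250, 3: 9550, 4: 14260},
}

# The object from the question is a plain dict, not a DataFrame
is_df_before = isinstance(raw, pd.DataFrame)

# Wrapping it turns the outer keys into columns and the inner keys into the row index
train_df = pd.DataFrame(raw)
is_df_after = isinstance(train_df, pd.DataFrame)

print(type(raw).__name__, '->', type(train_df).__name__)
print(train_df.shape)
```

Running a check like this in both PyCharm script mode and on the command line may help show whether the two environments are really working with the same kind of object.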