I'm working with a high-dimensional dataset of about 180K recipes by ~8000 ingredients. I made the values binary, depending on whether or not an ingredient is included in the recipe. When I run KModes after replacing the NaNs with 0s ('''data = data.replace(np.nan, 0)'''), I end up with one dense cluster (from the zeros) and a single recipe in each of the other clusters, because similarity is based on both the 1's and the 0's. So the question is: what can I set these NaNs to so that KModes does not take them into account?
from kmodes.kmodes import KModes
km_cao = KModes(n_clusters=20, init = "Cao", n_init = 1, verbose=1)
fitClusters_cao = km_cao.fit_predict(data)
fitClusters_cao
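To make the problem concrete, here is a minimal pure-Python sketch of simple matching dissimilarity, which is what KModes uses by default (it counts the positions where two rows differ). The helper names `matching_dissim` and `one_hot` are made up for this illustration and are not the kmodes API; the ingredient IDs are taken from the example data further down.

```python
# Illustrative sketch only: these helpers are hypothetical, not the kmodes API.

def matching_dissim(a, b):
    """Count the positions where two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))

n_ingredients = 8000  # roughly the number of ingredient columns in the question

def one_hot(ingredient_ids, size=n_ingredients):
    """Binary vector with a 1 at every listed ingredient index."""
    present = set(ingredient_ids)
    return [1 if i in present else 0 for i in range(size)]

recipe_a = one_hot([389, 7655, 6270, 1527, 3406])
recipe_b = one_hot([1257, 7655, 6270])   # shares 7655 and 6270 with recipe_a
all_zero = [0] * n_ingredients           # what a zero-dominated mode looks like

d_ab = matching_dissim(recipe_a, recipe_b)  # recipes that share two ingredients
d_a0 = matching_dissim(recipe_a, all_zero)  # recipe_a vs. the empty "recipe"
d_b0 = matching_dissim(recipe_b, all_zero)  # recipe_b vs. the empty "recipe"
agree = n_ingredients - d_ab                # positions where a and b agree

print(d_ab, d_a0, d_b0, agree)
```

Almost every position agrees simply because both recipes lack the ingredient, and `recipe_b` ends up closer to the all-zero vector than to a recipe that shares two of its three ingredients. That is consistent with everything collapsing into one dense cluster.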
Example:
import pandas as pd
import numpy as np
data_as_dict = {'recipe_id': {0: 424415, 1: 424415, 2: 424415, 3: 424415, 4: 424415, 5: 146223, 6: 146223, 7: 146223, 8: 146223, 9: 146223, 10: 146223, 11: 146223, 12: 146223, 13: 146223, 14: 146223, 15: 146223, 16: 146223, 17: 312329, 18: 312329, 19: 312329}, 'ingredient_ids': {0: 389, 1: 7655, 2: 6270, 3: 1527, 4: 3406, 5: 2683, 6: 4969, 7: 800, 8: 5298, 9: 840, 10: 2499, 11: 6632, 12: 7022, 13: 1511, 14: 3248, 15: 4964, 16: 6270, 17: 1257, 18: 7655, 19: 6270}}
df = pd.DataFrame.from_dict(data_as_dict)
# Number of rows in which each ingredient appears, aligned to the original rows
df['counts'] = df.groupby('ingredient_ids')['ingredient_ids'].transform('count')
# Copy so that adding a column below does not modify a view of df
data_exploded = df[['recipe_id', 'ingredient_ids', 'counts']].copy()
data_exploded['count'] = 1
data_exploded = data_exploded.drop('counts', axis = 1)
data_exploded = data_exploded.pivot_table(values = 'count', index = 'recipe_id', columns='ingredient_ids')
data_exploded = data_exploded.replace(np.nan, 0)
from kmodes.kmodes import KModes
km_cao = KModes(n_clusters=20, init = "Cao", n_init = 1, verbose=1)
fitClusters_cao = km_cao.fit_predict(data_exploded)
fitClusters_cao
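For reference, the same recipe-by-ingredient matrix can probably be built more directly. The following is only a sketch under the assumption that a plain 0/1 indicator matrix is what is wanted; it uses `pd.crosstab` on the three example recipes, and the names `recipe_ingredients` and `binary` are introduced here for illustration.

```python
import pandas as pd

# The three example recipes from above, keyed by recipe_id
recipe_ingredients = {
    424415: [389, 7655, 6270, 1527, 3406],
    146223: [2683, 4969, 800, 5298, 840, 2499, 6632, 7022, 1511, 3248, 4964, 6270],
    312329: [1257, 7655, 6270],
}

# Long format: one row per (recipe, ingredient) pair
df = pd.DataFrame(
    [(rid, ing) for rid, ings in recipe_ingredients.items() for ing in ings],
    columns=['recipe_id', 'ingredient_ids'],
)

# crosstab counts the pairs; clip(upper=1) turns the counts into 0/1 indicators
binary = pd.crosstab(df['recipe_id'], df['ingredient_ids']).clip(upper=1)

print(binary.shape)
print(binary.sum(axis=1))  # number of ingredients per recipe
```

This skips the intermediate 'counts' and 'count' columns and never produces NaNs, but the absent ingredients are still encoded as 0, so the clustering issue described above is unchanged.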
CodePudding user response:
The way around this is to convert everything to string values (which kmodes can apparently handle). So in pivot_table(), set fill_value = '', and if you are using binary data, also convert the 1's (and 0's) to string values.
'''
data_exploded[['count']] = '1'
data_exploded = data_exploded.drop('counts', axis = 1)
#data_exploded[['count']] = data_exploded[['count']].astype(int)
data_exploded = data_exploded.pivot_table(index = 'recipe_id', columns='ingredient_ids', values = 'count', fill_value = '', aggfunc='sum')
data_exploded
#data_exploded = data_exploded.replace(0, Na)
#data_exploded = data_exploded.replace(np.nan, 0)
from kmodes.kmodes import KModes
km_cao = KModes(n_clusters=25, init = "Cao", n_init = 1, verbose=1)
fitClusters_cao = km_cao.fit_predict(data_exploded)
fitClusters_cao
'''