Having trouble expanding/normalizing a dataframe column of dictionary values into a dataframe/ other-CodePudding

enter image description here

I'm trying to expand a dataframe column of dictionaries into it's own dataframe/other columns. I have already tried using json_normalize, iteration, and list comprehension but for some reason it just returns a blank dataframe. I've attached a link to the CSV I'm working with.

its a csv file with yelp data and the this issue is occurring with all the columns of dictionaries

import matplotlib.pyplot as plt
import pandas as pd
import requests 
from pandas.io.json import json_normalize 
import seaborn as sns
import json
from google.colab import files
import io

uploaded = files.upload()
yelpdf = pd.read_csv(io.BytesIO(uploaded['yelp_reviews.csv']))

print(yelpdf['Ambience'])
df2 = pd.json_normalize(yelpdf['Ambience'])
print(df2.info())
print(df2.shape)
print(df2.head())

CodePudding user response：

I asked access to your file, but in the mean time let's use this example

import pandas as pd

df = pd.DataFrame({"A": [{"B":True, "C":'something'}, {"B": False, "C": "another"}]})

That generate this table:

A
{"B":True, "C":'something'}
{"B": False, "C": "another"}

To convert each key of the dictionaries to a newer column and have that with the original DataFrame you can do:

temp = pd.DataFrame([*df['A']], df.index)
df.join(temp)

With output as

A	B	C
{"B":True, "C":'something'}	True	something
{"B": False, "C": "another"}	False	another

EDIT

Okay now that I have your data I see where's the problem. The same thing I did on the example is valid for you, but in your case when we load the csv the Ambience column comes as string:

import pandas as pd 

df = pd.read_csv('...')
df["Ambience"].values[0]
>>> "{'touristy': False, 'hipster': True, 'romantic': False, 'divey': False, 'intimate': False, 'trendy': True, 'upscale': False, 'classy': False, 'casual': True}"

To use the same approach in the first example we have to transform the string representation of the dictionary to an actual dictionary, to do that we can do:

import ast

df["Ambience"] =  df["Ambience"].fillna("{}") # Filling NaN with an empty dict representation
amb = df["Ambience"].apply(lambda x: ast.literal_eval(x)).copy() # Using ast literal_eval to convert string representation to actual dict
amb = amb.apply(lambda x: x if x else None) # Making empty dicts as None
amb = amb.dropna() # Dropping None's

With that we've created a variable called amb that contains a pandas series of dictionaries with the same index of the original dataframe. So now we can finnaly do:

temp = pd.DataFrame([*amb], amb.index) # Creating the temp dataframe where cols are the keys from the dictionaries
df = df.join(temp) # Joining with original dataframe

CodePudding user response：

The issue is the elements in the 'Ambience' are strings not dictionaries. You just need to convert to dictionaries first before using json_normalize. You can do this using the literal_eval function within the ast Python package.

yelp_df['Ambience'] = yelp_df['Ambience'].apply(lambda x: ast.literal_eval(x) if pd.notnull(x) else x)

There are some nans in the column, so you'll just need to convert only the strings and leave the nan's as is.

Then you can just run your code as normal -

df2 = pd.json_normalize(yelp_df['Ambience'])

Which should yield the dataframe you want.