I'm trying to expand a dataframe column of dictionaries into it's own dataframe/other columns. I have already tried using json_normalize, iteration, and list comprehension but for some reason it just returns a blank dataframe. I've attached a link to the CSV I'm working with.
its a csv file with yelp data and the this issue is occurring with all the columns of dictionaries
import matplotlib.pyplot as plt
import pandas as pd
import requests
from pandas.io.json import json_normalize
import seaborn as sns
import json
from google.colab import files
import io
uploaded = files.upload()
yelpdf = pd.read_csv(io.BytesIO(uploaded['yelp_reviews.csv']))
print(yelpdf['Ambience'])
df2 = pd.json_normalize(yelpdf['Ambience'])
print(df2.info())
print(df2.shape)
print(df2.head())
CodePudding user response:
I asked access to your file, but in the mean time let's use this example
import pandas as pd
df = pd.DataFrame({"A": [{"B":True, "C":'something'}, {"B": False, "C": "another"}]})
That generate this table:
A |
---|
{"B":True, "C":'something'} |
{"B": False, "C": "another"} |
To convert each key of the dictionaries to a newer column and have that with the original DataFrame you can do:
temp = pd.DataFrame([*df['A']], df.index)
df.join(temp)
With output as
A | B | C |
---|---|---|
{"B":True, "C":'something'} | True | something |
{"B": False, "C": "another"} | False | another |
EDIT
Okay now that I have your data I see where's the problem. The same thing I did on the example is valid for you, but in your case when we load the csv the Ambience column comes as string:
import pandas as pd
df = pd.read_csv('...')
df["Ambience"].values[0]
>>> "{'touristy': False, 'hipster': True, 'romantic': False, 'divey': False, 'intimate': False, 'trendy': True, 'upscale': False, 'classy': False, 'casual': True}"
To use the same approach in the first example we have to transform the string representation of the dictionary to an actual dictionary, to do that we can do:
import ast
df["Ambience"] = df["Ambience"].fillna("{}") # Filling NaN with an empty dict representation
amb = df["Ambience"].apply(lambda x: ast.literal_eval(x)).copy() # Using ast literal_eval to convert string representation to actual dict
amb = amb.apply(lambda x: x if x else None) # Making empty dicts as None
amb = amb.dropna() # Dropping None's
With that we've created a variable called amb
that contains a pandas series of dictionaries with the same index of the original dataframe. So now we can finnaly do:
temp = pd.DataFrame([*amb], amb.index) # Creating the temp dataframe where cols are the keys from the dictionaries
df = df.join(temp) # Joining with original dataframe
CodePudding user response:
The issue is the elements in the 'Ambience' are strings not dictionaries. You just need to convert to dictionaries first before using json_normalize
. You can do this using the literal_eval
function within the ast Python package.
yelp_df['Ambience'] = yelp_df['Ambience'].apply(lambda x: ast.literal_eval(x) if pd.notnull(x) else x)
There are some nans in the column, so you'll just need to convert only the strings and leave the nan's as is.
Then you can just run your code as normal -
df2 = pd.json_normalize(yelp_df['Ambience'])
Which should yield the dataframe you want.