My pandas DataFrame has a few missing and bad values. I'd like to replace / fill this by parse data from a dictionary stored in a pandas series. Here's an example:
import pandas as pd
df = pd.DataFrame({'Addr': ['123 Street, City, 85036', '234 Street1, City, 85036', '542js'],
'Lat': [32.312, 33.312, np.nan],
'CL': [{'street':'134 Street name',
'city':'City name',
'zip':'85312',
'latitude': 34.661056,
'longitude': -118.146189},
{'street':'134 Street name',
'city':'City name',
'zip':'85312',
'latitude': 34.661056,
'longitude': -118.146189},
{'street':'134 Str',
'city':'phx',
'zip':'85312',
'latitude': 34.661056,
'longitude': -118.146189}]
})
For rows where Lat
is np.nan
, I'd like to parse the data from CL
column. After filling the data from dict, the 2 columns of the row would look like this:
Addr Lat
134 Str phx 85312 34.661056
p.s In reality, the dict is fairly long. So, I'd prefer a way to extract only the values that are needed, in this case Lat
and street
, city
and zip
which makes up the Addr
column.
CodePudding user response:
You can normalize the 'CL' column and join the newly created columns to 'Addr' and 'Lat'. Then change the values of Lat
to 'latitude' where it's np.nan
:
df = df[['Addr', 'Lat']].join(pd.json_normalize(df['CL']))
df.loc[df['Lat'].isna(), 'Lat'] = df.loc[df['Lat'].isna(), 'latitude']
print(df)
Output:
Addr Lat street city zip latitude longitude
0 123 Street, City, 85036 32.312000 134 Street name City name 85312 34.661056 -118.146189
1 234 Street1, City, 85036 33.312000 134 Street name City name 85312 34.661056 -118.146189
2 542js 34.661056 134 Str phx 85312 34.661056 -118.146189
Edit: after reading your comments and edited question, it seems you don't want to build such a huge df, rather work from your dictionnary:
your_dict = {'Addr': ['123 Street, City, 85036', '234 Street1, City, 85036', '542js'],
'Lat': [32.312, 33.312, np.nan],
'CL': [{'street':'134 Street name',
'city':'City name',
'zip':'85312',
'latitude': 34.661056,
'longitude': -118.146189},
{'street':'134 Street name',
'city':'City name',
'zip':'85312',
'latitude': 34.661056,
'longitude': -118.146189},
{'street':'134 Str',
'city':'phx',
'zip':'85312',
'latitude': 34.661056,
'longitude': -118.146189}]
}
df_lat = pd.Series(your_dict['Lat'])
df_cl = pd.DataFrame(your_dict['CL'])
print(df_cl.loc[df_lat.isna(), ['latitude', 'street', 'city', 'zip']])
That way only rows with 'Lat' initially equal to np.nan
will be considered:
latitude street city zip
2 34.661056 134 Str phx 85312
CodePudding user response:
If it's only that one column you want then
>>> df['Lat'] = df['Lat'].fillna(pd.DataFrame(df['CL'].tolist())['latitude'])
>>> df
Addr Lat CL
0 123 Street, City, 85036 32.312000 {'street': '134 Street name', 'city': 'City na...
1 234 Street1, City, 85036 33.312000 {'street': '134 Street name', 'city': 'City na...
2 542js 34.661056 {'street': '134 Str', 'city': 'phx', 'zip': '8...
If the dict is too long for memory then you can parse it with a for loop, convert to df and then fillna
keys = []
for i in df['CL'].tolist():
keys.append({'Lat': i['Lat'], 'street': i['street'],'city': i['city'],'zip': i['zip']})
ddf = pd.DataFrame(keys)