Parse data from a dict with condition - pandas dataframe-CodePudding

My pandas DataFrame has a few missing and bad values. I'd like to replace / fill this by parse data from a dictionary stored in a pandas series. Here's an example:

import pandas as pd

df = pd.DataFrame({'Addr': ['123 Street, City, 85036', '234 Street1, City, 85036', '542js'],
                   'Lat': [32.312, 33.312, np.nan],
                   'CL':  [{'street':'134 Street name',
                            'city':'City name',
                            'zip':'85312',
                            'latitude': 34.661056,
                            'longitude': -118.146189},

                           {'street':'134 Street name',
                            'city':'City name',
                            'zip':'85312',
                            'latitude': 34.661056,
                            'longitude': -118.146189},
                          
                           {'street':'134 Str',
                            'city':'phx',
                            'zip':'85312',
                            'latitude': 34.661056,
                            'longitude': -118.146189}]

                  })

For rows where Lat is np.nan, I'd like to parse the data from CL column. After filling the data from dict, the 2 columns of the row would look like this:

Addr               Lat

134 Str phx 85312  34.661056

p.s In reality, the dict is fairly long. So, I'd prefer a way to extract only the values that are needed, in this case Lat and street, city and zip which makes up the Addr column.

CodePudding user response：

You can normalize the 'CL' column and join the newly created columns to 'Addr' and 'Lat'. Then change the values of Lat to 'latitude' where it's np.nan:

df = df[['Addr', 'Lat']].join(pd.json_normalize(df['CL']))
df.loc[df['Lat'].isna(), 'Lat'] = df.loc[df['Lat'].isna(), 'latitude']
print(df)

Output:

                       Addr        Lat           street       city    zip   latitude   longitude
0   123 Street, City, 85036  32.312000  134 Street name  City name  85312  34.661056 -118.146189
1  234 Street1, City, 85036  33.312000  134 Street name  City name  85312  34.661056 -118.146189
2                     542js  34.661056          134 Str        phx  85312  34.661056 -118.146189

Edit: after reading your comments and edited question, it seems you don't want to build such a huge df, rather work from your dictionnary:

your_dict = {'Addr': ['123 Street, City, 85036', '234 Street1, City, 85036', '542js'],
                   'Lat': [32.312, 33.312, np.nan],
                   'CL':  [{'street':'134 Street name',
                            'city':'City name',
                            'zip':'85312',
                            'latitude': 34.661056,
                            'longitude': -118.146189},

                           {'street':'134 Street name',
                            'city':'City name',
                            'zip':'85312',
                            'latitude': 34.661056,
                            'longitude': -118.146189},
                          
                           {'street':'134 Str',
                            'city':'phx',
                            'zip':'85312',
                            'latitude': 34.661056,
                            'longitude': -118.146189}]
                  }

df_lat = pd.Series(your_dict['Lat'])
df_cl = pd.DataFrame(your_dict['CL'])  
print(df_cl.loc[df_lat.isna(), ['latitude', 'street', 'city', 'zip']])

That way only rows with 'Lat' initially equal to np.nan will be considered:

    latitude   street city    zip
2  34.661056  134 Str  phx  85312

CodePudding user response：

If it's only that one column you want then

>>> df['Lat'] = df['Lat'].fillna(pd.DataFrame(df['CL'].tolist())['latitude'])

>>> df
                       Addr        Lat                                                 CL
0   123 Street, City, 85036  32.312000  {'street': '134 Street name', 'city': 'City na...
1  234 Street1, City, 85036  33.312000  {'street': '134 Street name', 'city': 'City na...
2                     542js  34.661056  {'street': '134 Str', 'city': 'phx', 'zip': '8...

If the dict is too long for memory then you can parse it with a for loop, convert to df and then fillna

keys = []
for i in df['CL'].tolist():
    keys.append({'Lat': i['Lat'], 'street': i['street'],'city': i['city'],'zip': i['zip']})

ddf = pd.DataFrame(keys)