Home > Blockchain >  Parse data from a dict with condition - pandas dataframe
Parse data from a dict with condition - pandas dataframe

Time:04-11

My pandas DataFrame has a few missing and bad values. I'd like to replace / fill this by parse data from a dictionary stored in a pandas series. Here's an example:

import pandas as pd

df = pd.DataFrame({'Addr': ['123 Street, City, 85036', '234 Street1, City, 85036', '542js'],
                   'Lat': [32.312, 33.312, np.nan],
                   'CL':  [{'street':'134 Street name',
                            'city':'City name',
                            'zip':'85312',
                            'latitude': 34.661056,
                            'longitude': -118.146189},

                           {'street':'134 Street name',
                            'city':'City name',
                            'zip':'85312',
                            'latitude': 34.661056,
                            'longitude': -118.146189},
                          
                           {'street':'134 Str',
                            'city':'phx',
                            'zip':'85312',
                            'latitude': 34.661056,
                            'longitude': -118.146189}]

                  })

For rows where Lat is np.nan, I'd like to parse the data from CL column. After filling the data from dict, the 2 columns of the row would look like this:

Addr               Lat

134 Str phx 85312  34.661056 

p.s In reality, the dict is fairly long. So, I'd prefer a way to extract only the values that are needed, in this case Lat and street, city and zip which makes up the Addr column.

CodePudding user response:

You can normalize the 'CL' column and join the newly created columns to 'Addr' and 'Lat'. Then change the values of Lat to 'latitude' where it's np.nan:

df = df[['Addr', 'Lat']].join(pd.json_normalize(df['CL']))
df.loc[df['Lat'].isna(), 'Lat'] = df.loc[df['Lat'].isna(), 'latitude']
print(df)

Output:

                       Addr        Lat           street       city    zip   latitude   longitude
0   123 Street, City, 85036  32.312000  134 Street name  City name  85312  34.661056 -118.146189
1  234 Street1, City, 85036  33.312000  134 Street name  City name  85312  34.661056 -118.146189
2                     542js  34.661056          134 Str        phx  85312  34.661056 -118.146189

Edit: after reading your comments and edited question, it seems you don't want to build such a huge df, rather work from your dictionnary:

your_dict = {'Addr': ['123 Street, City, 85036', '234 Street1, City, 85036', '542js'],
                   'Lat': [32.312, 33.312, np.nan],
                   'CL':  [{'street':'134 Street name',
                            'city':'City name',
                            'zip':'85312',
                            'latitude': 34.661056,
                            'longitude': -118.146189},

                           {'street':'134 Street name',
                            'city':'City name',
                            'zip':'85312',
                            'latitude': 34.661056,
                            'longitude': -118.146189},
                          
                           {'street':'134 Str',
                            'city':'phx',
                            'zip':'85312',
                            'latitude': 34.661056,
                            'longitude': -118.146189}]
                  }

df_lat = pd.Series(your_dict['Lat'])
df_cl = pd.DataFrame(your_dict['CL'])  
print(df_cl.loc[df_lat.isna(), ['latitude', 'street', 'city', 'zip']])

That way only rows with 'Lat' initially equal to np.nan will be considered:

    latitude   street city    zip
2  34.661056  134 Str  phx  85312

CodePudding user response:

If it's only that one column you want then

>>> df['Lat'] = df['Lat'].fillna(pd.DataFrame(df['CL'].tolist())['latitude'])

>>> df
                       Addr        Lat                                                 CL
0   123 Street, City, 85036  32.312000  {'street': '134 Street name', 'city': 'City na...
1  234 Street1, City, 85036  33.312000  {'street': '134 Street name', 'city': 'City na...
2                     542js  34.661056  {'street': '134 Str', 'city': 'phx', 'zip': '8...

If the dict is too long for memory then you can parse it with a for loop, convert to df and then fillna

keys = []
for i in df['CL'].tolist():
    keys.append({'Lat': i['Lat'], 'street': i['street'],'city': i['city'],'zip': i['zip']})

ddf = pd.DataFrame(keys)
  • Related