Home > front end >  pandas - fill in default values for missing records
pandas - fill in default values for missing records

Time:05-23

I would like to fill in missing data in my dataset in the following way:
My data can contain one of the following shapes:

['circle', 'square', 'triangle']

Each shape has width and height attributes,
I would like to fill in missing data: in case my dataset is missing one of the shapes, it will add it with default values
(default values for the example: 'height' = 10, , 'width'= 5)

example: if this is my dataset:

data = [
 {'shape': 'circle', 'height': 5, 'width': 4},
 {'shape': 'circle', 'height': 2, 'width': 3},
 {'shape': 'square', 'height': 6, , 'width': 5}
]

missing: triangle, fill it with default width & height,
result will be:

data = [
 {'shape': 'circle', 'height': 5, 'width': 4},
 {'shape': 'circle', 'height': 2, 'width': 3},
 {'shape': 'square', 'height': 6, 'width': 2},
 {'shape': 'triangle', 'height': 10, 'width': 5} # added by code with default values
]

my code:

shapes = ['circle', 'square', 'triangle']
df = pd.DataFrame(data)
# df.?? some code to fill in missing data

CodePudding user response:

I would add the entries with missing values to the data before making the dataframe, so that it already has the right number of rows. Then you can fill in the missing values by passing a dictionary with the default values to df.fillna():

import pandas as pd

data = [
 {'shape': 'circle', 'height': 5, 'width': 4},
 {'shape': 'circle', 'height': 2, 'width': 3},
 {'shape': 'square', 'height': 6, 'width': 5}
]

shapes = ['circle', 'square', 'triangle']

shapes_present = {d['shape'] for d in data}
shapes_missing = set(shapes) - shapes_present

for shape in shapes_missing:
    data.append({'shape': shape})
    
df = pd.DataFrame(data)
df.fillna({'height': 10, 'width': 5}, inplace=True)
df
    shape       height  width
0   circle      5.0     4.0
1   circle      2.0     3.0
2   square      6.0     5.0
3   triangle    10.0    5.0

CodePudding user response:

You could make another df with the default values, something like

defaults = pd.DataFrame({'shape': shapes, 'height': 10, 'width': 5})

# result
      shape  height  width
0    circle      10      5
1    square      10      5
2  triangle      10      5

Then concatenate the original df with the rows from the defaults that are not yet present in df:

pd.concat([df, defaults[~defaults['shape'].isin(df['shape'])]], 
          ignore_index=True)

# result
      shape  height  width
0    circle       5      4
1    circle       2      3
2    square       6      5
3  triangle      10      5
  • Related