I would like to fill in missing data in my dataset in the following way:
My data can contain one of the following shapes:
['circle', 'square', 'triangle']
Each shape has width and height attributes,
I would like to fill in missing data: in case my dataset is missing one of the shapes, it will add it with default values
(default values for the example: 'height' = 10, , 'width'= 5)
example: if this is my dataset:
data = [
{'shape': 'circle', 'height': 5, 'width': 4},
{'shape': 'circle', 'height': 2, 'width': 3},
{'shape': 'square', 'height': 6, , 'width': 5}
]
missing: triangle, fill it with default width & height,
result will be:
data = [
{'shape': 'circle', 'height': 5, 'width': 4},
{'shape': 'circle', 'height': 2, 'width': 3},
{'shape': 'square', 'height': 6, 'width': 2},
{'shape': 'triangle', 'height': 10, 'width': 5} # added by code with default values
]
my code:
shapes = ['circle', 'square', 'triangle']
df = pd.DataFrame(data)
# df.?? some code to fill in missing data
CodePudding user response:
I would add the entries with missing values to the data before making the dataframe, so that it already has the right number of rows. Then you can fill in the missing values by passing a dictionary with the default values to df.fillna()
:
import pandas as pd
data = [
{'shape': 'circle', 'height': 5, 'width': 4},
{'shape': 'circle', 'height': 2, 'width': 3},
{'shape': 'square', 'height': 6, 'width': 5}
]
shapes = ['circle', 'square', 'triangle']
shapes_present = {d['shape'] for d in data}
shapes_missing = set(shapes) - shapes_present
for shape in shapes_missing:
data.append({'shape': shape})
df = pd.DataFrame(data)
df.fillna({'height': 10, 'width': 5}, inplace=True)
df
shape height width
0 circle 5.0 4.0
1 circle 2.0 3.0
2 square 6.0 5.0
3 triangle 10.0 5.0
CodePudding user response:
You could make another df with the default values, something like
defaults = pd.DataFrame({'shape': shapes, 'height': 10, 'width': 5})
# result
shape height width
0 circle 10 5
1 square 10 5
2 triangle 10 5
Then concatenate the original df with the rows from the defaults
that are not yet present in df
:
pd.concat([df, defaults[~defaults['shape'].isin(df['shape'])]],
ignore_index=True)
# result
shape height width
0 circle 5 4
1 circle 2 3
2 square 6 5
3 triangle 10 5