Assume I have a dataset that contains around 100 000 rows and 50 columns. I have information about the sellers and their products. The part of the dataset will look somehow like this:
seller_id | product_id | seller_is_checked | size | color |
---|---|---|---|---|
A100 | UN76UH | 1 | uni size | red |
B200 | HJHLI90 | 0 | small | blue |
C300 | UUKB89 | 0 | large | green |
<...> | <...> | <...> | <...> | <...> |
A100 | BxYJHG | NA | medium | purple |
AXYZ215 | HHIOTY | 1 | large | unknown |
In the table you can see that there are at least two seller_id
s as these seller has several products. However, this time there was a mistake while entering the data and the information whether the seller_is_checked
got missing.
Is there a function in Python/pandas that will help to look through the data set and substitute the missing value with the actual one from the same data set?
CodePudding user response:
You can do this using pandas for example:
import pandas as pd
# Read the data into DataFrame which is basically a two dimensional array
df = pd.read_csv("you_csv_file.csv")
# Print if there are null values
print(df.isna().sum())
CodePudding user response:
You can solve this by creating a dictionary for seller_id and then updating the seller_is_checked info. Follow me (assuming you are using pandas):
1 - remove lines where seller_is_checked info is missing and create a new dataset, seller_dict_df with the results
seller_dict_df = df.dropna()
2 - create the dictionary
seller_dict = dict(
zip(
seller_dict_df['seller_id'], seller_dict_df['seller_is_checked']
)
)
3 - update the original table
df['seller_is_checked'] = df['seller_is_checked'].replace(seller_dict)
CodePudding user response:
A possible solution, which is based on the idea of filling downwards and then upwards the missing values with a valid observation in each group of seller_id
(pandas.DataFrame.ffill
and pandas.DataFrame.bfill
):
df.seller_is_checked = df.groupby('seller_id')['seller_is_checked'].ffill().bfill()
print(df)
Output:
seller_id product_id seller_is_checked size color
0 A100 UN76UH 1.0 uni size red
1 B200 HJHLI90 0.0 small blue
2 C300 UUKB89 0.0 large green
3 A100 BxYJHG 1.0 medium purple
4 AXYZ215 HHIOTY 1.0 large unknown