pandas function to check if there exist non-NA values for the same ids?-CodePudding

Assume I have a dataset that contains around 100 000 rows and 50 columns. I have information about the sellers and their products. The part of the dataset will look somehow like this:

seller_id	product_id	seller_is_checked	size	color
A100	UN76UH	1	uni size	red
B200	HJHLI90	0	small	blue
C300	UUKB89	0	large	green
<...>	<...>	<...>	<...>	<...>
A100	BxYJHG	NA	medium	purple
AXYZ215	HHIOTY	1	large	unknown

In the table you can see that there are at least two seller_ids as these seller has several products. However, this time there was a mistake while entering the data and the information whether the seller_is_checked got missing.

Is there a function in Python/pandas that will help to look through the data set and substitute the missing value with the actual one from the same data set?

CodePudding user response：

You can do this using pandas for example:

import pandas as pd

# Read the data into DataFrame which is basically a two dimensional array
df = pd.read_csv("you_csv_file.csv")

# Print if there are null values
print(df.isna().sum())

CodePudding user response：

You can solve this by creating a dictionary for seller_id and then updating the seller_is_checked info. Follow me (assuming you are using pandas):

1 - remove lines where seller_is_checked info is missing and create a new dataset, seller_dict_df with the results

seller_dict_df = df.dropna()

2 - create the dictionary

seller_dict = dict(
    zip(
        seller_dict_df['seller_id'], seller_dict_df['seller_is_checked']
        )
    )

3 - update the original table

df['seller_is_checked'] = df['seller_is_checked'].replace(seller_dict)

CodePudding user response：

A possible solution, which is based on the idea of filling downwards and then upwards the missing values with a valid observation in each group of seller_id (pandas.DataFrame.ffill and pandas.DataFrame.bfill):

df.seller_is_checked = df.groupby('seller_id')['seller_is_checked'].ffill().bfill()
print(df)

Output:

  seller_id product_id  seller_is_checked      size    color
0      A100     UN76UH                1.0  uni size      red
1      B200    HJHLI90                0.0     small     blue
2      C300     UUKB89                0.0     large    green
3      A100     BxYJHG                1.0    medium   purple
4   AXYZ215     HHIOTY                1.0     large  unknown