Home > OS >  pandas function to check if there exist non-NA values for the same ids?
pandas function to check if there exist non-NA values for the same ids?

Time:08-16

Assume I have a dataset that contains around 100 000 rows and 50 columns. I have information about the sellers and their products. The part of the dataset will look somehow like this:

seller_id product_id seller_is_checked size color
A100 UN76UH 1 uni size red
B200 HJHLI90 0 small blue
C300 UUKB89 0 large green
<...> <...> <...> <...> <...>
A100 BxYJHG NA medium purple
AXYZ215 HHIOTY 1 large unknown

In the table you can see that there are at least two seller_ids as these seller has several products. However, this time there was a mistake while entering the data and the information whether the seller_is_checked got missing.

Is there a function in Python/pandas that will help to look through the data set and substitute the missing value with the actual one from the same data set?

CodePudding user response:

You can do this using pandas for example:

import pandas as pd

# Read the data into DataFrame which is basically a two dimensional array
df = pd.read_csv("you_csv_file.csv")

# Print if there are null values
print(df.isna().sum())

CodePudding user response:

You can solve this by creating a dictionary for seller_id and then updating the seller_is_checked info. Follow me (assuming you are using pandas):

1 - remove lines where seller_is_checked info is missing and create a new dataset, seller_dict_df with the results

seller_dict_df = df.dropna()

2 - create the dictionary

seller_dict = dict(
    zip(
        seller_dict_df['seller_id'], seller_dict_df['seller_is_checked']
        )
    )

3 - update the original table

df['seller_is_checked'] = df['seller_is_checked'].replace(seller_dict)

CodePudding user response:

A possible solution, which is based on the idea of filling downwards and then upwards the missing values with a valid observation in each group of seller_id (pandas.DataFrame.ffill and pandas.DataFrame.bfill):

df.seller_is_checked = df.groupby('seller_id')['seller_is_checked'].ffill().bfill()
print(df)

Output:

  seller_id product_id  seller_is_checked      size    color
0      A100     UN76UH                1.0  uni size      red
1      B200    HJHLI90                0.0     small     blue
2      C300     UUKB89                0.0     large    green
3      A100     BxYJHG                1.0    medium   purple
4   AXYZ215     HHIOTY                1.0     large  unknown
  • Related