Fill N/A data based on value in another column-CodePudding

I have a csv file with 2 column store_name and store_location that some store_location is missing. And I want to fill missing data with data in same column based on value in another column.

Below is my csv file:

import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/hoatranobita/app_to_cloud_4/main/store_location.csv')

Here is my expected Output:

I tried to find solutions but still not find out.

Thanks.

CodePudding user response：

TL;DR: providing 3 different approaches in case you want to:

ensure a unique value per group
fill the NaN with the first available value
fill the NaN with the previous/next non-NA row
Looks like you could need a unique value per group. Use groupby.transform('first') to get the first non-NA value:

df['store_location'] = df.groupby('store_name')['store_location'].transform('first')

output:

                             store_name                         store_location
0                       AJ's Liquor III           POINT (-93.648959 42.021456)
1                       AJ's Liquor III           POINT (-93.648959 42.021456)
2                Ambysure Inc / Clinton           POINT (-90.225022 41.833351)
3                Ambysure Inc / Clinton           POINT (-90.225022 41.833351)
4                 Bancroft Liquor Store               POINT (-94.218 43.29355)
5                 Bancroft Liquor Store               POINT (-94.218 43.29355)
6                                Bani's  POINT (-92.455801 42.518018000000005)
7                  Bani's / Cedar Falls  POINT (-92.455801 42.518018000000005)
8                  Bani's / Cedar Falls  POINT (-92.455801 42.518018000000005)
9                      Barrys Mini Mart            POINT (-91.38553 43.050183)
10                 Baxter Family Market           POINT (-93.151465 41.826715)
11             Beecher Liquor / Dubuque  POINT (-90.696886 42.500775000000004)
12           Beer on Floyd / Sioux City  POINT (-96.372185 42.531448000000005)
13                  Beer Thirty Denison           POINT (-95.360162 42.012412)
14  Beer Thirty Storm Lake / Storm Lake           POINT (-95.198584 42.646794)
15  Beer Thirty Storm Lake / Storm Lake           POINT (-95.198584 42.646794)
16  Beer Thirty Storm Lake / Storm Lake           POINT (-95.198584 42.646794)

If there are different values and you want to preserve them, you can replace the NaN with the first non-NA value:

df['store_location'] = df['store_location'].fillna(df.groupby('store_name')['store_location'].transform('first'))

output:

                             store_name                         store_location
0                       AJ's Liquor III           POINT (-93.648959 42.021456)
1                       AJ's Liquor III           POINT (-93.648959 42.021456)
2                Ambysure Inc / Clinton           POINT (-90.225022 41.833351)
3                Ambysure Inc / Clinton           POINT (-90.225022 41.833351)
4                 Bancroft Liquor Store               POINT (-94.218 43.29355)
5                 Bancroft Liquor Store               POINT (-94.218 43.29355)
6                                Bani's  POINT (-92.455801 42.518018000000005)
7                  Bani's / Cedar Falls  POINT (-92.455801 42.518018000000005)
8                  Bani's / Cedar Falls  POINT (-92.455801 42.518018000000005)
9                      Barrys Mini Mart            POINT (-91.38553 43.050183)
10                 Baxter Family Market           POINT (-93.151465 41.826715)
11             Beecher Liquor / Dubuque  POINT (-90.696886 42.500775000000004)
12           Beer on Floyd / Sioux City  POINT (-96.372185 42.531448000000005)
13                  Beer Thirty Denison           POINT (-95.360162 42.012412)
14  Beer Thirty Storm Lake / Storm Lake           POINT (-95.198584 42.646794)
15  Beer Thirty Storm Lake / Storm Lake   POINT (-95.19941700000001 42.647498)
16  Beer Thirty Storm Lake / Storm Lake           POINT (-95.198584 42.646794)

Alternatively, use the previous/next non-NA values per group with ffill bfill:

df['store_location'] = df.groupby('store_name')['store_location'].transform(lambda g: g.ffill().bfill())

output:

                             store_name                         store_location
0                       AJ's Liquor III           POINT (-93.648959 42.021456)
1                       AJ's Liquor III           POINT (-93.648959 42.021456)
2                Ambysure Inc / Clinton           POINT (-90.225022 41.833351)
3                Ambysure Inc / Clinton           POINT (-90.225022 41.833351)
4                 Bancroft Liquor Store               POINT (-94.218 43.29355)
5                 Bancroft Liquor Store               POINT (-94.218 43.29355)
6                                Bani's  POINT (-92.455801 42.518018000000005)
7                  Bani's / Cedar Falls  POINT (-92.455801 42.518018000000005)
8                  Bani's / Cedar Falls  POINT (-92.455801 42.518018000000005)
9                      Barrys Mini Mart            POINT (-91.38553 43.050183)
10                 Baxter Family Market           POINT (-93.151465 41.826715)
11             Beecher Liquor / Dubuque  POINT (-90.696886 42.500775000000004)
12           Beer on Floyd / Sioux City  POINT (-96.372185 42.531448000000005)
13                  Beer Thirty Denison           POINT (-95.360162 42.012412)
14  Beer Thirty Storm Lake / Storm Lake           POINT (-95.198584 42.646794)
15  Beer Thirty Storm Lake / Storm Lake   POINT (-95.19941700000001 42.647498)
16  Beer Thirty Storm Lake / Storm Lake   POINT (-95.19941700000001 42.647498)