Home > Mobile >  Creating/filling none existing groups in after group-by using PyArrow/pandas
Creating/filling none existing groups in after group-by using PyArrow/pandas

Time:09-06

Let's say I have a data set

date sparrow type var
2022-05-01 red 100
2022-05-01 blue 100
2022-05-02 red 20
2022-05-03 blue 30
2022-05-03 green 40

and I want to fill missing [date,apple type] like so:

  • if it apeared in the past (before the missing date) have the same value as the most updated date.
  • if it exists in the future but not in the past put null in its place.

Result should we somthing like

date sparrow type var
2022-05-01 red 100
2022-05-01 blue 100
2022-05-01 green Null
2022-05-02 red 20
2022-05-02 blue 100
2022-05-02 green Null
2022-05-03 red 20
2022-05-03 blue 30
2022-05-03 green 40

I would like to do it in pyArrow or pandas, any ideas?

CodePudding user response:

You first need to create all combinations of your date and sparrow_type columns using itertools.product. Assuming your data are in df:

import pandas as pd
import itertools


all_dates = df.date.unique()
all_sparrow_types = df.sparrow_type.unique()
all_combinations = pd.DataFrame(columns=['date', 'sparrow_type'], data = list(itertools.product(all_dates, all_sparrow_types)))

all_combinations is a pd.DataFrame that contains all the combinations of the date and sparrow_type columns. It can now be merged with the original df:

df_out = all_combinations.merge(df, how='left')

Finally you can forward fill the NAs per group in the var column. Be sure to sort the dates/sparrow_types first:

df_out = df_out.sort_values(by=['date', 'sparrow_type'])
df_out['var'] = df_out.groupby(['sparrow_type'])['var'].fillna(method='ffill')

CodePudding user response:

@notiv's solution is compact.

If you need more flexibility for complex processing logic, then you can consider using a function for that.

The following solution:

  • Groups data by date (it is easy to process).
  • Fills missing values using required logic.
  • Unrolls data grouped earlier.

Logic wise it achieves the desired results.

import pandas as pd

data = [
  ("2022-05-01", "red", 100),
  ("2022-05-01", "blue", 100),
  ("2022-05-02", "red", 20),
  ("2022-05-03", "blue", 30),
  ("2022-05-03", "green", 40),
]

df = pd.DataFrame(data=data, columns=["date", "type", "var"])

Group data by date as arrays:

df = df.groupby("date") \
  .agg( \
       type_arr=("type", list), \
       var_arr=("var", list), \
      ) \
  .reset_index()

print(df)
         date       type_arr     var_arr
0  2022-05-01    [red, blue]  [100, 100]
1  2022-05-02          [red]        [20]
2  2022-05-03  [blue, green]    [30, 40]

Get all distinct types:

all_types = set([typ for sublist in df["type_arr"].to_list() for typ in sublist])
print(all_types)
>> {'blue', 'green', 'red'}

Fill missing values using required logic:

def fill_missing(row):
  new_type_arr = []   row["type_arr"]
  new_var_arr = []   row["var_arr"]
  for typ in all_types:
    if typ not in row["type_arr"]:
      latest_rows = df[[(typ in typ_arr) for typ_arr in df["type_arr"]]]
      latest_row = latest_rows[latest_rows.index < row.name][-1:]
      latest_value = None
      if not latest_row.empty:
        type_idx = latest_row["type_arr"].values.tolist()[0].index(typ)
        latest_value = latest_row["var_arr"].values.tolist()[0][type_idx]
      new_type_arr.append(typ)
      new_var_arr.append(latest_value)
  return pd.Series([new_type_arr, new_var_arr])
# 
df[["new_type_arr", "new_var_arr"]] = df.apply(fill_missing, axis=1)

print(df)
         date       type_arr     var_arr        new_type_arr       new_var_arr
0  2022-05-01    [red, blue]  [100, 100]  [red, blue, green]  [100, 100, None]
1  2022-05-02          [red]        [20]  [red, blue, green]   [20, 100, None]
2  2022-05-03  [blue, green]    [30, 40]  [blue, green, red]      [30, 40, 20]

Unroll array into individual values:

df = df[["date", "new_type_arr", "new_var_arr"]]
df = df.set_index(["date"]).apply(pd.Series.explode).reset_index()

print(df)
         date new_type_arr new_var_arr
0  2022-05-01          red         100
1  2022-05-01         blue         100
2  2022-05-01        green        None
3  2022-05-02          red          20
4  2022-05-02         blue         100
5  2022-05-02        green        None
6  2022-05-03         blue          30
7  2022-05-03        green          40
8  2022-05-03          red          20
  • Related