Creating/filling none existing groups in after group-by using PyArrow/pandas-CodePudding

Let's say I have a data set

date	sparrow type	var
2022-05-01	red	100
2022-05-01	blue	100
2022-05-02	red	20
2022-05-03	blue	30
2022-05-03	green	40

and I want to fill missing [date,apple type] like so:

if it apeared in the past (before the missing date) have the same value as the most updated date.
if it exists in the future but not in the past put null in its place.

Result should we somthing like

date	sparrow type	var
2022-05-01	red	100
2022-05-01	blue	100
2022-05-01	green	Null
2022-05-02	red	20
2022-05-02	blue	100
2022-05-02	green	Null
2022-05-03	red	20
2022-05-03	blue	30
2022-05-03	green	40

I would like to do it in pyArrow or pandas, any ideas?

CodePudding user response：

You first need to create all combinations of your date and sparrow_type columns using itertools.product. Assuming your data are in df:

import pandas as pd
import itertools


all_dates = df.date.unique()
all_sparrow_types = df.sparrow_type.unique()
all_combinations = pd.DataFrame(columns=['date', 'sparrow_type'], data = list(itertools.product(all_dates, all_sparrow_types)))

all_combinations is a pd.DataFrame that contains all the combinations of the date and sparrow_type columns. It can now be merged with the original df:

df_out = all_combinations.merge(df, how='left')

Finally you can forward fill the NAs per group in the var column. Be sure to sort the dates/sparrow_types first:

df_out = df_out.sort_values(by=['date', 'sparrow_type'])
df_out['var'] = df_out.groupby(['sparrow_type'])['var'].fillna(method='ffill')

CodePudding user response：

@notiv's solution is compact.

If you need more flexibility for complex processing logic, then you can consider using a function for that.

The following solution:

Groups data by date (it is easy to process).
Fills missing values using required logic.
Unrolls data grouped earlier.

Logic wise it achieves the desired results.

import pandas as pd

data = [
  ("2022-05-01", "red", 100),
  ("2022-05-01", "blue", 100),
  ("2022-05-02", "red", 20),
  ("2022-05-03", "blue", 30),
  ("2022-05-03", "green", 40),
]

df = pd.DataFrame(data=data, columns=["date", "type", "var"])

Group data by date as arrays:

df = df.groupby("date") \
  .agg( \
       type_arr=("type", list), \
       var_arr=("var", list), \
      ) \
  .reset_index()

print(df)
         date       type_arr     var_arr
0  2022-05-01    [red, blue]  [100, 100]
1  2022-05-02          [red]        [20]
2  2022-05-03  [blue, green]    [30, 40]

Get all distinct types:

all_types = set([typ for sublist in df["type_arr"].to_list() for typ in sublist])
print(all_types)
>> {'blue', 'green', 'red'}

Fill missing values using required logic:

def fill_missing(row):
  new_type_arr = []   row["type_arr"]
  new_var_arr = []   row["var_arr"]
  for typ in all_types:
    if typ not in row["type_arr"]:
      latest_rows = df[[(typ in typ_arr) for typ_arr in df["type_arr"]]]
      latest_row = latest_rows[latest_rows.index < row.name][-1:]
      latest_value = None
      if not latest_row.empty:
        type_idx = latest_row["type_arr"].values.tolist()[0].index(typ)
        latest_value = latest_row["var_arr"].values.tolist()[0][type_idx]
      new_type_arr.append(typ)
      new_var_arr.append(latest_value)
  return pd.Series([new_type_arr, new_var_arr])
# 
df[["new_type_arr", "new_var_arr"]] = df.apply(fill_missing, axis=1)

print(df)
         date       type_arr     var_arr        new_type_arr       new_var_arr
0  2022-05-01    [red, blue]  [100, 100]  [red, blue, green]  [100, 100, None]
1  2022-05-02          [red]        [20]  [red, blue, green]   [20, 100, None]
2  2022-05-03  [blue, green]    [30, 40]  [blue, green, red]      [30, 40, 20]

Unroll array into individual values:

df = df[["date", "new_type_arr", "new_var_arr"]]
df = df.set_index(["date"]).apply(pd.Series.explode).reset_index()

print(df)
         date new_type_arr new_var_arr
0  2022-05-01          red         100
1  2022-05-01         blue         100
2  2022-05-01        green        None
3  2022-05-02          red          20
4  2022-05-02         blue         100
5  2022-05-02        green        None
6  2022-05-03         blue          30
7  2022-05-03        green          40
8  2022-05-03          red          20