Let's say I have a data set
date | sparrow type | var |
---|---|---|
2022-05-01 | red | 100 |
2022-05-01 | blue | 100 |
2022-05-02 | red | 20 |
2022-05-03 | blue | 30 |
2022-05-03 | green | 40 |
and I want to fill missing [date,apple type] like so:
- if it apeared in the past (before the missing date) have the same value as the most updated date.
- if it exists in the future but not in the past put null in its place.
Result should we somthing like
date | sparrow type | var |
---|---|---|
2022-05-01 | red | 100 |
2022-05-01 | blue | 100 |
2022-05-01 | green | Null |
2022-05-02 | red | 20 |
2022-05-02 | blue | 100 |
2022-05-02 | green | Null |
2022-05-03 | red | 20 |
2022-05-03 | blue | 30 |
2022-05-03 | green | 40 |
I would like to do it in pyArrow or pandas, any ideas?
CodePudding user response:
You first need to create all combinations of your date
and sparrow_type
columns using itertools.product
. Assuming your data are in df
:
import pandas as pd
import itertools
all_dates = df.date.unique()
all_sparrow_types = df.sparrow_type.unique()
all_combinations = pd.DataFrame(columns=['date', 'sparrow_type'], data = list(itertools.product(all_dates, all_sparrow_types)))
all_combinations
is a pd.DataFrame
that contains all the combinations of the date
and sparrow_type
columns. It can now be merged with the original df:
df_out = all_combinations.merge(df, how='left')
Finally you can forward fill the NA
s per group in the var
column. Be sure to sort the dates/sparrow_types first:
df_out = df_out.sort_values(by=['date', 'sparrow_type'])
df_out['var'] = df_out.groupby(['sparrow_type'])['var'].fillna(method='ffill')
CodePudding user response:
@notiv's solution is compact.
If you need more flexibility for complex processing logic, then you can consider using a function for that.
The following solution:
- Groups data by date (it is easy to process).
- Fills missing values using required logic.
- Unrolls data grouped earlier.
Logic wise it achieves the desired results.
import pandas as pd
data = [
("2022-05-01", "red", 100),
("2022-05-01", "blue", 100),
("2022-05-02", "red", 20),
("2022-05-03", "blue", 30),
("2022-05-03", "green", 40),
]
df = pd.DataFrame(data=data, columns=["date", "type", "var"])
Group data by date
as arrays:
df = df.groupby("date") \
.agg( \
type_arr=("type", list), \
var_arr=("var", list), \
) \
.reset_index()
print(df)
date type_arr var_arr
0 2022-05-01 [red, blue] [100, 100]
1 2022-05-02 [red] [20]
2 2022-05-03 [blue, green] [30, 40]
Get all distinct types:
all_types = set([typ for sublist in df["type_arr"].to_list() for typ in sublist])
print(all_types)
>> {'blue', 'green', 'red'}
Fill missing values using required logic:
def fill_missing(row):
new_type_arr = [] row["type_arr"]
new_var_arr = [] row["var_arr"]
for typ in all_types:
if typ not in row["type_arr"]:
latest_rows = df[[(typ in typ_arr) for typ_arr in df["type_arr"]]]
latest_row = latest_rows[latest_rows.index < row.name][-1:]
latest_value = None
if not latest_row.empty:
type_idx = latest_row["type_arr"].values.tolist()[0].index(typ)
latest_value = latest_row["var_arr"].values.tolist()[0][type_idx]
new_type_arr.append(typ)
new_var_arr.append(latest_value)
return pd.Series([new_type_arr, new_var_arr])
#
df[["new_type_arr", "new_var_arr"]] = df.apply(fill_missing, axis=1)
print(df)
date type_arr var_arr new_type_arr new_var_arr
0 2022-05-01 [red, blue] [100, 100] [red, blue, green] [100, 100, None]
1 2022-05-02 [red] [20] [red, blue, green] [20, 100, None]
2 2022-05-03 [blue, green] [30, 40] [blue, green, red] [30, 40, 20]
Unroll array into individual values:
df = df[["date", "new_type_arr", "new_var_arr"]]
df = df.set_index(["date"]).apply(pd.Series.explode).reset_index()
print(df)
date new_type_arr new_var_arr
0 2022-05-01 red 100
1 2022-05-01 blue 100
2 2022-05-01 green None
3 2022-05-02 red 20
4 2022-05-02 blue 100
5 2022-05-02 green None
6 2022-05-03 blue 30
7 2022-05-03 green 40
8 2022-05-03 red 20