Home > Net >  PYTHON/PANDAS - Reindexing on multiple indexes
PYTHON/PANDAS - Reindexing on multiple indexes

Time:06-21

I have a dataframe similar to what follows:

test = {"id": ["A", "A", "A", "B", "B", "B"],
        "date":    ["09-02-2013", "09-03-2013", "09-05-2013", "09-15-2013", "09-17-2013", "09-18-2013"],
        "country": ["Poland", "Poland", "France", "Scotland", "Scotland", "Canada"]}

and I want a table which returns this :

id date country
A 09-02-2013 Poland
A 09-03-2013 Poland
A 09-04-2013 Poland
A 09-05-2013 France
B 09-15-2013 Scotland
B 09-16-2013 Scotland
B 09-17-2013 Scotland
B 09-18-2013 Canada

i.e. a table that fills in any date that I am missing but will only do it to the min/max of each id

I have looked around stack overflow but usually this problem just has one index or the person wants to drop an index anyway This is what I have got so far:

test_df = pd.DataFrame(test)

# get min date per id
dates = test_df.groupby("id")["date"].min().to_frame(name="min")

# get max date
dates["max"] = test_df.groupby("id")["date"].max().to_frame(name="max")

midx = pd.MultiIndex.from_frame(dates.apply(lambda x: pd.date_range(x["min"], x["max"], freq="D"), axis=1).explode().reset_index(name="date")[["date", "id"]])

test_df = test_df.set_index(["date", "id"])

test_df = test_df.reindex(midx).fillna(method="ffill")

test_df

Which gets me really close but not quite there, with the dates all there but no country:

id date country
A 09-02-2013 NaN
A 09-03-2013 NaN
A 09-04-2013 NaN
A 09-05-2013 NaN
B 09-15-2013 NaN
B 09-16-2013 NaN
B 09-17-2013 NaN
B 09-18-2013 NaN

Any ideas on how to fix it?

CodePudding user response:

IIUC, you could generate a date_range per group, explode, then merge and ffill the values per group:

out = (test_df
       .merge(pd
             .to_datetime(test_df['date'], dayfirst=False)
             .groupby(test_df['id'])
             .apply(lambda g: pd.date_range(g.min(), g.max(), freq='D'))
             .explode().dt.strftime('%m-%d-%Y')
             .reset_index(name='date'),
             how='right'
            )
       .assign(country=lambda d: d.groupby('id')['country'].ffill())
      )

output:

  id        date   country
0  A  09-02-2013    Poland
1  A  09-03-2013    Poland
2  A  09-04-2013    Poland
3  A  09-05-2013    France
4  B  09-15-2013  Scotland
5  B  09-16-2013  Scotland
6  B  09-17-2013  Scotland
7  B  09-18-2013    Canada
  • Related