I have the following code to count the number of times a value is higher than average and lower than average, then appends the value to a new column at the month stating whether it has more values higher than average or lower than average. However the code is quite an eyesore and I would like to know if there are any ways to simplify/beautify it.
from collections import Counter
df4 = df[['Date', 'dv']].copy()
df4.set_index(['Date'], inplace=True)
data = df4.to_period('M')
data.reset_index(inplace=True)
gb = data.groupby(['Date'])
result = gb['dv'].apply(list)
result = result.to_frame(name='dv2')
result.reset_index(inplace=True)
def mv(x):
keys = Counter(x).keys()
values = Counter(x).values()
keys = list(keys)
values = list(values)
if np.nan in keys:
return np.nan
else:
if len(keys) == 1:
return keys[0]
else:
if values[0] > values[1]:
return keys[0]
else:
return keys[1]
result['mv'] = result['dv2'].apply(lambda x: mv(x))
del result['dv2']
df4.reset_index(inplace=True)
data = df4[['Date', 'dv']].copy()
data.set_index('Date', inplace=True)
data_monthly = data.reset_index().groupby([data.index.year, data.index.month], as_index=False).last()
data_monthly['mv'] = data_monthly['Date'].dt.strftime('%Y-%m').map(result.set_index(result['Date'].dt.strftime('%Y-%m'))['mv'])
df4['mv'] = df['Date'].dt.strftime('%Y-%m-%D').map(data_monthly.set_index(data_monthly['Date'].dt.strftime('%Y-%m-%D'))['mv'])
This is the dataframe before:
Date dv
0 2010-01-02 NaN
1 2010-01-12 NaN
2 2010-01-23 NaN
3 2010-02-02 NaN
4 2010-02-13 NaN
5 2010-02-24 NaN
6 2010-03-03 Higher
7 2010-03-13 Higher
8 2010-03-24 Higher
9 2010-04-02 Higher
10 2010-04-14 Lower
11 2010-04-25 Higher
12 2010-05-05 Higher
13 2010-05-15 Lower
14 2010-05-26 Lower
15 2010-06-03 Lower
16 2010-06-14 Higher
17 2010-06-25 Higher
This is the dataframe after:
Date dv mv
0 2010-01-02 NaN NaN
1 2010-01-12 NaN NaN
2 2010-01-23 NaN NaN
3 2010-02-02 NaN NaN
4 2010-02-13 NaN NaN
5 2010-02-24 NaN NaN
6 2010-03-03 Higher NaN
7 2010-03-13 Higher NaN
8 2010-03-24 Higher Higher
9 2010-04-02 Higher NaN
10 2010-04-14 Lower NaN
11 2010-04-25 Higher Higher
12 2010-05-05 Higher NaN
13 2010-05-15 Lower NaN
14 2010-05-26 Lower Lower
15 2010-06-03 Lower NaN
16 2010-06-14 Higher NaN
17 2010-06-25 Higher Higher
CodePudding user response:
resample
andagg
to get the maximum date and most common value for each monthmerge
back to original dataframe andrename
df["Date"] = pd.to_datetime(df["Date"])
modes = df.resample("M", on="Date".agg({"Date": "max",
"dv": lambda x: x.mode()[0] if not x.mode().empty else np.nan}
).reset_index(drop=True))
output = df.merge(modes.rename(columns={"dv": "mv"}), how="left")
>>> output
Date dv mv
0 2010-01-02 NaN NaN
1 2010-01-12 NaN NaN
2 2010-01-23 NaN NaN
3 2010-02-02 NaN NaN
4 2010-02-13 NaN NaN
5 2010-02-24 NaN NaN
6 2010-03-03 Higher NaN
7 2010-03-13 Higher NaN
8 2010-03-24 Higher Higher
9 2010-04-02 Higher NaN
10 2010-04-14 Lower NaN
11 2010-04-25 Higher Higher
12 2010-05-05 Higher NaN
13 2010-05-15 Lower NaN
14 2010-05-26 Lower Lower
15 2010-06-03 Lower NaN
16 2010-06-14 Higher NaN
17 2010-06-25 Higher Higher
Note: pd.Series.mode
will return a list of the most common values if there is more than one. The code above keeps only one mode if it exists (January in your example has all NaN
and therefore no mode).