Home > OS >  Simplifying counting operation in pandas dataframe
Simplifying counting operation in pandas dataframe

Time:02-17

I have the following code to count the number of times a value is higher than average and lower than average, then appends the value to a new column at the month stating whether it has more values higher than average or lower than average. However the code is quite an eyesore and I would like to know if there are any ways to simplify/beautify it.

from collections import Counter

df4 = df[['Date', 'dv']].copy()
df4.set_index(['Date'], inplace=True)
data = df4.to_period('M')
data.reset_index(inplace=True)

gb = data.groupby(['Date'])
result = gb['dv'].apply(list)
result = result.to_frame(name='dv2')
result.reset_index(inplace=True)


def mv(x):
    keys = Counter(x).keys()
    values = Counter(x).values()

    keys = list(keys)
    values = list(values)

    if np.nan in keys:
        return np.nan
    else:
        if len(keys) == 1:
            return keys[0]
        else:
            if values[0] > values[1]:
                return keys[0]
            else:
                return keys[1]


result['mv'] = result['dv2'].apply(lambda x: mv(x))
del result['dv2']

df4.reset_index(inplace=True)
data = df4[['Date', 'dv']].copy()
data.set_index('Date', inplace=True)
data_monthly = data.reset_index().groupby([data.index.year, data.index.month], as_index=False).last()

data_monthly['mv'] = data_monthly['Date'].dt.strftime('%Y-%m').map(result.set_index(result['Date'].dt.strftime('%Y-%m'))['mv'])
df4['mv'] = df['Date'].dt.strftime('%Y-%m-%D').map(data_monthly.set_index(data_monthly['Date'].dt.strftime('%Y-%m-%D'))['mv'])

This is the dataframe before:

         Date      dv
0  2010-01-02     NaN
1  2010-01-12     NaN
2  2010-01-23     NaN
3  2010-02-02     NaN
4  2010-02-13     NaN
5  2010-02-24     NaN
6  2010-03-03  Higher
7  2010-03-13  Higher
8  2010-03-24  Higher
9  2010-04-02  Higher
10 2010-04-14   Lower
11 2010-04-25  Higher
12 2010-05-05  Higher
13 2010-05-15   Lower
14 2010-05-26   Lower
15 2010-06-03   Lower
16 2010-06-14  Higher
17 2010-06-25  Higher

This is the dataframe after:

         Date      dv      mv
0  2010-01-02     NaN     NaN
1  2010-01-12     NaN     NaN
2  2010-01-23     NaN     NaN
3  2010-02-02     NaN     NaN
4  2010-02-13     NaN     NaN
5  2010-02-24     NaN     NaN
6  2010-03-03  Higher     NaN
7  2010-03-13  Higher     NaN
8  2010-03-24  Higher  Higher
9  2010-04-02  Higher     NaN
10 2010-04-14   Lower     NaN
11 2010-04-25  Higher  Higher
12 2010-05-05  Higher     NaN
13 2010-05-15   Lower     NaN
14 2010-05-26   Lower   Lower
15 2010-06-03   Lower     NaN
16 2010-06-14  Higher     NaN
17 2010-06-25  Higher  Higher

CodePudding user response:

  1. resample and agg to get the maximum date and most common value for each month
  2. merge back to original dataframe and rename
df["Date"] = pd.to_datetime(df["Date"])
modes = df.resample("M", on="Date".agg({"Date": "max",
                                        "dv": lambda x: x.mode()[0] if not x.mode().empty else np.nan}
                                       ).reset_index(drop=True))
output = df.merge(modes.rename(columns={"dv": "mv"}), how="left")

>>> output
         Date      dv      mv
0  2010-01-02     NaN     NaN
1  2010-01-12     NaN     NaN
2  2010-01-23     NaN     NaN
3  2010-02-02     NaN     NaN
4  2010-02-13     NaN     NaN
5  2010-02-24     NaN     NaN
6  2010-03-03  Higher     NaN
7  2010-03-13  Higher     NaN
8  2010-03-24  Higher  Higher
9  2010-04-02  Higher     NaN
10 2010-04-14   Lower     NaN
11 2010-04-25  Higher  Higher
12 2010-05-05  Higher     NaN
13 2010-05-15   Lower     NaN
14 2010-05-26   Lower   Lower
15 2010-06-03   Lower     NaN
16 2010-06-14  Higher     NaN
17 2010-06-25  Higher  Higher

Note: pd.Series.mode will return a list of the most common values if there is more than one. The code above keeps only one mode if it exists (January in your example has all NaN and therefore no mode).

  • Related