Pandas dataframe - update the values of certain rows based on the condition of a groupby object

Time:09-20

With the following data:

import pandas as pd

# "h" is the hourly alias; the uppercase "H" is deprecated since pandas 2.2
idx_a = pd.date_range(start="2000-01-01 00:00:00", periods=5, freq="h")
idx_b = pd.date_range(start="2000-01-01 00:05:00", periods=5, freq="h")
idx_c = pd.date_range(start="2000-01-02 00:00:00", periods=5, freq="h")
idx_d = pd.date_range(start="2000-01-02 00:05:00", periods=5, freq="h")

# The union of the four ranges is sorted, so articles 'a' and 'b' alternate
df = pd.DataFrame({'article': ['a', 'b']*10, 'view': range(1, 21)},
                  index=idx_a.union(idx_b).union(idx_c).union(idx_d))
print(df)
                     article    view
2000-01-01 00:00:00        a    1
2000-01-01 00:05:00        b    2
2000-01-01 01:00:00        a    3
2000-01-01 01:05:00        b    4
2000-01-01 02:00:00        a    5
2000-01-01 02:05:00        b    6
2000-01-01 03:00:00        a    7
2000-01-01 03:05:00        b    8
2000-01-01 04:00:00        a    9
2000-01-01 04:05:00        b    10
2000-01-02 00:00:00        a    11
2000-01-02 00:05:00        b    12
2000-01-02 01:00:00        a    13
2000-01-02 01:05:00        b    14
2000-01-02 02:00:00        a    15
2000-01-02 02:05:00        b    16
2000-01-02 03:00:00        a    17
2000-01-02 03:05:00        b    18
2000-01-02 04:00:00        a    19
2000-01-02 04:05:00        b    20

For each article on each day, I'd like to replace only the 2 am view value with the 3 am view value. The desired result looks like this ("<===" marks the updated rows):

                     article    view
2000-01-01 00:00:00        a    1
2000-01-01 00:05:00        b    2
2000-01-01 01:00:00        a    3
2000-01-01 01:05:00        b    4
2000-01-01 02:00:00        a    7   <===
2000-01-01 02:05:00        b    8   <===
2000-01-01 03:00:00        a    7
2000-01-01 03:05:00        b    8
2000-01-01 04:00:00        a    9
2000-01-01 04:05:00        b    10
2000-01-02 00:00:00        a    11
2000-01-02 00:05:00        b    12
2000-01-02 01:00:00        a    13
2000-01-02 01:05:00        b    14
2000-01-02 02:00:00        a    17  <===
2000-01-02 02:05:00        b    18  <===
2000-01-02 03:00:00        a    17
2000-01-02 03:05:00        b    18
2000-01-02 04:00:00        a    19
2000-01-02 04:05:00        b    20

After numerous attempts, I got the closest result with this code:

df.groupby([pd.Grouper(freq="D"), 'article']).view.transform(lambda s: s.where( ~(s.index.hour==2), s[s.index.hour==3]))

However, NaN appears in the cells that I expected to be updated. Interestingly, if I replace s[s.index.hour==3] with an integer, the cells (i.e., the 2 am values) are correctly updated with that integer. How can I get each article's 3 am value and use it to update the 2 am value on the same day?

Answer:

If the source and target rows differ only in the hour, you can filter the 3 am rows, relabel their index to 2 am by replacing the hour, and then use DataFrame.update:

df1 = df[df.index.hour==3].rename(lambda x: x.replace(hour=2))
print (df1)                   
                    article  view
2000-01-01 02:00:00       a     7
2000-01-01 02:05:00       b     8
2000-01-02 02:00:00       a    17
2000-01-02 02:05:00       b    18

df.update(df1)
print (df)
                    article  view
2000-01-01 00:00:00       a   1.0
2000-01-01 00:05:00       b   2.0
2000-01-01 01:00:00       a   3.0
2000-01-01 01:05:00       b   4.0
2000-01-01 02:00:00       a   7.0
2000-01-01 02:05:00       b   8.0
2000-01-01 03:00:00       a   7.0
2000-01-01 03:05:00       b   8.0
2000-01-01 04:00:00       a   9.0
2000-01-01 04:05:00       b  10.0
2000-01-02 00:00:00       a  11.0
2000-01-02 00:05:00       b  12.0
2000-01-02 01:00:00       a  13.0
2000-01-02 01:05:00       b  14.0
2000-01-02 02:00:00       a  17.0
2000-01-02 02:05:00       b  18.0
2000-01-02 03:00:00       a  17.0
2000-01-02 03:05:00       b  18.0
2000-01-02 04:00:00       a  19.0
2000-01-02 04:05:00       b  20.0
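The output above shows view as floats after the update. If the integer dtype matters, one possible alternative is a direct positional assignment with boolean masks. This is only a sketch, and it rests on an assumption that is not checked beyond the row count: every 2 am row has a matching 3 am row, and both sets appear in the same (day, article) order, as they do in the sample data. The names starts, is_2am and is_3am are introduced here for illustration.

```python
import pandas as pd

# Rebuild the sample frame from the question.
starts = ["2000-01-01 00:00", "2000-01-01 00:05",
          "2000-01-02 00:00", "2000-01-02 00:05"]
idx = pd.DatetimeIndex(
    sorted(ts for start in starts
           for ts in pd.date_range(start, periods=5, freq="h"))
)
df = pd.DataFrame({"article": ["a", "b"] * 10, "view": range(1, 21)}, index=idx)

is_2am = df.index.hour == 2
is_3am = df.index.hour == 3

# Assumption: the 2 am and 3 am rows pair up one-to-one in the same order.
# Only the counts are checked here; the ordering is taken on trust.
assert is_2am.sum() == is_3am.sum()

# .to_numpy() drops the index, so the values are assigned by position
# instead of being aligned by timestamp.
df.loc[is_2am, "view"] = df.loc[is_3am, "view"].to_numpy()

print(df[is_2am])
```

If some 2 am row had no 3 am counterpart, this positional pairing would be wrong, and the label-based update approach would be the safer choice.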

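Your own solution works if you convert the 3 am values to a list. When a Series is passed as the replacement, where aligns it by index, and since no 3 am label matches a 2 am label, the result is NaN. A list carries no index, so nothing is aligned. Also, instead of inverting the mask with ~, you can use != directly: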

df['view1'] = (df.groupby([pd.Grouper(freq="D"), 'article']).view
                .transform(lambda s: s.where(s.index.hour!=2,s[s.index.hour==3].tolist())))
print (df)
                    article  view  view1
2000-01-01 00:00:00       a     1      1
2000-01-01 00:05:00       b     2      2
2000-01-01 01:00:00       a     3      3
2000-01-01 01:05:00       b     4      4
2000-01-01 02:00:00       a     5      7
2000-01-01 02:05:00       b     6      8
2000-01-01 03:00:00       a     7      7
2000-01-01 03:05:00       b     8      8
2000-01-01 04:00:00       a     9      9
2000-01-01 04:05:00       b    10     10
2000-01-02 00:00:00       a    11     11
2000-01-02 00:05:00       b    12     12
2000-01-02 01:00:00       a    13     13
2000-01-02 01:05:00       b    14     14
2000-01-02 02:00:00       a    15     17
2000-01-02 02:05:00       b    16     18
2000-01-02 03:00:00       a    17     17
2000-01-02 03:05:00       b    18     18
2000-01-02 04:00:00       a    19     19
2000-01-02 04:05:00       b    20     20
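To see where the NaN in the question comes from, here is a small sketch on a single (day, article) group. It compares passing a Series to where with passing a plain scalar. The names s, keep, three_am, aligned and scalar are illustrative only, and the sketch assumes the group contains exactly one 3 am row (with none, .iloc[0] would raise an IndexError).

```python
import pandas as pd

# One (day, article) group: article 'a' on 2000-01-01.
idx = pd.date_range("2000-01-01 00:00", periods=5, freq="h")
s = pd.Series([1, 3, 5, 7, 9], index=idx)

keep = s.index.hour != 2            # True everywhere except the 02:00 row
three_am = s[s.index.hour == 3]     # Series holding only the 03:00 label

# Series replacement: aligned by label, so the 02:00 slot finds no value.
aligned = s.where(keep, three_am)

# Scalar replacement: no index, so it is broadcast to the masked position.
scalar = s.where(keep, three_am.iloc[0])

print(aligned)
print(scalar)
```

Inside the groupby transform from the question, the same scalar idea should apply, for example s.where(s.index.hour != 2, s[s.index.hour == 3].iloc[0]), under the same one-row-per-group assumption.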