Home > Back-end >  Using Groupby and shift -1 to fill column value - Unexpected outcome
Using Groupby and shift -1 to fill column value - Unexpected outcome

Time:02-27

I am trying to group data according to three columns and then fill a value in one column using data from a subsequent row.

The data for the "test" dataframe is in the dictionary below.

{'NID': {110048: '217006453',
  110049: '217006453',
  110050: '217006453',
  110051: '217006453',
  110052: '217006453',
  110053: '217006453'},
 'Fdat': {110048: Timestamp('2019-02-16 00:00:00'),
  110049: Timestamp('2019-02-16 00:00:00'),
  110050: Timestamp('2019-02-16 00:00:00'),
  110051: Timestamp('2019-02-16 00:00:00'),
  110052: Timestamp('2019-02-16 00:00:00'),
  110053: Timestamp('2019-02-16 00:00:00')},
 'Code': {110048: 2, 110049: 2, 110050: 10, 110051: 2, 110052: 2, 110053: 10},
 'Cdat': {110048: NaT,
  110049: NaT,
  110050: Timestamp('2019-07-01 00:00:00'),
  110051: NaT,
  110052: NaT,
  110053: Timestamp('2020-01-30 00:00:00')},
 'Abdat': {110048: NaT,
  110049: NaT,
  110050: NaT,
  110051: NaT,
  110052: NaT,
  110053: Timestamp('2020-01-30 00:00:00')}}

Formatted it looks like the following

          NID           Fdat   Code  Cdat      Abdat
110048  217006453   2019-02-16  2   NaT         NaT
110049  217006453   2019-02-16  2   NaT         NaT
110050  217006453   2019-02-16  10  2019-07-01  NaT
110051  217006453   2019-02-16  2   NaT         NaT
110052  217006453   2019-02-16  2   NaT         NaT
110053  217006453   2019-02-16  10  2020-01-30  2020-01-30

I would like to allocate the Abdat to the row with the Cdat that is not NaT for the same animal and Fdat. Then delete the current row that contains the Abdat

The desired output is

          NID           Fdat   Code  Cdat      Abdat
110048  217006453   2019-02-16  2   NaT         NaT
110049  217006453   2019-02-16  2   NaT         NaT
110050  217006453   2019-02-16  10  2019-07-01  2020-01-30
110051  217006453   2019-02-16  2   NaT         NaT
110052  217006453   2019-02-16  2   NaT         NaT

To do this I have tried the following line of code

test.loc["Abdat"] = test.groupby(['NID', 'Fdat', 'Code'])['Abdat'].shift(-1)

When I run this code I am getting the following unexpected result? When I run the same code in another chunk it behaves differently however I get the following warning.

C:\Users\jhou2636\Anaconda3\lib\site-packages\pandas\core\indexing.py:2107: FutureWarning:

In a future version, the Index constructor will not infer numeric dtypes when passed object-dtype sequences (matching Series behavior)

When I run the code below

test = repro.loc[repro['NID']=='217006453', ["NID", "Fdat", 'Code', 'Cdat','Abdat']]

test.loc["Abdat"] = test.groupby(['NID', 'Fdat', 'Code'])['Abdat'].shift(-1)

test.head(30)

I get the following output


    NID Fdat    Code    Cdat    Abdat
110048  217006453   2019-02-16  2   NaT          NaT
110049  217006453   2019-02-16  2   NaT          NaT
110050  217006453   2019-02-16  10  2019-07-01   NaT
110051  217006453   2019-02-16  2   NaT          NaT
110052  217006453   2019-02-16  2   NaT          NaT
110053  217006453   2019-02-16  10  2020-01-30  2020-01-30
Abdat     NaT           NaT    NaT   NaT         NaT

I am confused by the addition of Abdat to the index and the generation of missing values in all columns.

Help appreciated

CodePudding user response:

The shifting of the value is easy to fix, but the deleting old value part is not trivial and I think needs more specification from you on the actual rule

In your solution the problem is that test.loc["Abdat"] references rows labelled by "Abdat" not the column (that's why the show up in the index). So you can simply do

test['Abdat'] = test.groupby(['NID', 'Fdat', 'Code'])['Abdat'].shift(-1)

This creates


        NID         Fdat        Code    Cdat    Abdat
110048  217006453   2019-02-16  2       NaT NaT
110049  217006453   2019-02-16  2       NaT NaT
110050  217006453   2019-02-16  10      2019-07-01  2020-01-30
110051  217006453   2019-02-16  2       NaT NaT
110052  217006453   2019-02-16  2       NaT NaT
110053  217006453   2019-02-16  10      2020-01-30  NaT

However it is not clear how to "delete" the row you want -- I think you need to specify more precise rules. For example, we can mark the rows that got filled by the default value in the shift by explicitly specifying the default value which is not NaT:

fill_timestamp = pd.Timestamp('1970-01-01 00:00:00')
test['Abdat'] = test.groupby(['NID', 'Fdat', 'Code'])['Abdat'].shift(-1, fill_value = fill_timestamp)

the output would be

    NID Fdat    Code    Cdat    Abdat
110048  217006453   2019-02-16  2   NaT NaT
110049  217006453   2019-02-16  2   NaT NaT
110050  217006453   2019-02-16  10  2019-07-01  2020-01-30
110051  217006453   2019-02-16  2   NaT NaT
110052  217006453   2019-02-16  2   NaT 1970-01-01
110053  217006453   2019-02-16  10  2020-01-30  1970-01-01

Now we can get rid of the rows with the fill_timestamp in 'Abdat':

test[test['Abdat'] != fill_timestamp]

but that will also delete the one for Code = 2:

    NID Fdat    Code    Cdat    Abdat
110048  217006453   2019-02-16  2   NaT NaT
110049  217006453   2019-02-16  2   NaT NaT
110050  217006453   2019-02-16  10  2019-07-01  2020-01-30
110051  217006453   2019-02-16  2   NaT NaT
  • Related