I am trying to group data according to three columns and then fill a value in one column using data from a subsequent row.
The data for the "test" dataframe is in the dictionary below.
{'NID': {110048: '217006453',
110049: '217006453',
110050: '217006453',
110051: '217006453',
110052: '217006453',
110053: '217006453'},
'Fdat': {110048: Timestamp('2019-02-16 00:00:00'),
110049: Timestamp('2019-02-16 00:00:00'),
110050: Timestamp('2019-02-16 00:00:00'),
110051: Timestamp('2019-02-16 00:00:00'),
110052: Timestamp('2019-02-16 00:00:00'),
110053: Timestamp('2019-02-16 00:00:00')},
'Code': {110048: 2, 110049: 2, 110050: 10, 110051: 2, 110052: 2, 110053: 10},
'Cdat': {110048: NaT,
110049: NaT,
110050: Timestamp('2019-07-01 00:00:00'),
110051: NaT,
110052: NaT,
110053: Timestamp('2020-01-30 00:00:00')},
'Abdat': {110048: NaT,
110049: NaT,
110050: NaT,
110051: NaT,
110052: NaT,
110053: Timestamp('2020-01-30 00:00:00')}}
Formatted it looks like the following
NID Fdat Code Cdat Abdat
110048 217006453 2019-02-16 2 NaT NaT
110049 217006453 2019-02-16 2 NaT NaT
110050 217006453 2019-02-16 10 2019-07-01 NaT
110051 217006453 2019-02-16 2 NaT NaT
110052 217006453 2019-02-16 2 NaT NaT
110053 217006453 2019-02-16 10 2020-01-30 2020-01-30
I would like to allocate the Abdat to the row with the Cdat that is not NaT for the same animal and Fdat. Then delete the current row that contains the Abdat
The desired output is
NID Fdat Code Cdat Abdat
110048 217006453 2019-02-16 2 NaT NaT
110049 217006453 2019-02-16 2 NaT NaT
110050 217006453 2019-02-16 10 2019-07-01 2020-01-30
110051 217006453 2019-02-16 2 NaT NaT
110052 217006453 2019-02-16 2 NaT NaT
To do this I have tried the following line of code
test.loc["Abdat"] = test.groupby(['NID', 'Fdat', 'Code'])['Abdat'].shift(-1)
When I run this code I am getting the following unexpected result? When I run the same code in another chunk it behaves differently however I get the following warning.
C:\Users\jhou2636\Anaconda3\lib\site-packages\pandas\core\indexing.py:2107: FutureWarning:
In a future version, the Index constructor will not infer numeric dtypes when passed object-dtype sequences (matching Series behavior)
When I run the code below
test = repro.loc[repro['NID']=='217006453', ["NID", "Fdat", 'Code', 'Cdat','Abdat']]
test.loc["Abdat"] = test.groupby(['NID', 'Fdat', 'Code'])['Abdat'].shift(-1)
test.head(30)
I get the following output
NID Fdat Code Cdat Abdat
110048 217006453 2019-02-16 2 NaT NaT
110049 217006453 2019-02-16 2 NaT NaT
110050 217006453 2019-02-16 10 2019-07-01 NaT
110051 217006453 2019-02-16 2 NaT NaT
110052 217006453 2019-02-16 2 NaT NaT
110053 217006453 2019-02-16 10 2020-01-30 2020-01-30
Abdat NaT NaT NaT NaT NaT
I am confused by the addition of Abdat to the index and the generation of missing values in all columns.
Help appreciated
CodePudding user response:
The shifting of the value is easy to fix, but the deleting old value part is not trivial and I think needs more specification from you on the actual rule
In your solution the problem is that test.loc["Abdat"]
references rows labelled by "Abdat" not the column (that's why the show up in the index). So you can simply do
test['Abdat'] = test.groupby(['NID', 'Fdat', 'Code'])['Abdat'].shift(-1)
This creates
NID Fdat Code Cdat Abdat
110048 217006453 2019-02-16 2 NaT NaT
110049 217006453 2019-02-16 2 NaT NaT
110050 217006453 2019-02-16 10 2019-07-01 2020-01-30
110051 217006453 2019-02-16 2 NaT NaT
110052 217006453 2019-02-16 2 NaT NaT
110053 217006453 2019-02-16 10 2020-01-30 NaT
However it is not clear how to "delete" the row you want -- I think you need to specify more precise rules. For example, we can mark the rows that got filled by the default value in the shift by explicitly specifying the default value which is not NaT
:
fill_timestamp = pd.Timestamp('1970-01-01 00:00:00')
test['Abdat'] = test.groupby(['NID', 'Fdat', 'Code'])['Abdat'].shift(-1, fill_value = fill_timestamp)
the output would be
NID Fdat Code Cdat Abdat
110048 217006453 2019-02-16 2 NaT NaT
110049 217006453 2019-02-16 2 NaT NaT
110050 217006453 2019-02-16 10 2019-07-01 2020-01-30
110051 217006453 2019-02-16 2 NaT NaT
110052 217006453 2019-02-16 2 NaT 1970-01-01
110053 217006453 2019-02-16 10 2020-01-30 1970-01-01
Now we can get rid of the rows with the fill_timestamp
in 'Abdat':
test[test['Abdat'] != fill_timestamp]
but that will also delete the one for Code = 2:
NID Fdat Code Cdat Abdat
110048 217006453 2019-02-16 2 NaT NaT
110049 217006453 2019-02-16 2 NaT NaT
110050 217006453 2019-02-16 10 2019-07-01 2020-01-30
110051 217006453 2019-02-16 2 NaT NaT