Home > Blockchain >  How to consecutively chain `dropna()` and `to_datetime()` in pandas, accounting for `SettingWithCopy
How to consecutively chain `dropna()` and `to_datetime()` in pandas, accounting for `SettingWithCopy

Time:11-17

In a pandas DataFrame, I'd like to accomplish two clean-up steps:

  1. Drop any row with missing values; and
  2. Convert a date column from DD.MM.YYYY pattern to standard YYYY-MM-DD

I do know the answer for each step separately:

  • dropping missing values can be achieved with pandas.dropna()
  • converting DD.MM.YYYY string to YYYY-MM-DD can be done with pandas.to_datetime(x, format='%d.%m.%Y')

However, I'm not sure what would be the "standard" way of processing those two steps consecutively (aka "to chain the procedures").

I've seen this answer which is very on-topic, but too rudimentary.

Example

import numpy as np
import pandas as pd

name = ['John', 'Melinda', 'Greg', 'Amanda']
dob = ['20.12.2001', '11.03.1991', '31.12.1999', np.nan]
my_df = pd.DataFrame({'name':name,'dob':dob})
my_df
#>       name         dob
#> 0     John  20.12.2001
#> 1  Melinda  11.03.1991
#> 2     Greg  31.12.1999
#> 3   Amanda         NaN

I'm attempting to write a concise code that will be something like:

# pseudo code
my_df.dropna().to_datetime('dob', format='%d.%m.%Y')

# expected output
#>       name        dob
#> 0     John 2001-12-20
#> 1  Melinda 1991-03-11
#> 2     Greg 1999-12-31

But I can't make it that simple! In any case, it seems that I must first assign the no-NaN dataframe to another variable. That is:

my_df_nona = my_df.dropna()

and then process that dataframe with to_datetime().

Second, I'm not sure how I'm supposed to make the assignment to my_df_nona. Should I use copy()?

Below are three versions of the same procedure. Each one gives the desired output, but with a different combination of warnings.

Option 1

  • not using .copy()
  • using .loc([:, 'dob']) as suggested here
  • my_df_nona_1 = my_df.dropna()
    my_df_nona_1.loc[:, 'dob'] = pd.to_datetime(my_df_nona_1.loc[:, 'dob'], format='%d.%m.%Y')
    
    #> SettingWithCopyWarning: 
    #> A value is trying to be set on a copy of a slice from a DataFrame.
    #> Try using .loc[row_indexer,col_indexer] = value instead
    #> See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
    #> FutureWarning: In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
    

Option 2

  • using .copy()
  • using .loc([:, 'dob'])
  • my_df_nona_2 = my_df.dropna().copy()
    my_df_nona_2.loc[:, 'dob'] = pd.to_datetime(my_df_nona_2.loc[:, 'dob'], format='%d.%m.%Y')
    
    #> FutureWarning: In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
    

Option 3

  • using .copy()
  • not using .loc([:, 'dob'])
  • my_df_nona_3 = my_df.dropna().copy()
    my_df_nona_3['dob'] = pd.to_datetime(my_df_nona_3['dob'], format='%d.%m.%Y')
    

Summary

As a beginner in pandas, I'd like to ask:

  1. Is there a way to chain those procedures in a functional fashion?
    I come from R where one could do

    # R
    my_df |> drop_na() |> mutate(across(dob, dmy))
    

    I'm trying to understand whether I should attempt to mimic such syntax in pandas

  2. If the answer to (1) is 'No', then what would be the best practice for a workflow that removes missing values, then converts the dob column type, then possibly additional data wrangling / aggregations over the data frame?


I've seen this detailed answer about SettingWithCopyWarning. It is insightful. However, I'm not sure whether my take from it should be that using .copy() is the panacea for all dataframe assignment issues. One limitation I can think of is that using .copy() all over the place would be bloating memory.

CodePudding user response:

Use DataFrame.assign for chain to_datetime with DataFrame.dropna:

df = my_df.dropna().assign(dob = lambda x: pd.to_datetime(x['dob'], format='%d.%m.%Y'))
print (df)
      name        dob
0     John 2001-12-20
1  Melinda 1991-03-11
2     Greg 1999-12-31
  • Related