In a pandas DataFrame, I'd like to accomplish two clean-up steps:
- Drop any row with missing values; and
- Convert a date column from the DD.MM.YYYY pattern to the standard YYYY-MM-DD
I do know the answer for each step separately:
- dropping missing values can be achieved with DataFrame.dropna()
- converting a DD.MM.YYYY string to YYYY-MM-DD can be done with pandas.to_datetime(x, format='%d.%m.%Y')
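For concreteness, here is roughly what each step looks like on its own (a minimal sketch with a throwaway dataframe):
import pandas as pd
df = pd.DataFrame({'dob': ['20.12.2001', None]})
df_clean = df.dropna()                                            # step 1: drop rows with missing values
dob_parsed = pd.to_datetime(df_clean['dob'], format='%d.%m.%Y')   # step 2: parse DD.MM.YYYY strings to datetime64, shown as YYYY-MM-DD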
However, I'm not sure what would be the "standard" way of processing those two steps consecutively (aka "to chain the procedures").
I've seen this answer, which is very on-topic but too rudimentary.
Example
import numpy as np
import pandas as pd
name = ['John', 'Melinda', 'Greg', 'Amanda']
dob = ['20.12.2001', '11.03.1991', '31.12.1999', np.nan]
my_df = pd.DataFrame({'name':name,'dob':dob})
my_df
#> name dob
#> 0 John 20.12.2001
#> 1 Melinda 11.03.1991
#> 2 Greg 31.12.1999
#> 3 Amanda NaN
I'm attempting to write concise code, something like:
# pseudo code
my_df.dropna().to_datetime('dob', format='%d.%m.%Y')
# expected output
#> name dob
#> 0 John 2001-12-20
#> 1 Melinda 1991-03-11
#> 2 Greg 1999-12-31
But I can't make it that simple! In any case, it seems that I must first assign the no-NaN dataframe to another variable. That is:
my_df_nona = my_df.dropna()
and then process that dataframe with to_datetime().
Second, I'm not sure how I'm supposed to make the assignment to my_df_nona. Should I use copy()?
Below are three versions of the same procedure. Each one gives the desired output, but with a different combination of warnings.
Option 1
- not using .copy()
- using .loc[:, 'dob'] as suggested here

my_df_nona_1 = my_df.dropna()
my_df_nona_1.loc[:, 'dob'] = pd.to_datetime(my_df_nona_1.loc[:, 'dob'], format='%d.%m.%Y')
#> SettingWithCopyWarning:
#> A value is trying to be set on a copy of a slice from a DataFrame.
#> Try using .loc[row_indexer,col_indexer] = value instead
#> See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
#> FutureWarning: In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
Option 2
- using .copy()
- using .loc[:, 'dob']

my_df_nona_2 = my_df.dropna().copy()
my_df_nona_2.loc[:, 'dob'] = pd.to_datetime(my_df_nona_2.loc[:, 'dob'], format='%d.%m.%Y')
#> FutureWarning: In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
Option 3
- using .copy()
- not using .loc[:, 'dob']

my_df_nona_3 = my_df.dropna().copy()
my_df_nona_3['dob'] = pd.to_datetime(my_df_nona_3['dob'], format='%d.%m.%Y')

This one emits no warnings.
Summary
As a beginner in pandas, I'd like to ask:
1. Is there a way to chain those procedures in a functional fashion? I come from R, where one could write:
# R
my_df |> drop_na() |> mutate(across(dob, dmy))
I'm trying to understand whether I should attempt to mimic such syntax in pandas (see my attempt in the sketch after this list).
2. If the answer to (1) is 'No', what would be the best practice for a workflow that removes missing values, then converts the dob column type, then possibly runs additional data wrangling / aggregations over the data frame?
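One thing I considered is DataFrame.pipe with a custom function, which seems to come closest to the R pipe, though I'm not sure it's idiomatic (a sketch only; parse_dob is just a name I made up):

def parse_dob(df):
    # return a copy with 'dob' parsed from DD.MM.YYYY strings into datetime64
    out = df.copy()
    out['dob'] = pd.to_datetime(out['dob'], format='%d.%m.%Y')
    return out

my_df.dropna().pipe(parse_dob)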
I've seen this detailed answer about SettingWithCopyWarning. It is insightful. However, I'm not sure my takeaway should be that .copy() is the panacea for all dataframe assignment issues. One limitation I can think of is that using .copy() all over the place would bloat memory.
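For instance, if I understand correctly, every deep copy duplicates the underlying data, which memory_usage(deep=True) should make visible (a quick sketch):

big = pd.DataFrame({'x': range(1_000_000)})
print(big.memory_usage(deep=True).sum())         # bytes held by the original
print(big.copy().memory_usage(deep=True).sum())  # a full copy holds the same amount again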
CodePudding user response:
Use DataFrame.assign to chain to_datetime with DataFrame.dropna:
df = my_df.dropna().assign(dob = lambda x: pd.to_datetime(x['dob'], format='%d.%m.%Y'))
print(df)
name dob
0 John 2001-12-20
1 Melinda 1991-03-11
2 Greg 1999-12-31
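A nice property of this pattern is that further steps can be appended to the same chain. For example (the birth_year column and the sort are illustrative additions, not part of the question):

out = (my_df
       .dropna()
       .assign(dob=lambda x: pd.to_datetime(x['dob'], format='%d.%m.%Y'))
       .assign(birth_year=lambda x: x['dob'].dt.year)  # hypothetical extra wrangling step
       .sort_values('dob'))
print(out)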