I need to filter my dataframe based on dates, formatted as "%d-%m-%Y". I don't know which columns meet this criteria, there might be more than one column meeting this condition. How can I tackle this problem?
import pandas as pd
df = pd.DataFrame(
{
"Name": ['A', 'B', 'C'],
"Training-1 (Realized)": [0, 0, 0],
"Training-2 (Realized)": ["5/9/2022", "TAKEN", 0],
"Training-3 (Realized)": ["TAKEN", "TAKEN", 0],
"Training-4 (Realized)": ["11/9/2022", 0, 0]
}
)
Name Training-1 (Realized) Training-2 (Realized) Training-3 (Realized) Training-4 (Realized)
0 A 0 5/9/2022 TAKEN 11/9/2022
1 B 0 TAKEN TAKEN 0
2 C 0 0 0 0
My expected output:
Training-2 (Realized) Training-4 (Realized)
0 5/9/2022 11/9/2022
I tried using .contains() method but it didn't work for me. I kept getting "ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()."
CodePudding user response:
If you want to get valid dates, you can try to convert to_datetime
and only keep the rows/columns with non-NA values:
df2 = df.apply(pd.to_datetime, format='%d/%m/%Y', errors='coerce')
mask = df2.notna()
out = df2.loc[mask.any(axis=1), mask.any()]
output:
Training-2 (Realized) Training-4 (Realized)
0 2022-09-05 2022-09-11
CodePudding user response:
Try as follows:
out = df.stack().str.extract(r'^(\d{1,2}\/\d{1,2}\/\d{4})$').dropna().unstack().\
droplevel(0, axis=1)
print(out)
Training-2 (Realized) Training-4 (Realized)
0 5/9/2022 11/9/2022
Suppose we also add a date value to column Training-3 (Realized)
, but in a different row, then we get this:
df.iloc[2,3] = '12/9/2022'
print(out)
Training-2 (Realized) Training-4 (Realized) Training-3 (Realized)
0 5/9/2022 11/9/2022 NaN
2 NaN NaN 12/9/2022
Explanation
- First, we use
df.stack
to get all values as apd.Series
with index/columns aspd.MultiIndex
. - Now, use
Series.str.extract
with a regex pattern to find all values that look like dates. N.B. technically, the pattern used will also extract something like "40/1/3000"; if this is a problem, make the pattern more precise. - Finally, we use
Series.dropna
to keep only the values that have been extracted, and weunstack
again. Usedf.droplevel
oncolumn level 0
to get rid of the0
(i.e. the "name" for the column with extracted values).