Home > Software design >  Filtering dataframe based on string formatting in pandas
Filtering dataframe based on string formatting in pandas

Time:09-27

I need to filter my dataframe based on dates, formatted as "%d-%m-%Y". I don't know which columns meet this criteria, there might be more than one column meeting this condition. How can I tackle this problem?

import pandas as pd

df = pd.DataFrame(
    {
        "Name": ['A', 'B', 'C'],
        "Training-1 (Realized)": [0, 0, 0],
        "Training-2 (Realized)": ["5/9/2022", "TAKEN", 0],
        "Training-3 (Realized)": ["TAKEN", "TAKEN", 0],
        "Training-4 (Realized)": ["11/9/2022", 0, 0]
    }
)

  Name  Training-1 (Realized) Training-2 (Realized) Training-3 (Realized) Training-4 (Realized)
0    A                      0              5/9/2022                 TAKEN             11/9/2022
1    B                      0                 TAKEN                 TAKEN                     0
2    C                      0                     0                     0                     0

My expected output:

 Training-2 (Realized) Training-4 (Realized)
0              5/9/2022             11/9/2022

I tried using .contains() method but it didn't work for me. I kept getting "ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()."

CodePudding user response:

If you want to get valid dates, you can try to convert to_datetime and only keep the rows/columns with non-NA values:

df2 = df.apply(pd.to_datetime, format='%d/%m/%Y', errors='coerce')

mask = df2.notna()

out = df2.loc[mask.any(axis=1), mask.any()]

output:

  Training-2 (Realized) Training-4 (Realized)
0            2022-09-05            2022-09-11

CodePudding user response:

Try as follows:

out = df.stack().str.extract(r'^(\d{1,2}\/\d{1,2}\/\d{4})$').dropna().unstack().\
    droplevel(0, axis=1)
print(out)

  Training-2 (Realized) Training-4 (Realized)
0              5/9/2022             11/9/2022

Suppose we also add a date value to column Training-3 (Realized), but in a different row, then we get this:

df.iloc[2,3] = '12/9/2022'
print(out)

  Training-2 (Realized) Training-4 (Realized) Training-3 (Realized)
0              5/9/2022             11/9/2022                   NaN
2                   NaN                   NaN             12/9/2022

Explanation

  • First, we use df.stack to get all values as a pd.Series with index/columns as pd.MultiIndex.
  • Now, use Series.str.extract with a regex pattern to find all values that look like dates. N.B. technically, the pattern used will also extract something like "40/1/3000"; if this is a problem, make the pattern more precise.
  • Finally, we use Series.dropna to keep only the values that have been extracted, and we unstack again. Use df.droplevel on column level 0 to get rid of the 0 (i.e. the "name" for the column with extracted values).
  • Related