I am trying to do a filter
on my Pandas data frame that contains columns with names such as DE-NL
, DE-FR
, FR-NL
, AT-DE
etc.
I need to filter my data frame so I only get the columns that contains two of a certain strings.
So basically, if I just do this:
df = df.filter(regex="DE|NL|FR")
Then it finds all columns where either of the three are in. So I might also get something like AT-DE
. In this case I would actually only need the following columns:
DE-NL, NL-DE, NL-FR, FR-NL, DE-FR, FR-DE
So two of the filter strings needs to be present before it's a match.
CodePudding user response:
Use:
df = df.filter(regex="^(?:DE|NL|FR)-(?:DE|NL|FR)$")
To make it programmatically from a list of country codes:
l = ['DE', 'NL', 'FR']
s = '|'.join(l)
df = df.filter(regex=f"^(?:{s})-(?:{s})$")
If you want to avoid matching repeated values (e.g., 'NL-NL'):
l = ['DE', 'NL', 'FR']
s = '|'.join(l)
df = df.filter(regex=fr"^({s})-(?!\1)(?:{s})$")