Consider this simple example:
import pandas as pd
df = pd.DataFrame({'good_one' : [1,2,3],
'bad_one' : [1,2,3]})
Out[7]:
good_one bad_one
0 1 1
1 2 2
2 3 3
In this artificial example I would like to keep only the columns that DO NOT start with bad. I can apply a regex condition to the column names using .filter(). However, I am not able to make it work with a negative lookbehind.
See here:
df.filter(regex = 'one')
Out[8]:
good_one bad_one
0 1 1
1 2 2
2 3 3
but now
df.filter(regex = '(?<!bad).*')
Out[9]:
good_one bad_one
0 1 1
1 2 2
2 3 3
does not do anything. Am I missing something?
Thanks
CodePudding user response:
Solution if you need to remove columns whose names start with bad:
df = pd.DataFrame({'good_one' : [1,2,3],
'not_bad_one' : [1,2,3],
'bad_one' : [1,2,3]})
#https://stackoverflow.com/a/5334825/2901002
df1 = df.filter(regex=r'^(?!bad).*$')
print (df1)
good_one not_bad_one
0 1 1
1 2 2
2 3 3
^ asserts position at the start of a line
(?!bad) is a negative lookahead: it asserts that the regex inside it does not match at this position
bad matches the characters bad literally
. matches any character
* matches the previous token between zero and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of a line
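Why the lookbehind in the question keeps every column can be checked with the re module alone. This is a minimal sketch, assuming that .filter(regex=...) keeps a label whenever re.search finds a match in it (which is how the pandas documentation describes it); the names columns, lookbehind and lookahead are only illustrative.

```python
import re

columns = ['good_one', 'not_bad_one', 'bad_one']

# Pattern from the question: negative lookbehind with no anchor.
lookbehind = re.compile(r'(?<!bad).*')
# Pattern from the answer: anchored negative lookahead.
lookahead = re.compile(r'^(?!bad).*$')

# At position 0 nothing precedes the name, so "not preceded by bad"
# is trivially true and .* then matches the whole label.
kept_lookbehind = [c for c in columns if lookbehind.search(c)]

# The lookahead inspects what follows the start of the name, so a
# label beginning with bad can never match.
kept_lookahead = [c for c in columns if lookahead.search(c)]

print(kept_lookbehind)  # ['good_one', 'not_bad_one', 'bad_one']
print(kept_lookahead)   # ['good_one', 'not_bad_one']
```

A lookbehind looks at the text before the current position, so it cannot test how a name begins; the anchored lookahead can.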
Solution to remove all columns whose names contain the substring bad:
df2 = df.filter(regex=r'^(?!.*bad).*$')
print (df2)
good_one
0 1
1 2
2 3
^ asserts position at the start of a line
(?!.*bad) is a negative lookahead: it asserts that the regex inside it does not match at this position
.* inside the lookahead matches any character, between zero and unlimited times
bad matches the characters bad literally
. matches any character
* matches the previous token between zero and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of a line
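If a regex is not strictly required, the same two selections can probably be written with the string methods on df.columns. This is an alternative technique, not the approach used in the answer above, and the variable names no_bad_prefix and no_bad_substring are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'good_one': [1, 2, 3],
                   'not_bad_one': [1, 2, 3],
                   'bad_one': [1, 2, 3]})

# Drop columns whose names start with 'bad':
# build a boolean mask over the column names and negate it.
no_bad_prefix = df.loc[:, ~df.columns.str.startswith('bad')]

# Drop columns whose names contain 'bad' anywhere.
# regex=False treats 'bad' as a literal substring.
no_bad_substring = df.loc[:, ~df.columns.str.contains('bad', regex=False)]

print(list(no_bad_prefix.columns))     # ['good_one', 'not_bad_one']
print(list(no_bad_substring.columns))  # ['good_one']
```

The boolean mask makes the negation explicit with ~, which some readers may find easier to follow than a lookahead.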