My dataframe looks like this:
1 07.06.2022 06:55 <NA> <NA> 3 <NA> <NA> 5 5 <NA> <NA> <NA> 6 <NA> <NA>
2 07.06.2022 08:20 <NA> 6 <NA> <NA> 6 <NA> <NA> <NA> <NA> <NA> 1 <NA> <NA>
3 07.06.2022 13:00 <NA> <NA> 2 <NA> 4 6 5 1 5 <NA> 2 <NA> <NA>
4 08.06.2022 06:40 <NA> <NA> 3 <NA> <NA> 1 4 <NA> <NA> <NA> 2 <NA> <NA>
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
79 20.07.2022 08:10 <NA> 2 <NA> <NA> 5 <NA> <NA> <NA> <NA> 6.0 4 3 <NA>
80 20.07.2022 10:15 <NA> 2 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 5 <NA> 3
81 20.07.2022 13:00 <NA> 2 6 5 5 <NA> 1-2 <NA> 6 3.0 3 5 1-1
82 20.07.2022 14:20 <NA> 2 <NA> <NA> <NA> <NA> <NA> <NA> <NA> 6.0 1 5 <NA>
83 21.07.2022 07:00 1 <NA> 6 <NA> <NA> 4 5 <NA> <NA> <NA> 3 <NA> <NA>
The first two columns are fixed, but after that there is no fixed number of columns. Sometimes one of these additional columns contains invalid characters (NaN is valid!).
I want to delete those rows, but I don't know how.
I found the following approaches, but none of them matches my case:
df[df["col"].str.contains("this string")==False]
In that case I need to know which column contains (or does not contain) that string, but I have no idea how many columns there are. I could iterate over them all, but that doesn't feel like the right way. On the upside, contains works with a regex.
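The brute-force loop I mean would look something like this (a minimal sketch, with 'x' standing in for one of the invalid characters):
import pandas as pd

# flag rows whose cells contain the invalid marker; astype(str)
# turns NaN into 'nan', which does not contain 'x'
bad = pd.Series(False, index=df.index)
for col in df.columns[2:]:
    bad |= df[col].astype(str).str.contains('x')
df = df[~bad]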
df[df.isin([...])]
This needs a list of the specific items that may be contained, but I don't have a list of all valid possibilities.
df.eq(r'\d([-]\d)*')
I tried to do something like:
df = df[df.ne(0).any(axis=1)]
(the method is ne, not neq) with regexes, but eq/ne only compare literal values and don't accept regexes.
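The regex-aware counterpart is Series.str.fullmatch, applied column by column. A sketch (treating NaN as valid, since astype(str) would otherwise turn it into the non-matching string 'nan'):
# per-column regex test via str.fullmatch (pandas >= 1.1)
valid = df.iloc[:, 2:].apply(lambda c: c.astype(str).str.fullmatch(r'\d(-\d)*') | c.isna())
df = df[valid.all(axis=1)]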
Merging all columns into a new one
...and then using the first approach to delete the rows. It is pretty easy to combine a fixed number of columns:
df['new-col'] = df['col1'] + df['col2']
I tried
df['new-col'] = ''.join(df.columns[2:])
but that only joins the column names, not their values, so it doesn't work.
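What does work for a variable number of columns is selecting them positionally and joining the values row-wise; a sketch (NaN is filled with an empty string first so it cannot produce a false match later):
# combine however many value columns there are into one string column
df['new-col'] = df.iloc[:, 2:].fillna('').astype(str).agg(' '.join, axis=1)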
What I have
The regex that should match for each of columns[2:] is \d([-]\d)* (or the value should be NaN).
Are there any ideas on how to delete rows where any column matches (or does not match; inverting the regex is pretty easy) a given regex?
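For clarity, here is what the pattern accepts, checked with plain re (anchored via fullmatch):
import re

pat = re.compile(r'\d([-]\d)*')
print(bool(pat.fullmatch('5')))    # True
print(bool(pat.fullmatch('1-2')))  # True
print(bool(pat.fullmatch('x')))    # False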
Sample data as CSV:
Datum,Zeit,User1,User2,User3,User4,User5,User6,User7,User8,User9,User10,User11,User12,User13,User14
03.06.2022,10:45,,,,,,6,,5,,6,,,,
07.06.2022,06:55,,,3,,,5,5,,,,6,,,
07.06.2022,08:20,,6,,,6,,,,,,1,,,
07.06.2022,13:00,,,2,,4,6,5,1,5,,2,,,
08.06.2022,06:40,,,3,,,x,4,,,,2,,,
08.06.2022,12:55,,,3,5,,5,6,5,4,,4,,,
09.06.2022,07:00,4,,4,,,2,,,,,x,,,
09.06.2022,09:52,,2-4,,2-6,,2-4,,,,,2-1,,,
09.06.2022,13:00,,4,,2,6,1-2,,1-3,3,,1-5,3,,
10.06.2022,07:00,2,,3,,,3,,,5,,6,,,
10.06.2022,09:50,,4,,,,,,,,,1,,,
10.06.2022,11:00,,6,,,,2,,,,,,,,
13.06.2022,6:30,1,,,,,4,6,,,,,,,
13.06.2022,10:00,,6,,1,,,,,4,,,,,
13.06.2022,13:00,,6,,,,1,4,,5,,,,2,
14.06.2022,8:00,5,,,,,5,4,,,,,,,
15.06.2022,8:00,1,,,,,4,2,,3,,,,,
This running sample contains all the data as strings. In reality I read an xlsx file containing the data, and I don't want to do tricks on the source data. A workaround might be to export the xlsx to CSV, load it into memory as a string, run a regex to filter all lines, and only then feed the result into pandas, but that sounds weird.
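Reading the xlsx as text directly would avoid that detour; a sketch (the file name is a placeholder, and an Excel engine such as openpyxl must be installed):
import pandas as pd

# read every cell as a string so a regex check sees uniform values
df = pd.read_excel('data.xlsx', dtype=str)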
CodePudding user response:
You want to delete rows that contain 'x' in columns other than the first two.
Use boolean indexing:
df[~df.iloc[:, 2:].apply(lambda c: c.astype(str).str.contains('x')).any(axis=1)]
NB: for an exact match, use df.iloc[:, 2:].eq('x') in place of df.iloc[:, 2:].apply(lambda c: c.astype(str).str.contains('x')).
output:
Datum Zeit User1 User2 User3 User4 User5 User6 User7 User8 User9 User10 User11 User12 User13 User14
0 03.06.2022 10:45 NaN NaN NaN NaN NaN 6 NaN 5 NaN 6.0 NaN NaN NaN NaN
1 07.06.2022 06:55 NaN NaN 3.0 NaN NaN 5 5.0 NaN NaN NaN 6 NaN NaN NaN
2 07.06.2022 08:20 NaN 6 NaN NaN 6.0 NaN NaN NaN NaN NaN 1 NaN NaN NaN
3 07.06.2022 13:00 NaN NaN 2.0 NaN 4.0 6 5.0 1 5.0 NaN 2 NaN NaN NaN
5 08.06.2022 12:55 NaN NaN 3.0 5 NaN 5 6.0 5 4.0 NaN 4 NaN NaN NaN
7 09.06.2022 09:52 NaN 2-4 NaN 2-6 NaN 2-4 NaN NaN NaN NaN 2-1 NaN NaN NaN
8 09.06.2022 13:00 NaN 4 NaN 2 6.0 1-2 NaN 1-3 3.0 NaN 1-5 3.0 NaN NaN
9 10.06.2022 07:00 2.0 NaN 3.0 NaN NaN 3 NaN NaN 5.0 NaN 6 NaN NaN NaN
10 10.06.2022 09:50 NaN 4 NaN NaN NaN NaN NaN NaN NaN NaN 1 NaN NaN NaN
11 10.06.2022 11:00 NaN 6 NaN NaN NaN 2 NaN NaN NaN NaN NaN NaN NaN NaN
12 13.06.2022 6:30 1.0 NaN NaN NaN NaN 4 6.0 NaN NaN NaN NaN NaN NaN NaN
13 13.06.2022 10:00 NaN 6 NaN 1 NaN NaN NaN NaN 4.0 NaN NaN NaN NaN NaN
14 13.06.2022 13:00 NaN 6 NaN NaN NaN 1 4.0 NaN 5.0 NaN NaN NaN 2.0 NaN
15 14.06.2022 8:00 5.0 NaN NaN NaN NaN 5 4.0 NaN NaN NaN NaN NaN NaN NaN
16 15.06.2022 8:00 1.0 NaN NaN NaN NaN 4 2.0 NaN 3.0 NaN NaN NaN NaN NaN
How it works:
# invert the condition
~
# select all but the first 2 columns
df.iloc[:, 2:]
# check whether the values contain "x"
.apply(lambda c: c.astype(str).str.contains('x'))
# True if any value in the row is True
.any(axis=1)
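As a self-contained check, the same filter on a trimmed-down version of the sample CSV (io.StringIO stands in for the real file):
import io
import pandas as pd

csv = '''Datum,Zeit,User1,User2
08.06.2022,06:40,,x
08.06.2022,12:55,,5
'''
demo = pd.read_csv(io.StringIO(csv), dtype=str)

# only the second row survives: it has no 'x' in the value columns
print(demo[~demo.iloc[:, 2:].apply(lambda c: c.astype(str).str.contains('x')).any(axis=1)])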
Updated answer
Keep rows in which every cell is NA, numeric, or matches the regex, and nothing else:
# which cells match ^\d(?:[-]\d)*$ ?
m1 = df.iloc[:, 2:].apply(lambda c: c.astype(str).str.contains(r'^\d(?:[-]\d)*$'))
# which cells are NA?
m2 = df.iloc[:, 2:].isna()
# which cells are numeric?
m3 = df.iloc[:, 2:].apply(pd.to_numeric, errors='coerce').notna()
# do ALL cells in a row match at least one of the above?
mask = (m1 | m2 | m3).all(axis=1)
# if True, keep the row
out = df[mask]
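On the question's sample data the rows this drops are exactly the two containing 'x' (08.06.2022 06:40 and 09.06.2022 07:00); printing the complement of the mask makes that easy to verify:
# rows rejected: at least one cell is neither NA, numeric,
# nor of the form digit(-digit)*
print(df[~mask])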