I have several Excel files where several people send some information, but as people is too creative, they didn't preserve all the Excel template as instructed, so some files have 10 columns, others have 12, and so on.
There's a column that has only two possible values, "foo" and "bar", but it's not always in the same column, and the column aren't always named the same, so you can have "Thing" in column 12 and "THING (foo/bar)" in column 9
I'm writting a script to clean all this data (almost 200 files), and I need to get the foos and bars.
So my question is, is there a way to look in the dataframe for this patterns and tell me in which column are they?
Once I get in which column are the foos and bars, the rest of the problem is trivial, since the first column is always right, so I could use my extracted dataframe as df[[0, n]]
where n is where the foos and bars are
Currently what I'm doing is iterating over the column names like:
for i in df.columns:
if len(df.loc[(df[i]=='foo') | (df[i]=='bar')]) > 0:
print(f'Found {i}')
But I really don't like the solution, I guess there must be a better way
CodePudding user response:
You can test whether "foo"
and "bar"
are in particular column and if they are the only values which might be a bit cleaner:
for col in df.columns:
if df[col].isin(["foo","bar"]).all():
print(f'Found {col}')
For example, if we have the following df
:
a b c
0 1 foo foo
1 2 foo foo
2 3 bar bar
3 4 notbar bar
Then our code will return 'Found c'