I am cleaning a pandas DataFrame imported from a .csv file. The first and second columns hold useful data, and columns 3-5 are junk. This pattern repeats every five columns: in each group of five, the first two columns are useful and the last three are junk. I can remove the junk columns using the code below:
df1 = df.drop(columns=df.columns[4::5])
df1 = df1.drop(columns=df1.columns[3::4])
df1 = df1.drop(columns=df1.columns[2::3])
Is there a solution to do this all in one line?
CodePudding user response:
I think three lines is fine. The code won't get any clearer or faster from putting it all on one line.
Of course, you can always do:
columns = df.columns[:]
df1 = df.drop(columns=columns[4::5]).drop(columns=columns[3::5]).drop(columns=columns[2::5])
which I think also makes it clearer that you intend to drop the fifth, fourth and third columns in every group of five.
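A quick check on a small frame suggests the chained version keeps the same columns as the three-step version from the question. This is only a sketch: the 10-column frame and its labels `c0` to `c9` are made up for illustration, and it assumes the column labels are unique.

```python
import pandas as pd

# Hypothetical frame: in each group of five, the first two columns are useful.
df = pd.DataFrame([list(range(10))], columns=[f"c{i}" for i in range(10)])

# Three-step version from the question: the slice changes after each drop
# because the remaining columns shift left.
step = df.drop(columns=df.columns[4::5])
step = step.drop(columns=step.columns[3::4])
step = step.drop(columns=step.columns[2::3])

# Chained version: every slice refers to positions in the original frame.
cols = df.columns
chained = df.drop(columns=cols[4::5]).drop(columns=cols[3::5]).drop(columns=cols[2::5])

print(list(step.columns))     # ['c0', 'c1', 'c5', 'c6']
print(list(chained.columns))  # ['c0', 'c1', 'c5', 'c6']
```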
CodePudding user response:
Boolean indexing on the columns using NumPy could be useful:
import numpy as np
# select 1st and 2nd columns of every 5 columns
df1 = df.loc[:, np.isin(np.arange(df.shape[1]) % 5, [0, 1])]
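To see what the mask looks like, here is a minimal sketch on a made-up 10-column frame (the labels `c0` to `c9` and the names `positions` and `keep` are assumptions for illustration, not part of the question):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: in each group of five, the first two columns are useful.
df = pd.DataFrame([list(range(10))], columns=[f"c{i}" for i in range(10)])

positions = np.arange(df.shape[1])     # 0, 1, ..., 9
keep = np.isin(positions % 5, [0, 1])  # True where position mod 5 is 0 or 1

df1 = df.loc[:, keep]

print(keep.tolist())
# [True, True, False, False, False, True, True, False, False, False]
print(list(df1.columns))
# ['c0', 'c1', 'c5', 'c6']
```

Since the mask selects by position, it does not depend on the column labels being unique.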
CodePudding user response:
You may use np.r_ to concatenate indexes in an easy way:
>>> import numpy as np
>>> c = df.columns
>>> df.drop(columns=np.r_[c[2::5], c[3::5], c[4::5]])
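A variation on the same idea, sketched under the assumption that column labels are unique: `np.r_` also accepts slice notation, so the junk positions can be built as integers first and then turned into labels. The frame and the names `n` and `junk` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: in each group of five, the last three columns are junk.
df = pd.DataFrame([list(range(10))], columns=[f"c{i}" for i in range(10)])

n = df.shape[1]
# Concatenate three strided ranges of integer positions.
junk = np.r_[2:n:5, 3:n:5, 4:n:5]

df1 = df.drop(columns=df.columns[junk])

print(junk.tolist())      # [2, 7, 3, 8, 4, 9]
print(list(df1.columns))  # ['c0', 'c1', 'c5', 'c6']
```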
CodePudding user response:
You can do:
df1 = pd.concat([df.iloc[:, ::5], df.iloc[:, 1::5]], axis='columns')
That will change the column order, but with well-named columns, that shouldn't matter.
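If the original order does matter, one possible way to restore it is to reselect the surviving labels in the order they appear in the original frame. This is a sketch that assumes unique column labels; the frame and the name `ordered` are made up for illustration.

```python
import pandas as pd

# Hypothetical frame: in each group of five, the first two columns are useful.
df = pd.DataFrame([list(range(10))], columns=[f"c{i}" for i in range(10)])

df1 = pd.concat([df.iloc[:, ::5], df.iloc[:, 1::5]], axis="columns")
print(list(df1.columns))      # ['c0', 'c5', 'c1', 'c6']

# Walk the original labels and keep those that survived, in original order.
ordered = df1[[c for c in df.columns if c in df1.columns]]
print(list(ordered.columns))  # ['c0', 'c1', 'c5', 'c6']
```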