Hi i have a dataset in csv format. the issue with that data set is that it combine different csv files. and from the other csv files it also copied the column names. now i want to remove all column name which is in the middle of dataset
current csv file is like
col1 col2 col3
1 2 3
1 2 3
1 2 3
1 2 3
col1 col2 col3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
col1 col2 col3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
col1 col2 col3
col1 col2 col3
1 2 3
1 2 3
1 2 3
1 2 3
want to change it to this column name only on the top
col1 col2 col3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
CodePudding user response:
you can use this line of codes:
df = pd.read_csv(df_path)
# removing repeating headers
df = df[df.ne(df.columns).any(1)]
This solution compares each row with actual columns and works regardless of non-column rows being number or not
CodePudding user response:
Assuming that your col1
suppose to have integers, you can use:
df = df[df.apply(lambda r: r['col1'].isdigit(), axis=1)]
Or use your own test in lambda
for each row.
Also, this will leave the column types as object
, so you might want to use .astype(int)
(or something more specific for your case).
CodePudding user response:
First read the csv file into a dataframe using pd.read_csv(). Then drop duplicate rows by df = df[df.col1 != "col1"]
.
It will find all the rows which have col1 in them and drop these rows.