im running a code to extract pdf tables to separate csv.i have done that using tabula py..now im getting each table but with some errors like unnamed:0,unnamed:1,..so on as few column names.i need to clean the csv tables now..
the output which i get:
Unnamed: 0,id,Unnamed: 1,Unnamed: 2,Length,Unnamed: 3,Typ,Name,Val ,Com 1 ,5 ,3 ,Company ,Assign,fan , , , ,
the desired output which i want:
id ,Length ,Types ,Name ,Value ,Comments 1 ,5 ,3 ,Company ,Assigned ,fan
like this i get tables tables help me get the correct one. how shud i code my df to look this way...if the unnamed is present in column name ..the unnamed should be replaced with the immediate next desired name other than unnamed,if the next column is also unnamed it shud skip and look the next column and so on...
i got a solution to delete the column which has unnamed column name with null values:but i need solution to the above
ser_all_na= df.isna().all(axis='rows')
del_indexer= ser_all_na.index.str.startswith('Unnamed: ')
del_indexer &= ser_all_na
df.drop(columns=ser_all_na[del_indexer].index, inplace=True)
df.to_csv("output.csv", index=False)
CodePudding user response:
Get the column names first by selecting the named columns, drop the unnamed columns and rename the df.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Unnamed: 0': [1],
'id': [5],
'Unnamed: 1': [3],
'Unnamed: 2': ['Company'],
'Length': ['Assigned'],
'Unnamed: 3': ['fan'],
'Types': [np.nan],
'Name': [np.nan],
'Value': [np.nan],
'Comments': [np.nan]})
cols = [x for x in df if "Unnamed" not in x]
df = df.dropna(axis=1)
df.columns = cols
print(df)
Output
id Length Types Name Value Comments
0 1 5 3 Company Assigned fan
CodePudding user response:
# select all columns that are not nameed as 'unnamed'
cols=df.columns[~df.columns.str.contains('Unnamed')]
# create a new data filtering count of columns that are not unnamed
df2=df.iloc[:,:len(cols)]
#rename the column names
df2.columns = cols
df2
id Length Types Name Value Comments
0 1 5 3 Company Assigned fan