I have a Pandas Dataframe that get generated every day and the list of columns present in the Dataframe can vary when it gets generated each time.
I am trying to see if I can sort the sequence in which the columns are stored as the final output of the Dataframe in a specific format. If new columns are present they are placed towards the end.
Given below is how I am trying to build this final output
expected_columns = ['cust_id','cost_id','sale_id','prod_id']
Sample Dataframe columns:
['customer_name','cust_id','sale_id','sale_time']
I would like the above Dataframe structured as below:
['cust_id','sale_id','customer_name','sale_time']
Basically the columns in expected_columns
takes the first priority and then place the other columns in the Dataframe as successive column of the new Dataframe.
CodePudding user response:
First idea is use list comprehension and join lists by
:
expected_columns = ['cust_id','cost_id','sale_id','prod_id']
df = pd.DataFrame(columns=['customer_name','cust_id','sale_id','sale_time'])
expected_columns = ['cust_id','cost_id','sale_id','prod_id']
new1 = [c for c in df.columns if c in expected_columns]
new2 = [c for c in df.columns if c not in expected_columns]
new = new1 new2
print (new)
['cust_id','sale_id','customer_name','sale_time']
Or using Index.intersection
with Index.difference
:
expected_columns = ['cust_id','cost_id','sale_id','prod_id']
new = (df.columns.intersection(expected_columns, sort=False).tolist()
df.columns.difference(expected_columns, sort=False).tolist())
If also ordering in ouput by expected_columns
is important use:
new = (pd.Index(expected_columns).intersection(df.columns, sort=False).tolist()
df.columns.difference(expected_columns, sort=False).tolist())
Difference is changed sample data:
expected_columns = ['sale_id','cost_id','cust_id','prod_id']
df = pd.DataFrame(columns=['customer_name','cust_id','sale_id','sale_time'])
new = (pd.Index(expected_columns).intersection(df.columns, sort=False).tolist()
df.columns.difference(expected_columns, sort=False).tolist())
print (new)
['sale_id', 'cust_id', 'customer_name', 'sale_time']
new = (df.columns.intersection(expected_columns, sort=False).tolist()
df.columns.difference(expected_columns, sort=False).tolist())
print (new)
['cust_id', 'sale_id', 'customer_name', 'sale_time']
Last for change order of columns use:
df = df[new]