I have pre defined headers for my data frame . but data we receive might or might not include all the columns . however our output file should have all the fields from header. If there is no data in the input file, then keep it blank else populate the values.
input file 1:
ID Name Address Shippment
1 john address1 Y
2 Jessy address2 N
input file2
ID Name Address Shippment Delivered
1 john address1 Y Y
2 Jessy address2 N N
headers=['ID','Name','Address','Shippment','Delivered']
output file 1:
ID Name Address Shippment Delivered
1 john address1 Y
2 Jessy address2 N
output file 1:
ID Name Address Shippment Delivered
1 john address1 Y Y
2 Jessy address2 N N
how do I map the headers to the soruce file when the headers are random?
I tried using zip and update but this looks at index and map. The Source columns can be random and it has to match the fields order from the header.
for i,index in zip(header,df):
final.update({i: df[index].tolist()})
df_final= pd.DataFrame(final)
How can I acheive this ?
CodePudding user response:
You can use reindex()
to ensure the pre-defined headers are all included. Any missing columns in the data will be populated with NaN:
headers = ['ID','Name','Address','Shippment','Delivered']
df = df.reindex(columns=headers)
Input
ID Name Address Shippment
0 1 john address1 Y
1 2 jessy address2 N
Output
ID Name Address Shippment Delivered
0 1 john address1 Y NaN
1 2 jessy address2 N NaN
If the columns in the input are not in the pre-defined sequence, reindex()
will fix that as well:
Input
ID Address Name Shippment
0 1 address1 john Y
1 2 address2 jessy N
Output
ID Name Address Shippment Delivered
0 1 john address1 Y NaN
1 2 jessy address2 N NaN