Here is my current code
data = pd.read_csv('file', sep='\t', header=[2])
ndf = pd.DataFrame(data=nd)
new_data = pd.concat([data, ndf])
new_data.to_csv('file', sep='\t', index=False, mode='a', header=False)
So the file I am reading has 3 rows of headers, the headers in the first 2 rows are not used but I need to keep them there.
The headers in row 3 are the same as the headers in ndf, when I concat data and ndf the new_data dataframe is correctly aligned. So there's no problem there.
The problem comes when I try to write the new_data back to the original file with append mode. Every row of data that was in the original file is duplicated. This happens each time.
I have tried adding drop_duplicates new_data = pd.concat([data, ndf]).drop_duplicates(subset='item_sku', keep=False)
But this still leaves me with 2 of each row each time I write back to file.
I also tried reading the file with multiple header rows: header=[0, 1, 2]
But this makes the concat fail, I'm guessing because it's I haven't told the concat function which row of headers to align with. I think passing keys= would work but I'm not understanding the documentation very well.
EDIT-
This is an example of the file I am reading
load v1.0 74b FlatFile
ver raid week month
Dept Date Sales IsHoliday
1 2010-02-05 24924.50 False
This would be the data I am trying to append
Dept Date Sales IsHoliday
3 2010-07-05 6743.50 False
And this is the output I am getting
load v1.0 74b FlatFile
ver raid week month
Dept Date Sales IsHoliday
1 2010-02-05 24924.50 False
1 2010-02-05 24924.50 False
3 2010-07-05 6743.50 False
CodePudding user response:
Try re-setting the columns of nd
to the three-level header before concat
:
data = pd.read_csv("file1.csv",sep="\t",header=[0,1,2])
nd = pd.read_csv("file2.csv",sep="\t")
nd.columns = data.columns
output = pd.concat([data,nd])
output.to_csv('file', sep='\t', index=False)
>>> output
load v1.0 74b FlatFile
ver raid week month
Dept Date Sales IsHoliday
0 1 2010-02-05 24924.5 False
0 3 2010-07-05 6743.5 False
CodePudding user response:
I'm sure there's a better way of doing it but I've ended up with this result that works.
data = pd.read_csv('file', sep='\t', header=[0, 1, 2])
columns = data.columns
ndf = pd.DataFrame(data=nd, columns=data.columns.get_level_values(2))
data.columns = data.columns.get_level_values(2)
new_data = pd.concat([data, ndf])
new_data.columns = columns
new_data.to_csv('file', sep='\t', index=False, header=True)
So what I did was, set the ndf to have the same columns as the third row of data, then did the same data.
This allowed me to concat the two dataframes.
I still had the issue that I was missing the first 2 rows of headers but if I saved the columns from the original data file I could then asign the columns, back to the original values, before I saved to csv.