Home > Blockchain >  Pandas data frame, to_csv creating duplicate rows
Pandas data frame, to_csv creating duplicate rows

Time:05-04

Here is my current code

data = pd.read_csv('file', sep='\t', header=[2])
ndf = pd.DataFrame(data=nd)
new_data = pd.concat([data, ndf])
new_data.to_csv('file', sep='\t', index=False, mode='a', header=False)

So the file I am reading has 3 rows of headers, the headers in the first 2 rows are not used but I need to keep them there.
The headers in row 3 are the same as the headers in ndf, when I concat data and ndf the new_data dataframe is correctly aligned. So there's no problem there.

The problem comes when I try to write the new_data back to the original file with append mode. Every row of data that was in the original file is duplicated. This happens each time.

I have tried adding drop_duplicates new_data = pd.concat([data, ndf]).drop_duplicates(subset='item_sku', keep=False)
But this still leaves me with 2 of each row each time I write back to file.

I also tried reading the file with multiple header rows: header=[0, 1, 2]
But this makes the concat fail, I'm guessing because it's I haven't told the concat function which row of headers to align with. I think passing keys= would work but I'm not understanding the documentation very well.

EDIT-
This is an example of the file I am reading

load  v1.0            74b          FlatFile
ver   raid            week         month    
Dept  Date            Sales        IsHoliday
1     2010-02-05      24924.50     False

This would be the data I am trying to append

Dept  Date            Sales        IsHoliday
3     2010-07-05      6743.50      False

And this is the output I am getting

load  v1.0            74b          FlatFile
ver   raid            week         month    
Dept  Date            Sales        IsHoliday
1     2010-02-05      24924.50     False
1     2010-02-05      24924.50     False
3     2010-07-05      6743.50      False

CodePudding user response:

Try re-setting the columns of nd to the three-level header before concat:

data = pd.read_csv("file1.csv",sep="\t",header=[0,1,2])
nd = pd.read_csv("file2.csv",sep="\t")
nd.columns = data.columns

output = pd.concat([data,nd])
output.to_csv('file', sep='\t', index=False)

>>> output
  load        v1.0      74b  FlatFile
   ver        raid     week     month
  Dept        Date    Sales IsHoliday
0    1  2010-02-05  24924.5     False
0    3  2010-07-05   6743.5     False

CodePudding user response:

I'm sure there's a better way of doing it but I've ended up with this result that works.

data = pd.read_csv('file', sep='\t', header=[0, 1, 2])
columns = data.columns

ndf = pd.DataFrame(data=nd, columns=data.columns.get_level_values(2))
data.columns = data.columns.get_level_values(2)
new_data = pd.concat([data, ndf])

new_data.columns = columns
new_data.to_csv('file', sep='\t', index=False, header=True)

So what I did was, set the ndf to have the same columns as the third row of data, then did the same data.
This allowed me to concat the two dataframes.
I still had the issue that I was missing the first 2 rows of headers but if I saved the columns from the original data file I could then asign the columns, back to the original values, before I saved to csv.

  • Related