Home > Enterprise >  Python - Compare 2 csv files and delete rows
Python - Compare 2 csv files and delete rows

Time:10-28

I have 2 csv files with ~10000 lines :

  1. csv of name files from directory
  2. csv with datas for each file in this directory (point 1)

Example of content for each csv file :
csv_1 :

50001200000000016
50001200000000021
50001200000000034
50001200000000048

csv_2:

50001200000000016;187
50001200000000021;287
50001200000000034;187
50001200000000048;5

I want to keep in csv_2 only the lines where the first column match and exist in csv_1

Ex:
If in csv_1 the line 50001200000000016 dont exist, delete the row in csv_2 who begins with 50001200000000016

Thx for help

CodePudding user response:

There are many ways to do this. If the csv is simple (i.e. no tricky quoting or characters, only those two columns), then you could read the first file as set and loop over the files of the second one.

However, given the specifications you gave (only 10k lines), this shouldn't require any particular optimization and should be easily achievable in memory with pandas:

import pandas as pd

df1 = pd.read_csv('csv_1.csv', header=None)
df2 = pd.read_csv('csv_2.csv', header=None, sep=';')

df2[df2[0].isin(df1[0])].to_csv('new_file.csv', sep=';', header=None, index=None)

CodePudding user response:

The problem was the file encoding ; here is the code that's work on PyCharm / Jupiter Notebook

import pandas as pd

df1 = pd.read_csv(r'csv_1.csv', encoding='ANSI', header=None)
print(df1)
df2 = pd.read_csv(r'csv_2.csv', encoding='ANSI', header=None, sep=';')
print(df2)

df2[df2[0].isin(df1[0])].to_csv('new_file.csv', encoding='ANSI', sep=';', header=None, index=None)

Thx all.

  • Related