I have 2 csv files with ~10000 lines :
- csv of name files from directory
- csv with datas for each file in this directory (point 1)
Example of content for each csv file :
csv_1 :
50001200000000016
50001200000000021
50001200000000034
50001200000000048
csv_2:
50001200000000016;187
50001200000000021;287
50001200000000034;187
50001200000000048;5
I want to keep in csv_2 only the lines where the first column match and exist in csv_1
Ex:
If in csv_1 the line 50001200000000016 dont exist, delete the row in csv_2 who begins with 50001200000000016
Thx for help
CodePudding user response:
There are many ways to do this. If the csv is simple (i.e. no tricky quoting or characters, only those two columns), then you could read the first file as set
and loop over the files of the second one.
However, given the specifications you gave (only 10k lines), this shouldn't require any particular optimization and should be easily achievable in memory with pandas
:
import pandas as pd
df1 = pd.read_csv('csv_1.csv', header=None)
df2 = pd.read_csv('csv_2.csv', header=None, sep=';')
df2[df2[0].isin(df1[0])].to_csv('new_file.csv', sep=';', header=None, index=None)
CodePudding user response:
The problem was the file encoding ; here is the code that's work on PyCharm / Jupiter Notebook
import pandas as pd
df1 = pd.read_csv(r'csv_1.csv', encoding='ANSI', header=None)
print(df1)
df2 = pd.read_csv(r'csv_2.csv', encoding='ANSI', header=None, sep=';')
print(df2)
df2[df2[0].isin(df1[0])].to_csv('new_file.csv', encoding='ANSI', sep=';', header=None, index=None)
Thx all.