I have two .csv files where the first column is some name and other columns describe that name. I want to process those files in a way where in the second file I leave only rows where first column is the same in both files. Also duplicates should be removed for further processing. Data in columns other than column zero can be different in both files.
test1.csv
cat,house,guest
dog,environment
mouse,broom
mouse,broom
test2.csv
cat,house,boy
dog,environment
mouse,broom,hole
path,leave
cat,house,boy
What I want to receive:
1_processed.csv
cat,house,guest
dog,environment
mouse,broom
2_processed.csv
cat,house,boy
dog,environment
mouse,broom,hole
I tried following code with pandas to achieve it, however it does not work as expected:
import pandas as pd
file1 = "test1.csv"
file2 = "test2.csv"
if __name__ == '__main__':
print("Getting columns from file")
df1 = pd.read_csv(file1, header=None, sep="delimiter", engine="python")
print("Getting columns from file")
df2 = pd.read_csv(file2, header=None, sep="delimiter", engine="python")
print("Leave only rows where first column is same in both files")
df2.drop(df2[~df2[0].isin(df1[0])], inplace=True)
print("Remove duplicated rows")
df1.drop_duplicates(subset=None, inplace=True)
df2.drop_duplicates(subset=None, inplace=True)
print("Write processed data to files")
df1.to_csv("1_processed", index=False, header=False)
df2.to_csv("2_processed", index=False, header=False)
Output:
1_processed.csv
"cat,house,guest"
"dog,environment"
"mouse,broom"
2_processed.csv
"dog,environment"
"mouse,broom,hole"
"path,leave" #this value should not appear here
"cat,house,boy"
CodePudding user response:
Try with pandas.DataFrame.eq
:
import pandas as pd
file1 = "test1.csv"
file2 = "test2.csv"
if __name__ == '__main__':
print("Getting columns from file1")
df1 = pd.read_csv(file1, header=None)
print("Getting columns from file2")
df2 = pd.read_csv(file2, header=None)
print("Remove duplicated rows")
df1 = df1.drop_duplicates()
df2 = df2.drop_duplicates()
print("Leave only rows where first column is same in both files")
df2 = df2[df1.eq(df2)[0]]
print("Write processed data to files")
df1.to_csv("1_processed.csv", index=False, header=False)
df2.to_csv("2_processed.csv", index=False, header=False)
# Output :
print(df1)
0 1 2
0 cat house guest
1 dog environment NaN
2 mouse broom NaN
print(df2)
0 1 2
0 cat house boy
1 dog environment NaN
2 mouse broom hole