Home > Mobile >  Leave only rows where first column is same in both files
Leave only rows where first column is same in both files

Time:09-28

I have two .csv files where the first column is some name and other columns describe that name. I want to process those files in a way where in the second file I leave only rows where first column is the same in both files. Also duplicates should be removed for further processing. Data in columns other than column zero can be different in both files.

test1.csv

cat,house,guest
dog,environment
mouse,broom
mouse,broom

test2.csv

cat,house,boy
dog,environment
mouse,broom,hole
path,leave
cat,house,boy

What I want to receive:

1_processed.csv

cat,house,guest
dog,environment
mouse,broom

2_processed.csv

cat,house,boy
dog,environment
mouse,broom,hole

I tried following code with pandas to achieve it, however it does not work as expected:

import pandas as pd

file1 = "test1.csv"
file2 = "test2.csv"

if __name__ == '__main__':
    print("Getting columns from file")
    df1 = pd.read_csv(file1, header=None, sep="delimiter", engine="python")

    print("Getting columns from file")
    df2 = pd.read_csv(file2, header=None, sep="delimiter", engine="python")

    print("Leave only rows where first column is same in both files")
    df2.drop(df2[~df2[0].isin(df1[0])], inplace=True)

    print("Remove duplicated rows")
    df1.drop_duplicates(subset=None, inplace=True)
    df2.drop_duplicates(subset=None, inplace=True)

    print("Write processed data to files")
    df1.to_csv("1_processed", index=False, header=False)
    df2.to_csv("2_processed", index=False, header=False)

Output:

1_processed.csv

"cat,house,guest"
"dog,environment"
"mouse,broom"

2_processed.csv

"dog,environment"
"mouse,broom,hole"
"path,leave"  #this value should not appear here
"cat,house,boy"

CodePudding user response:

Try with pandas.DataFrame.eq :

import pandas as pd

file1 = "test1.csv"
file2 = "test2.csv"

if __name__ == '__main__':
    print("Getting columns from file1")
    df1 = pd.read_csv(file1, header=None)

    print("Getting columns from file2")
    df2 = pd.read_csv(file2, header=None)

    print("Remove duplicated rows")
    df1 = df1.drop_duplicates()
    df2 = df2.drop_duplicates()

    print("Leave only rows where first column is same in both files")
    df2 = df2[df1.eq(df2)[0]]

    print("Write processed data to files")
    df1.to_csv("1_processed.csv", index=False, header=False)
    df2.to_csv("2_processed.csv", index=False, header=False)

# Output :

print(df1)  

       0            1      2
0    cat        house  guest
1    dog  environment    NaN
2  mouse        broom    NaN

print(df2)

       0            1     2
0    cat        house   boy
1    dog  environment   NaN
2  mouse        broom  hole
  • Related