Python Pandas: Join two tables keeping no duplicates but also not changing the first table-CodePudding

I need to:

Join table1 and table2
Eliminate duplicates
Keep the originals from table1
A dictionary to say which was the id in the old table and which is the new id

Example: The output would be something like this

PS: Thing is, table1 originates from an already in production database, and the id I have here is used in many other tables so I CAN'T change what's already on it, only add the new data that's not already on it. But I will also need to say what's the new id of the data.

table1

id   name        birthdate     
1    Goku        1997-12-15 
2    Freeza      2000-10-03
3    Vegeta      2003-08-19

table2

id    name        birthdate
1     Krillin     1983-02-28
2     Roshi       1960-06-07
3     Goku        1997-12-15
4     Freeza      1998-10-10

So what I need to generate from this are the following

resulting_table1

id    name        birthdate     
1     Goku        1997-12-15 
2     Freeza      2000-10-03
3     Vegeta      2003-08-19
4     Krillin     1983-02-28
5     Roshi       1960-06-07
6     Freeza      1998-10-10

but I also need a table that says which code a person was on the old table, and which is the new code, which would also, something like that:

from_to_table

id   origin      new_id
1    table_1     1
2    table_1     2
3    table_1     3
1    table_2     4
2    table_2     5
3    table_2     1
4    table_2     6

I have tried many methods and the only one I got to work now was doing the insertion row by row and doing the checking on the two fields each time, but that just takes way too much time making it not viable.

So far the best way I found consists basically in: Joining the two tables -> Grouping the data and generating new id column -> join the grouped table with the joined two tables to create the from_to_table problem is, that approach will change the ids I must not change, and I don't know how to keep those.

CodePudding user response：

I'm assuming that id is a column, not the index:

table1 =
   id    name   birthdate
0   1    Goku  1997-12-15
1   2  Freeza  2000-10-03
2   3  Vegeta  2003-08-19

Then you could try the following:

(1) Create a joined table_tmp with some additional content:

table_tmp = pd.concat([table1.assign(table=1), table2.assign(table=2)])

   id     name   birthdate  table
0   1     Goku  1997-12-15      1
1   2   Freeza  2000-10-03      1
2   3   Vegeta  2003-08-19      1
0   1  Krillin  1983-02-28      2
1   2    Roshi  1960-06-07      2
2   3     Goku  1997-12-15      2
3   4   Freeza  1998-10-10      2

(2) Based on that create resulting_table1:

resulting_table1 = (
    table_tmp
    .drop_duplicates(["name", "birthdate"])
    .reset_index(drop=True)
    .assign(id=lambda df: df.index   1)
    .drop(columns="table")
)

   id     name   birthdate
0   1     Goku  1997-12-15
1   2   Freeza  2000-10-03
2   3   Vegeta  2003-08-19
3   4  Krillin  1983-02-28
4   5    Roshi  1960-06-07
5   6   Freeza  1998-10-10

(3) And then use both to create from_to_table:

from_to_table = (
    table_tmp
    .merge(resulting_table1, on=["name", "birthdate"], how="left")
    .drop(columns=["name", "birthdate"])
    .rename(columns={"id_x": "id", "id_y": "id_new"})
)

   id  table  id_new
0   1      1       1
1   2      1       2
2   3      1       3
3   1      2       4
4   2      2       5
5   3      2       1
6   4      2       6

CodePudding user response：

For resulting_table1, I suggest to use merge for an an outer join on the columns name and birthdate, and then re-create the id column:

resulting_table1 = pd.merge(table1, table2, on=['name','birthdate'], how='outer')[['name','birthdate']]
resulting_table1['id'] = range(1, len(resulting_table1) 1)

For from_to_table, you can use another outer join (this time on all columns) and use the indicator flag to keep the information about the source table:

from_to_table = pd.merge(table1, table2, how='outer', indicator='origin').replace({'origin':{'left_only':'table_1', 'right_only':'table_2'}})

And finally do a left join of resulting_table1 for the new id:

from_to_table = from_to_table.merge(resulting_table1, on=['name','birthdate'], how="left")