I need to:
- Join table1 and table2
- Eliminate duplicates
- Keep the originals from table1
- Build a mapping that says which id each row had in its old table and which id it has in the new one
PS: table1 comes from a database that is already in production, and its id is used in many other tables, so I CAN'T change what's already in it. I can only add the data that isn't there yet, and I also need to be able to say what the new id of each row is.
Example: the input and the output would look something like this
table1
id name birthdate
1 Goku 1997-12-15
2 Freeza 2000-10-03
3 Vegeta 2003-08-19
table2
id name birthdate
1 Krillin 1983-02-28
2 Roshi 1960-06-07
3 Goku 1997-12-15
4 Freeza 1998-10-10
So what I need to generate from this is the following:
resulting_table1
id name birthdate
1 Goku 1997-12-15
2 Freeza 2000-10-03
3 Vegeta 2003-08-19
4 Krillin 1983-02-28
5 Roshi 1960-06-07
6 Freeza 1998-10-10
But I also need a table that says which id a person had in the old table and which is the new id, something like this:
from_to_table
id origin new_id
1 table_1 1
2 table_1 2
3 table_1 3
1 table_2 4
2 table_2 5
3 table_2 1
4 table_2 6
I have tried many methods, and the only one I got to work so far was inserting row by row and checking the two fields (name and birthdate) each time, but that takes far too long to be viable.
So far the best approach I've found is basically: join the two tables -> group the data and generate a new id column -> join the grouped table with the joined tables to create the from_to_table. The problem is that this approach changes the ids I must not change, and I don't know how to keep them.
CodePudding user response:
I'm assuming that id is a column, not the index:
table1 =
id name birthdate
0 1 Goku 1997-12-15
1 2 Freeza 2000-10-03
2 3 Vegeta 2003-08-19
Then you could try the following:
(1) Create a joined table_tmp with an extra table column that records which table each row came from:
table_tmp = pd.concat([table1.assign(table=1), table2.assign(table=2)])
id name birthdate table
0 1 Goku 1997-12-15 1
1 2 Freeza 2000-10-03 1
2 3 Vegeta 2003-08-19 1
0 1 Krillin 1983-02-28 2
1 2 Roshi 1960-06-07 2
2 3 Goku 1997-12-15 2
3 4 Freeza 1998-10-10 2
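(2) Based on that, create resulting_table1:
resulting_table1 = (
    table_tmp
    # keep the first occurrence of each (name, birthdate) pair, i.e. the table1 row
    .drop_duplicates(["name", "birthdate"])
    .reset_index(drop=True)
    # renumber 1..n; this reproduces table1's ids only if they already run 1..len(table1) without gaps
    .assign(id=lambda df: df.index + 1)
    .drop(columns="table")
)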
id name birthdate
0 1 Goku 1997-12-15
1 2 Freeza 2000-10-03
2 3 Vegeta 2003-08-19
3 4 Krillin 1983-02-28
4 5 Roshi 1960-06-07
5 6 Freeza 1998-10-10
(3) And then use both to create from_to_table:
from_to_table = (
    table_tmp
    .merge(resulting_table1, on=["name", "birthdate"], how="left")
    .drop(columns=["name", "birthdate"])
    .rename(columns={"id_x": "id", "id_y": "id_new"})
)
id table id_new
0 1 1 1
1 2 1 2
2 3 1 3
3 1 2 4
4 2 2 5
5 3 2 1
6 4 2 6
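If the ids in table1 do not happen to run 1..n without gaps, renumbering from the index would change them. A minimal sketch of an alternative, assuming (name, birthdate) identifies a person and table1 has no duplicate pairs: leave table1 untouched, append only the table2 rows that are missing, and continue numbering after the largest existing id. The names keys, marked, new_rows and lookup are just illustrative.

```python
import pandas as pd

table1 = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Goku", "Freeza", "Vegeta"],
    "birthdate": ["1997-12-15", "2000-10-03", "2003-08-19"],
})
table2 = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["Krillin", "Roshi", "Goku", "Freeza"],
    "birthdate": ["1983-02-28", "1960-06-07", "1997-12-15", "1998-10-10"],
})
keys = ["name", "birthdate"]  # assumption: these two columns identify a person

# Flag each table2 row as present in table1 ("both") or not ("left_only").
marked = table2.merge(
    table1[keys].drop_duplicates(), on=keys, how="left", indicator=True
)
new_rows = (
    marked.loc[marked["_merge"] == "left_only", keys]
    .drop_duplicates()
    .reset_index(drop=True)
)

# New ids continue after the largest id already used in table1.
start = int(table1["id"].max()) + 1
new_rows = new_rows.assign(id=range(start, start + len(new_rows)))[["id"] + keys]

# table1 rows are copied as they are, so their ids cannot change.
resulting_table1 = pd.concat([table1, new_rows], ignore_index=True)

# Map every (origin, old id) to its id in resulting_table1.
lookup = resulting_table1.rename(columns={"id": "new_id"})
from_to_table = pd.concat(
    [table1.assign(origin="table_1"), table2.assign(origin="table_2")],
    ignore_index=True,
).merge(lookup, on=keys, how="left")[["id", "origin", "new_id"]]

print(resulting_table1)
print(from_to_table)
```

With the data from the question this gives the same two tables as asked for, and it keeps working if table1's ids start at some other number or have gaps.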
CodePudding user response:
For resulting_table1, I suggest using merge for an outer join on the columns name and birthdate, and then re-creating the id column:
resulting_table1 = pd.merge(table1, table2, on=['name','birthdate'], how='outer')[['name','birthdate']]
resulting_table1['id'] = range(1, len(resulting_table1) + 1)
For from_to_table, you can use another outer join (this time on all columns) and use the indicator flag to keep the information about the source table:
from_to_table = pd.merge(table1, table2, how='outer', indicator='origin').replace({'origin':{'left_only':'table_1', 'right_only':'table_2'}})
And finally do a left join with resulting_table1 to get the new id:
from_to_table = from_to_table.merge(resulting_table1, on=['name','birthdate'], how="left")
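Since the ids from table1 must not change, it may be worth checking that before writing anything back. The pandas documentation describes how='outer' as sorting the keys lexicographically, so the row order after an outer merge is not guaranteed to match table1, and ids re-created from the row position can then differ. A small check, as a sketch only: ids_preserved is a hypothetical helper, not a pandas function, and the two demo frames are made up for illustration.

```python
import pandas as pd

def ids_preserved(original, result, keys=("name", "birthdate")):
    """Return True if every row of `original` is in `result` with the same id."""
    cols = ["id", *keys]
    merged = original[cols].merge(
        result[cols], on=cols, how="left", indicator=True
    )
    return bool((merged["_merge"] == "both").all())

table1 = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Goku", "Freeza", "Vegeta"],
    "birthdate": ["1997-12-15", "2000-10-03", "2003-08-19"],
})

# A result that only appends a row keeps the existing ids.
extra = pd.DataFrame({"id": [4], "name": ["Krillin"], "birthdate": ["1983-02-28"]})
appended = pd.concat([table1, extra], ignore_index=True)

# A result that was sorted by key and then renumbered does not.
renumbered = appended.sort_values(["name", "birthdate"]).reset_index(drop=True)
renumbered["id"] = range(1, len(renumbered) + 1)

ok = ids_preserved(table1, appended)        # True
broken = ids_preserved(table1, renumbered)  # False
print(ok, broken)
```

Running the same check as ids_preserved(table1, resulting_table1) on your real data tells you whether a given approach is safe for the production table.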