I have a large, messy SQL database that I have started cleaning up, and I need to check whether user IDs exist in the tables, preferably in one view. I started with LEFT JOINs listing all user IDs, but the performance is really poor with 5 tables in my case.
So I have decided to use Python: get the IDs into lists and join them together in a pandas DataFrame. My only problem is that I cannot really figure out how to do left joins with Python the same way I would with SQL.
I have 10 lists of user IDs and, ideally, I need a DataFrame whose index is all user IDs and whose columns represent the tables, indicating whether the user ID can be found in each (YES / NO).
Here is what I managed to do:
import pandas as pd

users = pd.DataFrame(users)
table1 = pd.DataFrame(table1)
users = users.merge(table1, how='left', indicator='table1')
Then I would need to loop through all the lists, merge each one with users, rename the indicator values (left_only = 'No', both = 'Yes'), and reindex() the final table.
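Spelled out in full, the loop I have in mind would look roughly like this (the table names and IDs below are just placeholder data):

```python
import pandas as pd

# Placeholder data: each "table" is just a list of user IDs
users = pd.DataFrame({'user_id': [1, 2, 3, 4, 5]})
tables = {
    'table1': [1, 2, 5],
    'table2': [2, 5],
    'table3': [3, 5],
}

for name, ids in tables.items():
    t = pd.DataFrame({'user_id': ids})
    # indicator=name adds a column with 'left_only' / 'both'
    users = users.merge(t, how='left', indicator=name)
    users[name] = users[name].map({'left_only': 'No', 'both': 'Yes'})
```

This works, but one merge per table feels heavy-handed for a simple membership check.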
Any suggestions on how to do this in a more elegant way?
CodePudding user response:
Okay so I figured it out. The dataset I wanted to have would have looked like this:
users  table1  table2  table3
-----  ------  ------  ------
    1  True    None    None
    2  True    True    None
    3  None    None    True
    4  None    None    None
    5  True    True    True
There is no need to use pandas for the joins; it's simpler to build the columns in plain Python.
users = [1, 2, 3, 4, 5]
tables = {'table1': table1, 'table2': table2, 'table3': table3}

data = {
    'users': users
}

for name, table in tables.items():
    column = []
    for user in users:
        if user in table:
            column.append(True)
        else:
            column.append(None)
    data[name] = column
And then to put it into a DataFrame:
import pandas as pd

df = pd.DataFrame.from_dict(data)
df = df.sort_values('users').reset_index(drop=True)
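One refinement worth noting: `user in table` scans the whole list for every user, which is O(n) per lookup. Converting each table to a set first makes the membership test O(1). A sketch with made-up placeholder data:

```python
import pandas as pd

users = [1, 2, 3, 4, 5]
tables = {'table1': [1, 2, 5], 'table2': [2, 5], 'table3': [3, 5]}

df = pd.DataFrame({'users': users})
for name, ids in tables.items():
    id_set = set(ids)  # O(1) membership test instead of an O(n) list scan
    df[name] = [True if u in id_set else None for u in users]
```

With large tables this makes building the whole frame roughly linear instead of quadratic.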