Let's say there are two databases where people their usernames and their network (followers and following) are stored. Both databases consist of the following tables:
Users | Relation
------------------------------------------
- id (Primary Key) | - id (Primary Key)
- username (unique) | - follower_id (Foreign key references users id)
| - following_id (Foreign key references users id)
Assuming some of the usernames in both datasets are the same, some usernames don't exist in the other database, and some usernames are different but refer to the same person. Also, their network have similar tendencies but may slightly vary.
Are there any known or suggested ways to do some similarity on the usernames and their network to either correlate them to a profile in the other database or to None, if there isn't any?
I'm working in python, but any pseudo-code, general answer or idea of how to approach this issue would suffice.
CodePudding user response:
First you have to use fuzzy search by username in both databases. Let's call this method List getFuzzyName(id) You will get the list of 'probably the same person names' Next use the same fuzzy search for getFuzzyName(follower_id) and getFuzzyName(following_id) and get some number of matching friends. Here getFuzzyName has O(n^2) complexity without optimization, as you need to compare each friend from db1 with each one from db2. The persons from DB1 and DB2 who has this number bigger then some limit is most probably the same person.
CodePudding user response:
If i understand correctly,you have two databases. One of them is users, other one is networks. So if follower_id and following_id are equal, you wanna merge these datas.
If you had two dataframes, you can easily combine them using left on and right_on.
pd.merge(
df_users,
df_relations,
how='outer',
left_on='id',
right_on='relative_id',
)
Suppose one of the user has no relatives,Then let's include them as Nan in the result using how='outer'
If you wanna write SQL query, you just simply use "=".
SELECT * from user
FULL OUTER JOIN info
ON user.id = relative.id