I want to find the relationship between elements of two different but connected DataFrames. The first df lists friendships: for example, id1 is friends with id2, id3, id4, id5 and id21.
row | friend1 | friend2 |
---|---|---|
row1 | id1 | id3 |
row2 | id2 | id1 |
row3 | id5 | id1 |
row4 | id12 | id2 |
row5 | id21 | id1 |
row6 | id4 | id2 |
row7 | id7 | id8 |
row8 | id1 | id4 |
row9 | id21 | id2 |
row10 | id3 | id5 |
The second DataFrame shows when each person went to a party. For example, id5 went to parties on 2012-02-03 and 2012-05-09.
row | person | date |
---|---|---|
row1 | id1 | 2012-02-03 |
row2 | id2 | 2012-05-09 |
row3 | id5 | 2012-02-03 |
row4 | id12 | 2012-05-09 |
row5 | id21 | 2012-02-03 |
row6 | id7 | 2012-02-22 |
row7 | id5 | 2012-05-09 |
row8 | id3 | 2012-02-22 |
row9 | id8 | 2012-02-22 |
row10 | id1 | 2012-02-22 |
I want to find out how a person's party attendance relates to whether their friends attend. For example, for id1:
Went to parties on 2012-02-03 (same day as id21, id5) and 2012-02-22 (same day as id7, id3, id8). So 2 friends on one occasion and 3 on another (mean = 2.5 friends when he attends a party).
I would like to see the average number of friends present at a party for each person in the dataset. If someone has no friends, attended no parties, or attended parties without any friends, the mean will be 0.
I tried to build this using pandas methods like value_counts/groupby and dictionaries, but I lost hope along the way. Thanks in advance for any help.
Here are the constructors for the dfs:
import pandas as pd

index = ['r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7', 'r8', 'r9', 'r10']
data1 = {'friend1': ['id1', 'id2','id5','id12','id21','id4','id7','id1','id21', 'id3'],
         'friend2': ['id3', 'id1','id1','id2','id1','id2','id8','id4','id2','id5']}
data2 = {'person': ['id1', 'id2','id5','id12','id21','id7','id5','id3','id8','id1'],
         'date': ['2012-02-03', '2012-05-09', '2012-02-03', '2012-05-09', '2012-02-03','2012-02-22','2012-05-09','2012-02-22','2012-02-22','2012-02-22']}
df1 = pd.DataFrame(data1, index=index)
df2 = pd.DataFrame(data2, index=index)
CodePudding user response:
The idea is to build two dictionaries: one that maps each person to the set of their friends, and one that maps each party date to the set of all participants. With both dictionaries in place, a loop over all persons counts the friends present at each party the person attended and calculates the mean:
import pandas as pd

data1 = {'friend1': ['id1', 'id2','id5','id12','id21','id4','id7','id1','id21', 'id3'],
         'friend2': ['id3', 'id1','id1','id2','id1','id2','id8','id4','id2','id5']}
data2 = {'person': ['id1', 'id2','id5','id12','id21','id7','id5','id3','id8','id1'],
         'date': ['2012-02-03', '2012-05-09', '2012-02-03', '2012-05-09', '2012-02-03','2012-02-22','2012-05-09','2012-02-22','2012-02-22','2012-02-22']}

# Map each person to the set of their friends (friendship counts in both directions).
dctPersons = {}
for friend1, friend2 in zip(data1["friend1"], data1["friend2"]):
    theSet = dctPersons.get(friend1, set())
    theSet.add(friend2)
    dctPersons[friend1] = theSet
    theSet = dctPersons.get(friend2, set())
    theSet.add(friend1)
    dctPersons[friend2] = theSet
print(dctPersons)

# Map each party date to the set of people who attended.
dctParties = {}
for person, date in zip(data2["person"], data2["date"]):
    theSet = dctParties.get(date, set())
    theSet.add(person)
    dctParties[date] = theSet
print(dctParties)

# For each person, sum the friends present over all attended parties
# and divide by the number of parties attended.
dctMeanFriends = {}
for persID, setFriends in dctPersons.items():
    visitedParties = 0
    friendsAtParty = 0
    for setPersAtParty in dctParties.values():
        if persID in setPersAtParty:
            visitedParties += 1
            friendsAtParty += len(setPersAtParty.intersection(setFriends))
    dctMeanFriends[persID] = friendsAtParty / visitedParties if visitedParties > 0 else 0
print(dctMeanFriends)

dfMeanFriends = pd.DataFrame.from_dict(dctMeanFriends, orient='index')  # other orient values: 'columns', 'tight'
print(dfMeanFriends)
outputs:
{'id1': {'id2', 'id3', 'id21', 'id5', 'id4'}, 'id3': {'id1', 'id5'}, 'id2': {'id21', 'id1', 'id4', 'id12'}, 'id5': {'id1', 'id3'}, 'id12': {'id2'}, 'id21': {'id1', 'id2'}, 'id4': {'id1', 'id2'}, 'id7': {'id8'}, 'id8': {'id7'}}
{'2012-02-03': {'id1', 'id21', 'id5'}, '2012-05-09': {'id12', 'id2', 'id5'}, '2012-02-22': {'id7', 'id1', 'id8', 'id3'}}
{'id1': 1.5, 'id3': 1.0, 'id2': 1.0, 'id5': 0.5, 'id12': 1.0, 'id21': 1.0, 'id4': 0, 'id7': 1.0, 'id8': 1.0}
0
id1 1.5
id3 1.0
id2 1.0
id5 0.5
id12 1.0
id21 1.0
id4 0.0
id7 1.0
id8 1.0
Is the result (last dictionary) as you expect it? Or should the mean value be calculated in another way?
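To compare the numbers with the example in the question, here is a minimal sketch that lists which friends were present at each party a single person attended. It assumes the same `data1`/`data2` dictionaries as above; the helper name `friends_per_party` is made up for illustration and is not part of the answer's code.

```python
from collections import defaultdict

data1 = {'friend1': ['id1', 'id2', 'id5', 'id12', 'id21', 'id4', 'id7', 'id1', 'id21', 'id3'],
         'friend2': ['id3', 'id1', 'id1', 'id2', 'id1', 'id2', 'id8', 'id4', 'id2', 'id5']}
data2 = {'person': ['id1', 'id2', 'id5', 'id12', 'id21', 'id7', 'id5', 'id3', 'id8', 'id1'],
         'date': ['2012-02-03', '2012-05-09', '2012-02-03', '2012-05-09', '2012-02-03',
                  '2012-02-22', '2012-05-09', '2012-02-22', '2012-02-22', '2012-02-22']}

# person -> set of friends (friendship is treated as symmetric)
friends = defaultdict(set)
for a, b in zip(data1["friend1"], data1["friend2"]):
    friends[a].add(b)
    friends[b].add(a)

# party date -> set of attendees
parties = defaultdict(set)
for person, date in zip(data2["person"], data2["date"]):
    parties[date].add(person)


def friends_per_party(person):
    """Hypothetical helper: for every party the person attended,
    return the sorted list of that person's friends who were also there."""
    own_friends = friends.get(person, set())
    return {
        date: sorted(attendees & own_friends)
        for date, attendees in parties.items()
        if person in attendees
    }


breakdown = friends_per_party("id1")
print(breakdown)
```

For id1 this lists id21 and id5 on 2012-02-03 but only id3 on 2012-02-22, because id7 and id8 do not appear as friends of id1 in the friendship table. That is why the mean comes out as 1.5 rather than the 2.5 given in the question.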
CodePudding user response:
A similar solution. Note, however, that this summary does not account for the parties people did not attend (for example because few or no friends were there), so I can't see how any kind of correlation/relationship could be calculated from this summary table alone...
import pandas as pd

# Same data as in the question, plus a friend who never goes out ('sits_home'),
# a party-goer without friends ('no_friends') and an extra party for id4.
data1 = {'friend1': ['id1', 'id2','id5','id12','id21','id4','id7','id1','id21', 'id3', 'id3'],
         'friend2': ['id3', 'id1','id1','id2','id1','id2','id8','id4','id2','id5', 'sits_home']}
data2 = {'person': ['id1', 'id2','id5','id12','id21','id7','id5','id3','id8','id1','id4','no_friends'],
         'date': ['2012-02-03', '2012-05-09', '2012-02-03', '2012-05-09', '2012-02-03','2012-02-22','2012-05-09','2012-02-22','2012-02-22','2012-02-22','2012-02-23','2012-02-22']}
friends = pd.DataFrame(data1)
meetings = pd.DataFrame(data2)

the_friends = set(data1['friend1']) | set(data1['friend2']) | set(data2['person'])
# Self-pairs guarantee that every attended party shows up in the grouped result.
all_friends = {(i, i) for i in the_friends}

# Pair up everyone who attended the same party (including each person with themselves).
df = meetings.merge(meetings, on='date', how='left')
# Sort each pair so that (a, b) and (b, a) compare equal; tuple(set(...)) does not guarantee an order.
df.loc[:,'pairs'] = df.apply(lambda x: tuple(sorted([x.person_x, x.person_y])), axis=1)
friends.loc[:,'pairs'] = friends.apply(lambda x: tuple(sorted([x.friend1, x.friend2])), axis=1)
df['count'] = 1.0
df['friends'] = False
all_friends = all_friends | set(friends.pairs.unique())
df.loc[df.pairs.isin(all_friends),'friends'] = True

# Count friend pairs per person and party, then subtract the self-pair.
result = df.loc[df.friends,['person_x', 'date','count']].groupby(['person_x', 'date']).sum().reset_index()[['person_x','count']]
result.loc[:,'count'] = result['count'] - 1
result = result.groupby('person_x').mean()
result.index.name = 'friends'
result.columns = ['Mean number of friends at the party attended']

# People who never attended a party get a mean of 0.
not_attended = the_friends - set(result.index.values)
for i in not_attended:
    result.loc[i, 'Mean number of friends at the party attended'] = 0.0
print(result)
Output:
Mean number of friends at the party attended
friends
id1 1.5
id12 1.0
id2 1.0
id21 1.0
id3 1.0
id4 0.0
id5 0.5
id7 1.0
id8 1.0
no_friends 0.0
sits_home 0.0
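Regarding the caveat at the start of this answer: one way to keep the parties a person skipped is to build a full person x party grid that records, for every combination, how many friends were there and whether the person attended. The following is only a sketch under that assumption, using the original data from the question; the names `edges`, `grid` and `table` are invented here, and the Pearson correlation at the end is just one simplistic choice of measure, not something the question or the answers prescribe.

```python
import pandas as pd

data1 = {'friend1': ['id1', 'id2', 'id5', 'id12', 'id21', 'id4', 'id7', 'id1', 'id21', 'id3'],
         'friend2': ['id3', 'id1', 'id1', 'id2', 'id1', 'id2', 'id8', 'id4', 'id2', 'id5']}
data2 = {'person': ['id1', 'id2', 'id5', 'id12', 'id21', 'id7', 'id5', 'id3', 'id8', 'id1'],
         'date': ['2012-02-03', '2012-05-09', '2012-02-03', '2012-05-09', '2012-02-03',
                  '2012-02-22', '2012-05-09', '2012-02-22', '2012-02-22', '2012-02-22']}
friends = pd.DataFrame(data1)
meetings = pd.DataFrame(data2)

# Make friendship symmetric: one row per (person, friend) direction.
edges = pd.concat([
    friends.rename(columns={"friend1": "person", "friend2": "friend"}),
    friends.rename(columns={"friend2": "person", "friend1": "friend"}),
], ignore_index=True).drop_duplicates()

# For each (person, date): how many of that person's friends attended.
friend_visits = edges.merge(meetings.rename(columns={"person": "friend"}), on="friend")
friends_present = friend_visits.groupby(["person", "date"]).size()

# Full person x date grid, so that skipped parties are kept as rows too.
people = sorted(set(edges["person"]) | set(meetings["person"]))
dates = sorted(meetings["date"].unique())
grid = pd.MultiIndex.from_product([people, dates], names=["person", "date"])

table = pd.DataFrame(index=grid)
table["friends_present"] = friends_present.reindex(grid, fill_value=0)
attended_pairs = set(zip(meetings["person"], meetings["date"]))
table["attended"] = [pair in attended_pairs for pair in table.index]
print(table)

# One possible (simplistic) measure: correlation between attending and friends present.
print(table["attended"].astype(int).corr(table["friends_present"]))

# Restricting to attended parties gives the per-person mean discussed in this thread.
mean_when_attending = table[table["attended"]].groupby(level="person")["friends_present"].mean()
print(mean_when_attending)
```

With only three parties in the sample data, any correlation computed this way should be read with caution; the point of the grid is that it keeps rows such as id1 on 2012-05-09, where friends were present but id1 did not go.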