Count occurrences of a conditional relationship between 2 dataframes and compute correlation

I want to find the relationship between elements of different yet connected data frames. Here I have a df that shows friendships: id1 is friends with id2, id3, id4, id5, id21 etc.

	friend1	friend2
row1	id1	id3
row2	id2	id1
row3	id5	id1
row4	id12	id2
row5	id21	id1
row6	id4	id2
row7	id7	id8
row8	id1	id4
row9	id21	id2
row10	id3	id5

Here is another dataframe that shows when someone goes to a party. For example, id5 went to parties on 2012-02-03 and 2012-05-09.

	person	date
row1	id1	2012-02-03
row2	id2	2012-05-09
row3	id5	2012-02-03
row4	id12	2012-05-09
row5	id21	2012-02-03
row6	id7	2012-02-22
row7	id5	2012-05-09
row8	id3	2012-02-22
row9	id8	2012-02-22
row10	id1	2012-02-22

I want to find the correlation between people attending parties depending on whether their friends attend. For example for id1:

Went to party 2012-02-03 (same day as id21, id5) and 2012-02-22 (same day as id7, id3, id8). So 2 friends on 1 occasion and 3 on another (mean=2.5 friends when he attends a party).

I would like to see the average number of friends present at a party for each person in the dataset. If someone has no friends, attended no parties, or attended parties without any of their friends, then the mean should be 0.

I tried to build this using pandas methods like value_counts/groupby and dictionaries but I lost hope along the way. Thanks in advance for any help.

Here are the constructors for the dfs:

import pandas as pd

index = ['r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7', 'r8', 'r9', 'r10']
data1 = {'friend1': ['id1', 'id2','id5','id12','id21','id4','id7','id1','id21', 'id3'],
         'friend2': ['id3', 'id1','id1','id2','id1','id2','id8','id4','id2','id5']}
data2 = {'person': ['id1', 'id2','id5','id12','id21','id7','id5','id3','id8','id1'],
         'date': ['2012-02-03', '2012-05-09', '2012-02-03', '2012-05-09', '2012-02-03','2012-02-22','2012-05-09','2012-02-22','2012-02-22','2012-02-22']}
df1 = pd.DataFrame(data1, index=index)
df2 = pd.DataFrame(data2, index=index)

CodePudding user response：

The idea is to build two dictionaries: one that maps each person to the set of their friends, and one that maps each party date to the set of all participants. With both dictionaries in place, a loop over all persons counts, for every party a person attended, how many of their friends were there, and then calculates the mean:

import pandas as pd

data1 = {'friend1': ['id1', 'id2','id5','id12','id21','id4','id7','id1','id21', 'id3'],
         'friend2': ['id3', 'id1','id1','id2','id1','id2','id8','id4','id2','id5']}
data2 = {'person': ['id1', 'id2','id5','id12','id21','id7','id5','id3','id8','id1'],
         'date': ['2012-02-03', '2012-05-09', '2012-02-03', '2012-05-09', '2012-02-03','2012-02-22','2012-05-09','2012-02-22','2012-02-22','2012-02-22']}

dctPersons = {}
for friend1, friend2 in zip(data1["friend1"], data1["friend2"]):
    theSet = dctPersons.get(friend1, set())
    theSet.add(friend2)
    dctPersons[friend1] = theSet
    theSet = dctPersons.get(friend2, set())
    theSet.add(friend1)
    dctPersons[friend2] = theSet
print(dctPersons)

dctParties = {}
for person, date in zip(data2["person"], data2["date"]):
    theSet = dctParties.get(date, set())
    theSet.add(person)
    dctParties[date] = theSet
print(dctParties)

dctMeanFriends = {}
for persID, setFriends in dctPersons.items():
    visitedParties = 0
    friendsAtParty = 0
    for setPersAtParty in dctParties.values():
        if persID in setPersAtParty:
            visitedParties += 1
            friendsAtParty += len(setPersAtParty.intersection(setFriends))
    dctMeanFriends[persID] = friendsAtParty / visitedParties if visitedParties > 0 else 0
print( dctMeanFriends )

dfMeanFriends = pd.DataFrame.from_dict(dctMeanFriends, orient='index') # other orient values: 'columns', 'tight'
print(dfMeanFriends)

outputs:

{'id1': {'id2', 'id3', 'id21', 'id5', 'id4'}, 'id3': {'id1', 'id5'}, 'id2': {'id21', 'id1', 'id4', 'id12'}, 'id5': {'id1', 'id3'}, 'id12': {'id2'}, 'id21': {'id1', 'id2'}, 'id4': {'id1', 'id2'}, 'id7': {'id8'}, 'id8': {'id7'}}
{'2012-02-03': {'id1', 'id21', 'id5'}, '2012-05-09': {'id12', 'id2', 'id5'}, '2012-02-22': {'id7', 'id1', 'id8', 'id3'}}
{'id1': 1.5, 'id3': 1.0, 'id2': 1.0, 'id5': 0.5, 'id12': 1.0, 'id21': 1.0, 'id4': 0, 'id7': 1.0, 'id8': 1.0}
        0
id1   1.5
id3   1.0
id2   1.0
id5   0.5
id12  1.0
id21  1.0
id4   0.0
id7   1.0
id8   1.0

Is the result (last dictionary) as you expect it? Or should the mean value be calculated in another way?
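
In case it helps, here is the same idea wrapped in a function that takes the two dataframes directly. This is only a sketch: the function name mean_friends_at_parties and the Series name are made up for illustration, and it assumes the column names friend1, friend2, person and date from the question. Unlike the loop above, it also reports people who appear only in the party table.

```python
from collections import defaultdict

import pandas as pd


def mean_friends_at_parties(friends_df, parties_df):
    """Mean number of friends present per attended party (hypothetical helper)."""
    # person -> set of friends (friendship is treated as symmetric)
    friends_of = defaultdict(set)
    for a, b in zip(friends_df["friend1"], friends_df["friend2"]):
        friends_of[a].add(b)
        friends_of[b].add(a)

    # party date -> set of attendees
    attendees = defaultdict(set)
    for person, date in zip(parties_df["person"], parties_df["date"]):
        attendees[date].add(person)

    # everyone mentioned in either table
    people = set(friends_of) | set(parties_df["person"])

    means = {}
    for person in sorted(people):
        # one count per party this person attended
        counts = [
            len(present & friends_of.get(person, set()))
            for present in attendees.values()
            if person in present
        ]
        means[person] = sum(counts) / len(counts) if counts else 0.0
    return pd.Series(means, name="mean_friends_at_party")


data1 = {'friend1': ['id1', 'id2', 'id5', 'id12', 'id21', 'id4', 'id7', 'id1', 'id21', 'id3'],
         'friend2': ['id3', 'id1', 'id1', 'id2', 'id1', 'id2', 'id8', 'id4', 'id2', 'id5']}
data2 = {'person': ['id1', 'id2', 'id5', 'id12', 'id21', 'id7', 'id5', 'id3', 'id8', 'id1'],
         'date': ['2012-02-03', '2012-05-09', '2012-02-03', '2012-05-09', '2012-02-03',
                  '2012-02-22', '2012-05-09', '2012-02-22', '2012-02-22', '2012-02-22']}

result = mean_friends_at_parties(pd.DataFrame(data1), pd.DataFrame(data2))
print(result)
```

On the sample data this should agree with the dictionary printed above (1.5 for id1, 0.5 for id5, 0.0 for id4).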

CodePudding user response：

A similar solution using pandas merges. Note, though, that this summary ignores the parties people did not attend (perhaps because few or no friends were there), so I can't see how any kind of correlation or relationship could be calculated from the summary table alone.

import pandas as pd

data1 = {'friend1': ['id1', 'id2','id5','id12','id21','id4','id7','id1','id21', 'id3', 'id3'],
         'friend2': ['id3', 'id1','id1','id2','id1','id2','id8','id4','id2','id5', 'sits_home']}
data2 = {'person': ['id1', 'id2','id5','id12','id21','id7','id5','id3','id8','id1','id4','no_friends'],
         'date': ['2012-02-03', '2012-05-09', '2012-02-03', '2012-05-09', '2012-02-03','2012-02-22','2012-05-09','2012-02-22','2012-02-22','2012-02-22','2012-02-23','2012-02-22']}
friends = pd.DataFrame(data1)
meetings = pd.DataFrame(data2)

the_friends = set(data1['friend1']) | set(data1['friend2']) | set(data2['person'])
all_friends = {(i, i) for i in the_friends}
df = meetings.merge(meetings, on='date', how='left')
# sorted() gives each unordered pair one canonical tuple; tuple(set(...)) does not guarantee an order
df.loc[:,'pairs'] = df.apply(lambda x: tuple(sorted([x.person_x, x.person_y])), axis=1)
friends.loc[:,'pairs'] = friends.apply(lambda x: tuple(sorted([x.friend1, x.friend2])), axis=1)
df['count'] = 1.0
df['friends'] = False
all_friends = all_friends | set(friends.pairs.unique())
df.loc[df.pairs.isin(all_friends),'friends'] = True
result = df.loc[df.friends,['person_x', 'date','count']].groupby(['person_x', 'date']).sum().reset_index()[['person_x','count']]
result.loc[:,'count'] = result['count'] - 1
result = result.groupby('person_x').mean()
result.index.name = 'friends'
result.columns = ['Mean number of friends at the party attended']
not_attended = the_friends - set(result.index.values)
for i in not_attended:
    result.loc[i, 'Mean number of friends at the party attended'] = 0.0
print(result)

Output:

            Mean number of friends at the party attended
friends                                                 
id1                                                  1.5
id12                                                 1.0
id2                                                  1.0
id21                                                 1.0
id3                                                  1.0
id4                                                  0.0
id5                                                  0.5
id7                                                  1.0
id8                                                  1.0
no_friends                                           0.0
sits_home                                            0.0
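
Regarding the point about unattended parties: one way to get something a correlation can be computed on is to build a row for every (person, party) combination, including the parties a person skipped, and record both whether they attended and how many of their friends were there. The sketch below does that on the question's original data and then takes a pooled Pearson correlation with Series.corr. This is just one possible reading of "correlation" and an assumption on my part, not necessarily what the asker has in mind; the names grid, attended and friends_present are made up for illustration.

```python
import itertools

import pandas as pd

friends = pd.DataFrame({
    'friend1': ['id1', 'id2', 'id5', 'id12', 'id21', 'id4', 'id7', 'id1', 'id21', 'id3'],
    'friend2': ['id3', 'id1', 'id1', 'id2', 'id1', 'id2', 'id8', 'id4', 'id2', 'id5']})
meetings = pd.DataFrame({
    'person': ['id1', 'id2', 'id5', 'id12', 'id21', 'id7', 'id5', 'id3', 'id8', 'id1'],
    'date': ['2012-02-03', '2012-05-09', '2012-02-03', '2012-05-09', '2012-02-03',
             '2012-02-22', '2012-05-09', '2012-02-22', '2012-02-22', '2012-02-22']})

# person -> set of friends (symmetric)
friend_sets = {}
for a, b in zip(friends['friend1'], friends['friend2']):
    friend_sets.setdefault(a, set()).add(b)
    friend_sets.setdefault(b, set()).add(a)

# party date -> set of attendees
attendees = {}
for person, date in zip(meetings['person'], meetings['date']):
    attendees.setdefault(date, set()).add(person)

people = sorted(set(friend_sets) | set(meetings['person']))

# one row per (person, party), whether or not the person attended
rows = []
for person, (date, present) in itertools.product(people, attendees.items()):
    rows.append({
        'person': person,
        'date': date,
        'attended': int(person in present),
        'friends_present': len(present & friend_sets.get(person, set())),
    })
grid = pd.DataFrame(rows)

# pooled Pearson correlation between "friends were there" and "I went"
corr = grid['friends_present'].corr(grid['attended'])
print(grid.head())
print(corr)
```

Restricting grid to the rows with attended == 1 and averaging friends_present per person gives back the summary table, so the grid is a superset of it. With only three parties the correlation itself will not mean much, but the same layout should carry over to a larger dataset.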