Count occurrences of a conditional relationship between 2 dataframes and compute correlation

Time:09-12

I want to find the relationship between elements of different yet connected data frames. Here I have a df that shows friendships: id1 is friends with id2, id3, id4, id5, id21 etc.

friend1 friend2
row1 id1 id3
row2 id2 id1
row3 id5 id1
row4 id12 id2
row5 id21 id1
row6 id4 id2
row7 id7 id8
row8 id1 id4
row9 id21 id2
row10 id3 id5

Here is another dataframe that shows when someone went to a party. For example, id5 went to parties on 2012-02-03 and 2012-05-09.

person date
row1 id1 2012-02-03
row2 id2 2012-05-09
row3 id5 2012-02-03
row4 id12 2012-05-09
row5 id21 2012-02-03
row6 id7 2012-02-22
row7 id5 2012-05-09
row8 id3 2012-02-22
row9 id8 2012-02-22
row10 id1 2012-02-22

I want to find the correlation between people attending parties depending on whether their friends attend. For example for id1:

Went to party 2012-02-03 (same day as id21, id5) and 2012-02-22 (same day as id7, id3, id8). So 2 friends on 1 occasion and 3 on another (mean=2.5 friends when he attends a party).

I would like to see the average number of friends present at a party for each person in the dataset. If someone has no friends, attended no parties, or attended parties without their friends, then the mean will be 0.

I tried to build this using pandas methods like value_counts/groupby and dictionaries but I lost hope along the way. Thanks in advance for any help.

Here are the constructors for the dfs:

import pandas as pd

index = ['r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7', 'r8', 'r9', 'r10']
data1 = {'friend1': ['id1', 'id2','id5','id12','id21','id4','id7','id1','id21', 'id3'],
         'friend2': ['id3', 'id1','id1','id2','id1','id2','id8','id4','id2','id5']}
data2 = {'person': ['id1', 'id2','id5','id12','id21','id7','id5','id3','id8','id1'],
         'date': ['2012-02-03', '2012-05-09', '2012-02-03', '2012-05-09', '2012-02-03','2012-02-22','2012-05-09','2012-02-22','2012-02-22','2012-02-22']}
df1 = pd.DataFrame(data1, index=index)
df2 = pd.DataFrame(data2, index=index)

CodePudding user response:

The idea is to build two dictionaries: one that maps each person to the set of their friends, and one that maps each party date to the set of all participants. With both dictionaries in place, a loop over all persons collects the counts and calculates the mean value:

import pandas as pd

data1 = {'friend1': ['id1', 'id2','id5','id12','id21','id4','id7','id1','id21', 'id3'],
         'friend2': ['id3', 'id1','id1','id2','id1','id2','id8','id4','id2','id5']}
data2 = {'person': ['id1', 'id2','id5','id12','id21','id7','id5','id3','id8','id1'],
         'date': ['2012-02-03', '2012-05-09', '2012-02-03', '2012-05-09', '2012-02-03','2012-02-22','2012-05-09','2012-02-22','2012-02-22','2012-02-22']}

dctPersons = {}
for friend1, friend2 in zip(data1["friend1"], data1["friend2"]):
    theSet = dctPersons.get(friend1, set())
    theSet.add(friend2)
    dctPersons[friend1] = theSet
    theSet = dctPersons.get(friend2, set())
    theSet.add(friend1)
    dctPersons[friend2] = theSet
print(dctPersons)

dctParties = {}
for person, date in zip(data2["person"], data2["date"]):
    theSet = dctParties.get(date, set())
    theSet.add(person)
    dctParties[date] = theSet
print(dctParties)

dctMeanFriends = {}
for persID, setFriends in dctPersons.items():
    visitedParties = 0
    friendsAtParty = 0
    for setPersAtParty in dctParties.values():
        if persID in setPersAtParty:
            visitedParties += 1
            friendsAtParty += len(setPersAtParty.intersection(setFriends))
    dctMeanFriends[persID] = friendsAtParty / visitedParties if visitedParties > 0 else 0
print( dctMeanFriends )

dfMeanFriends = pd.DataFrame.from_dict(dctMeanFriends, orient='index') # columns / tight
print(dfMeanFriends)

outputs:

{'id1': {'id2', 'id3', 'id21', 'id5', 'id4'}, 'id3': {'id1', 'id5'}, 'id2': {'id21', 'id1', 'id4', 'id12'}, 'id5': {'id1', 'id3'}, 'id12': {'id2'}, 'id21': {'id1', 'id2'}, 'id4': {'id1', 'id2'}, 'id7': {'id8'}, 'id8': {'id7'}}
{'2012-02-03': {'id1', 'id21', 'id5'}, '2012-05-09': {'id12', 'id2', 'id5'}, '2012-02-22': {'id7', 'id1', 'id8', 'id3'}}
{'id1': 1.5, 'id3': 1.0, 'id2': 1.0, 'id5': 0.5, 'id12': 1.0, 'id21': 1.0, 'id4': 0, 'id7': 1.0, 'id8': 1.0}
        0
id1   1.5
id3   1.0
id2   1.0
id5   0.5
id12  1.0
id21  1.0
id4   0.0
id7   1.0
id8   1.0

Is the result (last dictionary) as you expect it? Or should the mean value be calculated in another way?
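
As a quick sanity check of the id1 row, here is a minimal sketch that recomputes the value by hand. It assumes that only people paired with id1 in the friendship table count as friends, so id7 and id8 are not counted even though they attended on the same day. The names friends_of_id1 and parties_of_id1 are just illustrative.

```python
# Friends of id1 according to the friendship table (either column).
friends_of_id1 = {"id2", "id3", "id4", "id5", "id21"}

# Attendees of the two parties that id1 went to.
parties_of_id1 = {
    "2012-02-03": {"id1", "id5", "id21"},
    "2012-02-22": {"id1", "id3", "id7", "id8"},
}

# Number of id1's friends present at each of those parties.
friends_per_party = {
    date: len(attendees & friends_of_id1)
    for date, attendees in parties_of_id1.items()
}
mean_friends = sum(friends_per_party.values()) / len(friends_per_party)

print(friends_per_party)  # {'2012-02-03': 2, '2012-02-22': 1}
print(mean_friends)       # 1.5
```

Under that reading the mean for id1 is 1.5, which matches the dictionary printed above.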

CodePudding user response:

A similar solution. However, since this summary doesn't account for the parties that people didn't attend (perhaps because few or no friends were there), I can't see how we could calculate any kind of correlation/relationship from this summary table...

import pandas as pd

data1 = {'friend1': ['id1', 'id2','id5','id12','id21','id4','id7','id1','id21', 'id3', 'id3'],
         'friend2': ['id3', 'id1','id1','id2','id1','id2','id8','id4','id2','id5', 'sits_home']}
data2 = {'person': ['id1', 'id2','id5','id12','id21','id7','id5','id3','id8','id1','id4','no_friends'],
         'date': ['2012-02-03', '2012-05-09', '2012-02-03', '2012-05-09', '2012-02-03','2012-02-22','2012-05-09','2012-02-22','2012-02-22','2012-02-22','2012-02-23','2012-02-22']}
friends = pd.DataFrame(data1)
meetings = pd.DataFrame(data2)

the_friends = set(data1['friend1']) | set(data1['friend2']) | set(data2['person'])
all_friends = {(i, i) for i in the_friends}
df = meetings.merge(meetings, on='date', how='left')
# Sort each pair so that (a, b) and (b, a) always map to the same tuple (set ordering is not guaranteed)
df.loc[:,'pairs'] = df.apply(lambda x: tuple(sorted([x.person_x, x.person_y])), axis=1)
friends.loc[:,'pairs'] = friends.apply(lambda x: tuple(sorted([x.friend1, x.friend2])), axis=1)
df['count'] = 1.0
df['friends'] = False
all_friends = all_friends | set(friends.pairs.unique())
df.loc[df.pairs.isin(all_friends),'friends'] = True
result = df.loc[df.friends,['person_x', 'date','count']].groupby(['person_x', 'date']).sum().reset_index()[['person_x','count']]
result.loc[:,'count'] = result['count'] - 1
result = result.groupby('person_x').mean()
result.index.name = 'friends'
result.columns = ['Mean number of friends at the party attended']
not_attended = the_friends - set(result.index.values)
for i in not_attended:
    result.loc[i, 'Mean number of friends at the party attended'] = 0.0
print(result)

Output:

            Mean number of friends at the party attended
friends                                                 
id1                                                  1.5
id12                                                 1.0
id2                                                  1.0
id21                                                 1.0
id3                                                  1.0
id4                                                  0.0
id5                                                  0.5
id7                                                  1.0
id8                                                  1.0
no_friends                                           0.0
sits_home                                            0.0
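
Regarding the caveat about parties that people did not attend: one possible way to get at an actual correlation is to look at every (person, party) combination, not only the attended ones. The sketch below is only an illustration under my own assumptions. It uses the original data from the question, treats friendship as symmetric, and uses a plain Pearson correlation between a 0/1 "attended" flag and the number of friends present. The names panel, attended and friends_present are made up for this example.

```python
import pandas as pd

data1 = {'friend1': ['id1', 'id2', 'id5', 'id12', 'id21', 'id4', 'id7', 'id1', 'id21', 'id3'],
         'friend2': ['id3', 'id1', 'id1', 'id2', 'id1', 'id2', 'id8', 'id4', 'id2', 'id5']}
data2 = {'person': ['id1', 'id2', 'id5', 'id12', 'id21', 'id7', 'id5', 'id3', 'id8', 'id1'],
         'date': ['2012-02-03', '2012-05-09', '2012-02-03', '2012-05-09', '2012-02-03',
                  '2012-02-22', '2012-05-09', '2012-02-22', '2012-02-22', '2012-02-22']}

# person -> set of friends (assumption: friendship is symmetric)
friend_sets = {}
for a, b in zip(data1['friend1'], data1['friend2']):
    friend_sets.setdefault(a, set()).add(b)
    friend_sets.setdefault(b, set()).add(a)

# date -> set of attendees
attendees = {}
for person, date in zip(data2['person'], data2['date']):
    attendees.setdefault(date, set()).add(person)

people = sorted(set(friend_sets) | set(data2['person']))

# One row per (person, party), whether or not the person attended.
rows = []
for person in people:
    for date, present in attendees.items():
        rows.append({
            'person': person,
            'date': date,
            'attended': int(person in present),
            'friends_present': len(friend_sets.get(person, set()) & present),
        })
panel = pd.DataFrame(rows)

# Pearson correlation between attending and the number of friends present.
corr = panel['attended'].corr(panel['friends_present'])
print(panel.head())
print(corr)
```

With only three parties the resulting number says very little, but the same table could be built for a larger dataset, or grouped by person to get one value per person.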