I have a dataframe with a single column which contains the names of people as shown below.
name
--------------
john doe
john david doe
doe henry john
john henry
I want to count the number of times each pair of words appears together in a name, regardless of order. In this example, the words john and doe appear in the same name three times: in john doe, john david doe, and doe henry john.
Expected output
name1 | name2 | count
----------------------
david | doe   | 1
david | john  | 1
doe   | henry | 1
doe   | john  | 3
henry | john  | 2
Notice that name1 is the word that comes first in alphabetical order. Currently I have a brute-force solution:
- Create a list of all unique words in the dataframe
- For each unique word W in this list, filter the records in the original dataframe that contain W
- From the filtered records, count the frequency of the other words; this gives the number of times W appears with each other word
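The brute-force steps above can be sketched as follows (a minimal illustration; the dataframe is rebuilt here from the sample data at the top):

```python
import pandas as pd

df = pd.DataFrame({"name": ["john doe", "john david doe",
                            "doe henry john", "john henry"]})

# 1) Collect all unique words in the dataframe
unique_words = sorted({w for name in df["name"] for w in name.split()})

counts = {}
for w in unique_words:
    # 2) Filter the records that contain the word w
    matches = [name.split() for name in df["name"] if w in name.split()]
    # 3) Count how often every other word co-occurs with w
    for words in matches:
        for other in words:
            if other != w:
                pair = tuple(sorted((w, other)))
                counts[pair] = counts.get(pair, 0) + 1

# Each unordered pair was counted twice (once from each of its words),
# so halve the totals to get the final co-occurrence counts
result = {pair: n // 2 for pair, n in counts.items()}
# result[("doe", "john")] == 3
```

This makes the quadratic cost visible: every unique word triggers another pass over all records.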
Question: This works fine for a small number of records but is inefficient for a large number, since it runs in quadratic complexity. How can I generate the output in a faster way? Is there a function or package that can produce these counts?
Note: I tried n-gram extraction from NLP packages, but this overestimates the counts because it internally appends all the names into one long string, so the last word of one name and the first word of the next name show up as a sequence of words and get added to the count.
CodePudding user response:
Here's a solution which is still quadratic, but over a smaller n (the words within each name), and which pushes most of the work into compiled code (which should hopefully execute faster):
from itertools import combinations
from collections import Counter
import pandas as pd

combs = df['name'].apply(lambda x: list(combinations(sorted(x.split()), 2)))
counts = Counter(combs.explode())
res = pd.Series(counts).rename_axis(['name1', 'name2']).reset_index(name='count')
Output for your sample data:
name1 name2 count
0 doe john 3
1 david doe 1
2 david john 1
3 doe henry 1
4 henry john 2
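A variant of the same idea that stays entirely in pandas (a sketch, again assuming df holds the sample names) is to explode the per-name word pairs and let value_counts do the tallying:

```python
from itertools import combinations
import pandas as pd

df = pd.DataFrame({"name": ["john doe", "john david doe",
                            "doe henry john", "john henry"]})

# Sorted word pairs per name, flattened into one row per pair
combs = df["name"].apply(lambda x: list(combinations(sorted(x.split()), 2)))
res = (combs.explode()
            .value_counts()
            .rename_axis("pair")
            .reset_index(name="count"))

# Split the pair tuples into two columns
res[["name1", "name2"]] = pd.DataFrame(res["pair"].tolist(), index=res.index)
res = res[["name1", "name2", "count"]]
```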
CodePudding user response:
I suggest the below:
import itertools
import pandas as pd

# 1) Build all unique unordered pairs of words using itertools.combinations().
words = sorted({w for name in df.name.tolist() for w in name.split()})
c = list(itertools.combinations(words, 2))
df_counts = pd.DataFrame(c, columns=["name1", "name2"])

# 2) For each pair, iterate through the names and count those that contain
#    both words (comparing against word sets, not substrings).
name_sets = [set(name.split()) for name in df.name.tolist()]
df_counts["counts"] = df_counts.apply(lambda row: sum(row["name1"] in s and row["name2"] in s for s in name_sets), axis=1)
I hope this helps.
CodePudding user response:
This problem is an instance of 'Association Rule Mining' with just 2 items per transaction. There are simple algorithms such as 'Apriori' as well as efficient ones like 'FP-Growth', and you can find a lot of resources to learn them.
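For the two-item case, the support counting at the heart of those algorithms reduces to tallying word pairs per transaction. A minimal pure-Python sketch of that idea (not a full Apriori or FP-Growth implementation):

```python
from collections import Counter
from itertools import combinations

# Each name is one "transaction" whose items are its words
transactions = ["john doe", "john david doe", "doe henry john", "john henry"]

# Support count of each 2-itemset: how many transactions contain both words.
# set() deduplicates repeated words within a name; sorted() normalizes order.
support = Counter(
    pair
    for t in transactions
    for pair in combinations(sorted(set(t.split())), 2)
)
# support[("doe", "john")] == 3
```

For larger itemsets or large datasets, a library implementation of FP-Growth would avoid materializing all candidate pairs.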