I am analyzing Amazon's reviews dataset, which contains customer IDs, their reviews of different products, and the products' identifiers.
The data can be represented by:
Customer | Product | Review | ... |
---|---|---|---|
1 | A | .... | |
1 | B | .... | |
2 | A | .... | |
2 | C | .... | |
I want to create a weighted undirected graph using networkx, where each node is a product and the weight of the edge between two nodes (products) is the number of distinct customers who reviewed both products.
The data is huge, so I was wondering if there is a feasible way to update the current weights of a network iteratively when going product by product.
Another desirable representation of this graph would be, for the example above,
Product | A | B | C |
---|---|---|---|
A | 2 | 1 | 1 |
B | 1 | 1 | 0 |
C | 1 | 0 | 1 |
EDIT: I mistakenly wrote (A,C)=2 and have replaced it with 1.
CodePudding user response:
Try this, assuming the data is in a CSV file with Customer and Product columns:
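import pandas as pd
df = pd.read_csv('file.csv')
# cross-tabulate: rows are products, columns are customers;
# clip to 0/1 so a customer with repeated reviews of one product counts once
v = pd.crosstab(df['Product'], df['Customer']).clip(upper=1)
# dot product gives the number of customers who reviewed both products
v.dot(v.T)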
Product A B C
Product
A 2 1 1
B 1 1 0
C 1 0 1
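If the full cross-tabulation does not fit in memory, the same counts can be accumulated incrementally. The following is only a minimal sketch, not a tuned implementation: it goes customer by customer rather than product by product, because each customer contributes one count to every pair of products they reviewed. The function names `co_review_weights` and `to_graph` are hypothetical, and the sketch assumes the set of distinct (customer, product) pairs fits in memory.

```python
from collections import Counter, defaultdict
from itertools import combinations


def co_review_weights(rows):
    """Count, for each unordered product pair, the customers who reviewed both.

    `rows` is any iterable of (customer, product) tuples, so it can be fed
    lazily (e.g. from a CSV reader) instead of loading a full DataFrame.
    """
    # Collect each customer's distinct products; sets ignore duplicate reviews.
    products_by_customer = defaultdict(set)
    for customer, product in rows:
        products_by_customer[customer].add(product)

    weights = Counter()
    for products in products_by_customer.values():
        # Sorting gives every pair a canonical (smaller, larger) key.
        for a, b in combinations(sorted(products), 2):
            weights[(a, b)] += 1
    return weights


def to_graph(weights):
    """Build a weighted undirected networkx graph from the pair counts."""
    import networkx as nx  # imported here so the counting step runs without it

    graph = nx.Graph()
    graph.add_weighted_edges_from((a, b, w) for (a, b), w in weights.items())
    return graph


# The example from the question: customer 1 reviewed A and B, customer 2 A and C.
rows = [(1, "A"), (1, "B"), (2, "A"), (2, "C")]
weights = co_review_weights(rows)
```

Calling `to_graph(weights)` then gives a graph with edges A-B and A-C, each of weight 1. Products that never share a customer with another product do not appear as nodes in this sketch, and a customer with many reviews contributes a number of pairs that grows quadratically, so very active customers may need special handling.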