Comparing row values in a dataframe

Time:12-28

OK, so I'm trying to solve a challenge that says I have to add weights to a graph's edges. I'm free to choose whatever I want as the weight, and I chose to add 1 each time I find a duplicated row in the DataFrame. The data set is the hero network: https://www.kaggle.com/datasets/csanhueza/the-marvel-universe-social-network?select=hero-network.csv. It has many rows with 2 columns, so I need a way to compare each row with the others and, whenever a row appears more than once, add 1 to its weight. As for the .duplicated() function, I know for a fact that there are more rows with the same two heroes connected to each other. My problem is that I don't really know how to do this. Any help would be great!

CodePudding user response:

If you want to count how many times each unique row appears in a DataFrame, you can do it with groupby:

df.groupby(by=['hero1','hero2']).size()

Explanation: groupby creates groups of distinct rows; size() counts how many rows exist in each group.
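To make the result usable as edge weights, the counts can be turned back into a regular DataFrame. This is only a minimal sketch on a toy frame: the hero names are made up, and the column name 'weight' is my own choice, not something from the data set.

```python
import pandas as pd

# Toy stand-in for hero-network.csv; the hero names here are invented.
df = pd.DataFrame({
    "hero1": ["A", "A", "B", "A", "C"],
    "hero2": ["B", "B", "C", "C", "A"],
})

# One row per distinct (hero1, hero2) pair; 'weight' is how many times
# that exact pair appeared in the original frame.
edges = df.groupby(["hero1", "hero2"]).size().reset_index(name="weight")
print(edges)
```

Note that (A, C) and (C, A) are still counted as two different rows here, because groupby compares the columns in order.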

A more general solution that works for any number of columns in a DataFrame:

df.groupby(df.columns.tolist()).size()
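One thing to watch for: if the graph is meant to be undirected (that is an assumption on my part, the question doesn't say), then a row (A, B) and a row (B, A) describe the same edge, and a plain groupby will count them separately. A possible sketch is to sort each pair before grouping, again on made-up toy data:

```python
import numpy as np
import pandas as pd

# Toy data: (A, B) and (B, A) should end up as the same undirected edge.
df = pd.DataFrame({
    "hero1": ["A", "B", "A", "C"],
    "hero2": ["B", "A", "C", "A"],
})

# Sort the two names within each row so every pair has a canonical order.
pairs = np.sort(df[["hero1", "hero2"]].to_numpy(dtype=object), axis=1)
undirected = pd.DataFrame(pairs, columns=["hero1", "hero2"])

# Count the normalized pairs; 'weight' is a column name chosen for this example.
weights = undirected.groupby(["hero1", "hero2"]).size().reset_index(name="weight")
print(weights)
```

If the direction of an edge does matter for your challenge, skip the sorting step and group on the original columns.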

CodePudding user response:

I'm not sure this is exactly what you want, but it seems to match what you are asking for. The code below "compares rows" by checking, row by row, whether the two columns hold the same value. It sounds, though, like what you really need is a count of how often each name appears in a column, so I've included that as well, using value_counts().

import numpy as np
import pandas as pd
# Make a DataFrame with 2 columns, 'one' and 'two', filled with letters. Some rows match, some don't.
one=['a','b','c','d','a','a','a','b','f','g']
two=['a','b','c','a','b','c','g','g','f','f']
# Start every weight at 0.0
weight=np.zeros(len(one)).tolist()
data1 = {'one': one, 'two': two,'weight': weight}
df1 = pd.DataFrame(data1)
# df1.iat[i,1] reads row i of the right-hand column ('two')
for i in range(len(one)):
    # Weight is 1 when both columns hold the same value in this row
    if df1.iat[i,1]==df1.iat[i,0]:
        df1.iat[i,2] = 1
    else:
        # No match: the weight becomes NaN
        df1.iat[i,2]=None
print(df1)

df2=pd.DataFrame({'one': one, 'two': two})
print(df2['one'].value_counts())
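For a frame as large as the hero network, the row-by-row loop will be slow. As a rough sketch, the same comparison can be done on whole columns at once with np.where (this swaps in a vectorized approach; it is not the loop from above, just meant to give the same weights):

```python
import numpy as np
import pandas as pd

one = ['a','b','c','d','a','a','a','b','f','g']
two = ['a','b','c','a','b','c','g','g','f','f']
df1 = pd.DataFrame({'one': one, 'two': two})

# 1.0 where both columns hold the same value in a row, NaN everywhere else.
df1['weight'] = np.where(df1['one'] == df1['two'], 1.0, np.nan)
print(df1)
```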