Home > Back-end >  How to do I count the number of string combinations in each row of a pandas dataframe?
How to do I count the number of string combinations in each row of a pandas dataframe?

Time:07-21

I'm trying to count the number of times a combination of strings appear in each row of a dataframe. Each ID uses a number of methods (some IDs use more methods than others) and I want to count the number of times any two methods have been combined together.

# df is from csv and has blank cells - I've used empty strings to demo here
df = pd.DataFrame({'id': ['101', '102', '103', '104'],
    'method_1': ['HR', 'q-SUS', 'PEP', 'ET'],
    'method_2': ['q-SUS', 'q-IEQ', 'AUC', 'EEG'],
    'method_3': ['SC', '', 'HR', 'SC'],
    'method_4': ['q-IEQ', '', 'ST', 'HR'],
    'method_5': ['PEP', '', 'SC', '']})

print(df)

    id method_1 method_2 method_3 method_4 method_5
0  101       HR    q-SUS       SC    q-IEQ      PEP
1  102    q-SUS    q-IEQ                           
2  103      PEP      AUC       HR       ST       SC
3  104       ET      EEG       SC       HR         

I want to end up with a table that looks something like this: | Method A | Method B | Number of Times Combined| | :------: | :------: | :---------------------: | | HR | SC | 3 | | HR | q-SUS | 1 | | HR | PEP | 2 | | q-IEQ | q-SUS | 2 | | EEG | ET | 1 | | EEG | SC | 1 | | etc. | etc. | etc. |

So far I've been trying variations of this code using itertools.combinations and collections Counter:

import numpy as np
import pandas as pd
import itertools
from collections import Counter

def get_all_combinations_without_nan(row):
    # remove nan - this is for the blank csv cells
    set_without_nan = {value for value in row if isinstance(value, str)}

    # generate all combinations of values in row
    all_combinations = []
    for index, row in df.iterrows():  
        result = list(itertools.combinations(set_without_nan, 2))
        all_combinations.extend(result)

    return all_combinations

# get all possible combinations of values in a row
all_rows = df.apply(get_all_combinations_without_nan, 1).values
all_rows_flatten = list(itertools.chain.from_iterable(all_rows))

count_combinations = Counter(all_rows_flatten)

print(count_combinations)

It's doing something, but it seems to be counting multiple times or something (it's counting more combinations than are actually there. I've had a good look on Stack, but can't seem to solve this - everything seems really close though!

I hope someone can help - Thanks!

CodePudding user response:

Use DataFrame.melt for reshape with remove empty strings or missing values, then use DataFrame.merge for all combinations, remove rows with same methods and count by GroupBy.size:

df1 = df.melt('id', value_name='method_')
df1 = df1[(df1["method_"] != '') & (df1["method_"].notna())]

df = (df1.merge(df1, on='id', suffixes=('A','B'))
          .query("method_A != method_B")
          .groupby(['method_A','method_B'])
          .size()
          .reset_index(name='Number of Times Combined'))
print (df.head(20))
   method_A method_B  Number of Times Combined
0       AUC       HR                         1
1       AUC      PEP                         1
2       AUC       SC                         1
3       AUC       ST                         1
4       EEG       ET                         1
5       EEG       HR                         1
6       EEG       SC                         1
7        ET      EEG                         1
8        ET       HR                         1
9        ET       SC                         1
10       HR      AUC                         1
11       HR      EEG                         1
12       HR       ET                         1
13       HR      PEP                         2
14       HR       SC                         3
15       HR       ST                         1
16       HR    q-IEQ                         1
17       HR    q-SUS                         1
18      PEP      AUC                         1
19      PEP       HR                         2
    
  • Related