Filter out data with best combinations within two columns in pandas dataframe-CodePudding

I have a dataframe which looks like the below sample data

   col_1 col_2
0      1     A
1      1     B
2      1     C
3      2     A
4      2     B
5      3     C
6      3     D
7      4     D
8      4     A
9      4     A
10     5     B

I want filter out this data and find out the best combinations covering each value at least once.

For example in the above data the first combination is [1,A] and when we search further the next available combination is [1,B] but it's is giving me only one new value i.e. 'B' as '1' is already covered in the first combination. So we should search further before finalizing the second combination. On searching further we can find a better combination i.e. [2,B] giving us both the new values. This way we can search for further combinations. The expected output is:

  col_1 col_2
0     1     A
1     2     B
2     3     C
3     4     D
4     5     B

Whatever I tried to filter out this data didn't work for me.

Can anyone provide a solution or guide me in the right direction?

CodePudding user response：

My solution is:

l1=list()
l2=list()
l=list()
for i,j in df.iterrows():
    if j.col_1 not in l1 and j.col_2 not in l2:
        l1.append(j.col_1)
        l2.append(j.col_2)
        l.append([j.col_1,j.col_2])
pd.DataFrame(l , columns=["col_1","col_2"])

Output:

   col_1  col_2
0   1      A
1   2      B
2   3      C
3   4      D

CodePudding user response：

Iteratively adding pairs while using groupby to count the available unused options, commented inline for clarity. Superior to other iterative approaches as the loop has finishes in max df['col_2'].nunique() iterations, possibly faster.

df = pd.DataFrame({
    'col_1': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3, 7: 4, 8: 4, 9: 4, 10: 5},
    'col_2': {0: 'A', 1: 'B', 2: 'C', 3: 'A', 4: 'B', 5: 'C', 6: 'D', 7: 'D', 8: 'A', 9: 'A', 10: 'B'},
})
pairs = pd.DataFrame(columns=df.columns)

# Iterate and get all col_1, col_2 pair such that the combination is optimal
# - col_2 not yet in pairs
# - col_1 associated uniquely (or minimally) to that col_2, once previous combinations were removed
while len(df[~df['col_2'].isin(pairs['col_2'])]):
    options = df[~df['col_2'].isin(pairs['col_2'])].groupby('col_1')['col_2'].unique()
    n_options = options.str.len()
    if n_options.eq(1).any():
        add = options.index[n_options.eq(1)]
    else:
        add = [n_options.idxmin()]
    pairs = pairs.append(options[add].reset_index().explode('col_2'), ignore_index=True)

# Now all possible col_2 are in pairs, add missing col_1 with any col_2
pairs = pairs.append(df[~df['col_1'].isin(pairs['col_1'])].drop_duplicates('col_1'), ignore_index=True)

print(pairs)

This returns the following:

  col_1 col_2
0     5     B
1     2     A
2     1     C
3     4     D
4     3     C

CodePudding user response：

df = df.pivot(columns='col_2', values='col_1')
desired_df = []
last_uniques = []
for col in df.columns:
    uniques = df[col].dropna().unique()
    desired = list(set(uniques) - set(last_uniques))
    desired_df.append({'col_1': desired[0], 'col_2': col})
    last_uniques.append(desired[0])

for col in df.columns:
    uniques = list(set(df[col].dropna()) - set(last_uniques))
    if uniques:
        desired_df.append({'col_1': uniques[0], 'col_2': col})
        last_uniques.append(uniques[0])

df_final = pd.DataFrame(data=desired_df)

OUTPUT:

   col_1 col_2
0      1     A
1      2     B
2      3     C
3      4     D
4      5     B

CodePudding user response：

I think that the best way to achieve this is with an iterative algorithm.

First, you need to extract columns, then you define a get_score function to understand how much value a combination has, given an history of combinations already set as valuable.

Last, you append either combinations that are good (score == MAX_SCORE) or the last combination, as in your example (i == last_index).

In the end you will obtain a list of combinations (or tuples), that you can easily cast back to a DataFrame.

# define max score of a combination
MAX_SCORE = 2

# extract columns as lists
x, y = df.col_1.to_list(), df.col_2.to_list()

# function to compute score
def get_score(combination, old_combinations):
    x, y = combination
    if not old_combinations:
        return MAX_SCORE
    
    # split a list of two-sized tuples in two lists 
    xs, ys = [*zip(*old_combinations)]
    score = MAX_SCORE
    if x in xs:
        score -= 1
    if y in ys:
        score -= 1
    return score


# get list of pairs
iterator = list(zip(x, y))
last_index = len(iterator) - 1
combinations = []

# algorithm
for i, combination in enumerate(iterator):
    score = get_score( combination, combinations )
    if score == MAX_SCORE or i == last_index:
        combinations.append( combination )

# result
combinations  # [(1, 'A'), (2, 'B'), (3, 'C'), (4, 'D'), (5, 'B')]

new_df = pd.DataFrame(combinations, columns=["col_1", "col_2"])
print(new_df)

This prints the following:

   col_1 col_2
0      1     A
1      2     B
2      3     C
3      4     D
4      5     B