Home > front end >  How to aggregate rows from a CSV, excluding ones based on a list of values
How to aggregate rows from a CSV, excluding ones based on a list of values

Time:10-17

I have a csv with the following data:

"id","Title","Author(s)","Format","Size","Tags"
"1","Horse","John","KFX","122","Classic"
"1","Horse","John","KFX","122","Drama"
"1","Horse","John","KFX","122","Horror"
"1","Horse","John","AZW3","122","Classic"
"1","Horse","John","AZW3","122","Drama"
"1","Horse","John","AZW3","122","Horror"
"1","Horse","John","PDF","122","Classic"
"1","Horse","John","PDF","122","Drama"
"1","Horse","John","PDF","122","Horror"
"2","Banana","Anna","AZW3","312","SciFi"
"2","Banana","Julia","AZW3","312","SciFi"
"2","Banana","Anna","PDF","312","SciFi"
"2","Banana","Julia","PDF","312","SciFi"

I would like to, using pandas, get this:

"id","Title","Author(s)","Format","Size","Tags"
"1","Horse","John","KFX","122","Classic, Drama, Horror"
"2","Banana","Anna, Julia","AZW3","312","SciFi"

Where it will exclude rows based in values of a list KFX, AZW3, PDF

If it has KFX, exclude the other two. If it doesn't have KFX, check for the next in the list, AZW3, if exist, exclude the PDF (if exists)

Then, aggregate Author(s) and Tags.

I am searching for ways of doing this and so far had no success. I'm trying to group by id, then exclude the rows based on list (KFX, AZW3, PDF), then aggregate.

I was able to aggregate using:

df.groupby(['id']).agg(lambda x: ','.join(x))

But there are two problems: First, it will include rows containing all "Format". Second, it will repeat other columns as well "1, 1, 1","Horse, Horse, Horse","John, John, John"

and so on.

CodePudding user response:

You can use custom functions to aggregate:

def uniq_str(l):
    return ', '.join(dict.fromkeys(map(str, l)))

def agg_format(l):
    s = set(l) # not necessary if only 1 comparison
    if 'KFX' in s:
        return 'KFX'
    else:
        exclude = {'PDF'}
        return next((x for x in l if x not in exclude), None)
    
df = pd.read_csv('file.csv')
    
agg = {c: uniq_str for c in df}
agg['Format'] = agg_format

import csv 
(df.groupby('id', as_index=False).agg(agg)
   .to_csv('out.csv', index=False, quoting=csv.QUOTE_ALL)
)

output:

"id","Title","Author(s)","Format","Size","Tags"
"1","Horse","John","KFX","122","Classic, Drama, Horror"
"2","Banana","Anna, Julia","AZW3","312","SciFi"
  • Related