Home > Software engineering >  Problems with concatenating rows using pd.groupby().agg()
Problems with concatenating rows using pd.groupby().agg()

Time:11-04

For example, let's say that I have the dataframe:

NAME = ['BOB', 'BOB', 'BOB', 'SUE', 'SUE', 'MARY', 'JOHN', 'JOHN', 'MARK', 'MARK', 'MARK', 'MARK']
    STATE = ['CA','CA','CA','DC','DC','PA','GA','GA','NY','NY','NY','NY']
    MAJOR = ['MARKETING','BUSINESS ADM',np.nan,'ECONOMICS','MATH','PSYCHOLOGY','HISTORY','BUSINESS ADM','MATH', 'MEDICAL SCIENCES',np.nan,np.nan]
    SCHOOL = ['UCLA','UCSB','CAL STATE','HARVARD','WISCONSIN','YALE','CHICAGO','MIT','UCSD','UCLA','CAL STATE','COMMUNITY']
    data = {'NAME':NAME, 'STATE':STATE,'MAJOR':MAJOR, 'SCHOOL':SCHOOL}
    df = pd.DataFrame(data)

I am to concatenate rows with multiple unique values for the same name.

I tried:

  gr_columns = [x for x in df1.columns if x not in ['MAJOR','SCHOOL']]
    df1 = df1.groupby(gr_columns).agg(lambda col: '|'.join(col))

and expected

I am trying to concatenate rows in columns where the NAME field is the same. Conveniently, the STATE field is static for each NAME. I would like the output to look like:

NAME STATE MAJOR SCHOOL
BOB CA MARKETING,BUSINESS ADM UCLA,UCSB,CAL STATE
SUE DC ECONOMICS,MATH HARVARD,WISCONSIN
MARY PA PSYCHOLOGY YALE
JOHN GA HISTORY,BUSINESS ADM CHICAGO,MIT
MARK NY MATH,MEDICAL SCIENCES UCSD,UCLA,CAL STATE,COMMUNITY

but instead, I get a single column containing the concatenated schools.

CodePudding user response:

It is because your np.nan cannot be converted to str, so it is dropped automatically by pandas. You need to convert its type to str first:

df.groupby(['NAME', 'STATE']).agg(lambda x: ','.join(x.astype(str)))

To drop na and keep NAME and STATE as columns:

df.groupby(['NAME', 'STATE']).agg(lambda x: ','.join(x.dropna())).reset_index()
  • Related