For example, let's say that I have the dataframe:
NAME = ['BOB', 'BOB', 'BOB', 'SUE', 'SUE', 'MARY', 'JOHN', 'JOHN', 'MARK', 'MARK', 'MARK', 'MARK']
STATE = ['CA','CA','CA','DC','DC','PA','GA','GA','NY','NY','NY','NY']
MAJOR = ['MARKETING','BUSINESS ADM',np.nan,'ECONOMICS','MATH','PSYCHOLOGY','HISTORY','BUSINESS ADM','MATH', 'MEDICAL SCIENCES',np.nan,np.nan]
SCHOOL = ['UCLA','UCSB','CAL STATE','HARVARD','WISCONSIN','YALE','CHICAGO','MIT','UCSD','UCLA','CAL STATE','COMMUNITY']
data = {'NAME':NAME, 'STATE':STATE,'MAJOR':MAJOR, 'SCHOOL':SCHOOL}
df = pd.DataFrame(data)
I am to concatenate rows with multiple unique values for the same name.
I tried:
gr_columns = [x for x in df1.columns if x not in ['MAJOR','SCHOOL']]
df1 = df1.groupby(gr_columns).agg(lambda col: '|'.join(col))
and expected
I am trying to concatenate rows in columns where the NAME field is the same. Conveniently, the STATE field is static for each NAME. I would like the output to look like:
NAME | STATE | MAJOR | SCHOOL |
---|---|---|---|
BOB | CA | MARKETING,BUSINESS ADM | UCLA,UCSB,CAL STATE |
SUE | DC | ECONOMICS,MATH | HARVARD,WISCONSIN |
MARY | PA | PSYCHOLOGY | YALE |
JOHN | GA | HISTORY,BUSINESS ADM | CHICAGO,MIT |
MARK | NY | MATH,MEDICAL SCIENCES | UCSD,UCLA,CAL STATE,COMMUNITY |
but instead, I get a single column containing the concatenated schools.
CodePudding user response:
It is because your np.nan
cannot be converted to str, so it is dropped automatically by pandas. You need to convert its type to str first:
df.groupby(['NAME', 'STATE']).agg(lambda x: ','.join(x.astype(str)))
To drop na and keep NAME and STATE as columns:
df.groupby(['NAME', 'STATE']).agg(lambda x: ','.join(x.dropna())).reset_index()