The following is a minimal example of my data:
Id name class_cd class_name
0 1 A abc1 dog
1 1 A def2 canine
2 1 A ghi1 safe
3 2 B abc1 cat
4 2 B def2 tabby
Can be reproduced with:
df = pd.DataFrame({
'Id': [1, 1, 1, 2, 2],
'name':['A', 'A', 'A', 'B', 'B'],
'class_cd': ['abc1', 'def2', 'ghi1', 'abc1', 'def2'],
'class_name': ['dog', 'canine', 'safe', 'cat', 'tabby']
})
I want the class_cd
distinct values to become new columns, where the value is the associated class_name
, such that the result contains one row for each id
.
Expected outcome:
Id name abc1 def2 ghi1
0 1 A dog canine safe
1 2 B cat tabby
How could one achieve this with Pandas?
CodePudding user response:
You can try:
(df.pivot(index=['Id', 'name'], columns='class_cd', values='class_name')
.fillna('')
.reset_index())
class_cd Id name abc1 def2 ghi1
0 1 A dog canine safe
1 2 B cat tabby
CodePudding user response:
This is a job for pivot
.
You tell it which columns you want to expand, and what values to put in those new columns. It will use unique values from the specified index to create the rows in the result.
>>> df.pivot(index=['Id','name'], columns='class_cd', values='class_name')
class_cd abc1 def2 ghi1
Id name
1 A dog canine safe
2 B cat tabby NaN
Then, you can call reset_index()
to flatten the multi-index into columns.
class_cd Id name abc1 def2 ghi1
0 1 A dog canine safe
1 2 B cat tabby NaN
CodePudding user response:
As an altervative using crosstab
:
dfx=pd.crosstab([df['Id'],df['name']], df['class_cd'],values=df['class_name'],aggfunc=','.join)
Output:
abc1 def2 ghi1
Id name
1 A dog canine safe
2 B cat tabby NaN
CodePudding user response:
I would like to do it by pandas and sql 1.import sql
!pip install pandasql
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
take out dataframe from class_cd
df1=df[df['class_cd']=='abc1']
df2=df[df['class_cd']=='def2']
df3=df[df['class_cd']=='ghi3']
- Use sql to join three tables
query="""
select tt1.Id, tt1.name, tt1.abc1,tt1.def2, t3.class_name as 'ghi3'
from
(select t1.Id,t1.name,t1.class_name as 'abc1', t2.class_name as 'def2'
from df1 as t1
join df2 as t2
on t1.name=t2.name) as tt1
left join df3 as t3
on tt1.name = t3.name
"""
4.outcome
df_result=pysqldf(query)
print(df_result)