Pandas : Pivot row into column-CodePudding

The following is a minimal example of my data:

   Id name class_cd class_name
0   1    A     abc1        dog
1   1    A     def2     canine
2   1    A     ghi1       safe
3   2    B     abc1        cat
4   2    B     def2      tabby

Can be reproduced with:

df = pd.DataFrame({
    'Id': [1, 1, 1, 2, 2],
    'name':['A', 'A', 'A', 'B', 'B'],
    'class_cd': ['abc1', 'def2', 'ghi1', 'abc1', 'def2'],
    'class_name': ['dog', 'canine', 'safe', 'cat', 'tabby']
})

I want the class_cd distinct values to become new columns, where the value is the associated class_name, such that the result contains one row for each id.

Expected outcome:

    Id  name    abc1    def2    ghi1
0   1      A     dog  canine    safe
1   2      B     cat   tabby

How could one achieve this with Pandas?

CodePudding user response：

You can try:

(df.pivot(index=['Id', 'name'], columns='class_cd', values='class_name')
 .fillna('')
 .reset_index())

class_cd  Id name abc1    def2  ghi1
0          1    A  dog  canine  safe
1          2    B  cat   tabby

CodePudding user response：

This is a job for pivot.

You tell it which columns you want to expand, and what values to put in those new columns. It will use unique values from the specified index to create the rows in the result.

>>> df.pivot(index=['Id','name'], columns='class_cd', values='class_name')
class_cd abc1    def2  ghi1
Id name
1  A      dog  canine  safe
2  B      cat   tabby   NaN

Then, you can call reset_index() to flatten the multi-index into columns.

class_cd  Id name abc1    def2  ghi1
0          1    A  dog  canine  safe
1          2    B  cat   tabby   NaN

CodePudding user response：

As an altervative using crosstab:

dfx=pd.crosstab([df['Id'],df['name']], df['class_cd'],values=df['class_name'],aggfunc=','.join)

Output:

          abc1    def2  ghi1
Id name                    
1  A      dog  canine  safe
2  B      cat   tabby   NaN

CodePudding user response：

I would like to do it by pandas and sql 1.import sql

!pip install pandasql

from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())

take out dataframe from class_cd

df1=df[df['class_cd']=='abc1']
df2=df[df['class_cd']=='def2']
df3=df[df['class_cd']=='ghi3']

Use sql to join three tables

query="""
select tt1.Id, tt1.name, tt1.abc1,tt1.def2, t3.class_name as 'ghi3'
from
(select t1.Id,t1.name,t1.class_name as 'abc1', t2.class_name as 'def2'
from df1 as t1 
join df2 as t2 
on t1.name=t2.name) as tt1

left join df3 as t3
on tt1.name = t3.name 

"""

4.outcome

df_result=pysqldf(query)
print(df_result)