I have table with
|Majors|First Minor|Second Minor|
|Philosophy|English|Null|
|Math|Philosophy|English|
|English|Theology|Political Science
|Philosophy|Theology|English|
The possible values of the "First Minor" and "Second Minor" columns are the same (English, Philosophy, Math, etc)
I want to see the correlation between "Major" and the "Minors" belonging to those majors. I don't care if the minor is a first or second minor. So based on the above sample I want an output that looks something like:
|Major|Philosophy|Math|English
First Minor||||
Philosophy|0|1|0|
|Math|0|0|0|
|English|2|1|0|
Political Science|0|0|1|
Theology|1|0|1|
Null|1|0|0
The following code generates the desired result for just the "First Minor" or the "Second Minor":
ct = pd.crosstab(newdf.MajorInterMinor, newdf["First Minor"])
But I want to get the results for both columns ("First Minor" and "Second Minor") and I can't figure out how to get the combined result.
Any help/advice is much appreciated!
CodePudding user response:
melt
the Minors, then apply crosstab
:
df2 = df.melt(id_vars='Majors', value_name='Minors')
pd.crosstab(df2['Majors'], df2['Minors'])
Output:
Minors English Null Philosophy Political Science Theology
Majors
English 0 0 0 1 1
Math 1 0 1 0 0
Philosophy 2 1 0 0 1
CodePudding user response:
pd.melt to collapse FirstMinor
, SecondMinor
into one column and make new df
df2 = pd.melt(df,id_vars=['Majors'], value_vars=['FirstMinor','SecondMinor'],var_name='myVarname', value_name='FirstMinor_SecondMinor')
Slice new df to get_dummies on Majors. Join this back to the df. Groupby and sum
pd.DataFrame(df2.loc[:,'FirstMinor_SecondMinor']).join(pd.get_dummies(df2.loc[:,'Majors'])).groupby('FirstMinor_SecondMinor').agg(sum)
English Math Philosophy
FirstMinor_SecondMinor
English 0 1 2
Null 0 0 1
Philosophy 0 1 0
PoliticalScience 1 0 0
Theology 1 0 1