I'm good in Python but new to Pandas and know almost nothing about stats so forgive me if this is a simple or ignorant question.
Say I have a dataframe with two columns like, e.g., Jobs
and Cars
where the entries in both are one of a finite set of strings, e.g. [Software Engineer, Sysadmin, Product Manager]
and [Tesla, Hummer, Ford Focus]
.
I want to produce a table of some sort showing the correlation between jobs and cars, like this:
| | Tesla | Hummer | Ford Focus |
| SWE | ### | ### | ### |
| SA | ### | ### | ### |
| PM | ### | ### | ### |
What's the most pythonic way of doing this? Honestly this is a one off query and my data set is pretty small so it doesn't need to be the most efficient.
Edit: A sample dataframe could be generated like this
from random import choice
jobs = ['SWE', 'Data Scientist', 'Product Manager', 'Sysadmin', 'Data Engineer']
cars = ['Tesla', 'Hummer', 'Ford Focus', 'Chevy Volt', 'Toyota Tercel']
df = pd.DataFrame({
'jobs': [choice(jobs) for _ in range(1000) ],
'cars': [choice(cars) for _ in range(1000) ]
})
The expected output would be similar to that of DataFrame.corr()
but that function only operates on numbers and these are strings.
CodePudding user response:
You are looking for crosstabs
pd.crosstab(df['jobs'], df['cars'], normalize = True)