Home > Blockchain >  Calculating correlation of two text columns in Pandas
Calculating correlation of two text columns in Pandas

Time:04-26

I'm good in Python but new to Pandas and know almost nothing about stats so forgive me if this is a simple or ignorant question.

Say I have a dataframe with two columns like, e.g., Jobs and Cars where the entries in both are one of a finite set of strings, e.g. [Software Engineer, Sysadmin, Product Manager] and [Tesla, Hummer, Ford Focus].

I want to produce a table of some sort showing the correlation between jobs and cars, like this:

|     | Tesla | Hummer | Ford Focus |
| SWE | ###   | ###    | ###        |
| SA  | ###   | ###    | ###        |
| PM  | ###   | ###    | ###        |

What's the most pythonic way of doing this? Honestly this is a one off query and my data set is pretty small so it doesn't need to be the most efficient.

Edit: A sample dataframe could be generated like this

from random import choice

jobs = ['SWE', 'Data Scientist', 'Product Manager', 'Sysadmin', 'Data Engineer']
cars = ['Tesla', 'Hummer', 'Ford Focus', 'Chevy Volt', 'Toyota Tercel']

df = pd.DataFrame({
    'jobs': [choice(jobs) for _ in range(1000) ], 
    'cars': [choice(cars) for _ in range(1000) ]
})

The expected output would be similar to that of DataFrame.corr() but that function only operates on numbers and these are strings.

CodePudding user response:

You are looking for crosstabs

pd.crosstab(df['jobs'], df['cars'], normalize = True)
  • Related