Home > Blockchain >  Creating correlation of multiple non numeric columns in Python
Creating correlation of multiple non numeric columns in Python

Time:06-25

I have a very long data frame, I want to know the correlations of each and every element, the actual data frame is about 40 columns wide, so I am writing this example here.

Identifier Column1 Column2 Column3 Column4
1 Dog Cow Sheep Dinosaur
2 Dog Pig
3 Bull Elephant Boar

What I want is 3 new columns that would look like this

Identifier Var1 Var2
1 Dog Cow
1 Dog Sheep
1 Dog Dinosaur
1 Cow Sheep
1 Cow Dinosaur
1 Sheep Dinosaur
2 Dog Pig
3 Bull Elephant
3 Bull Boar
3 Elephant Boar

Eventually the reason I need them like this is to create a correlation matrix in tableau (It cannot be done in any other software)

If you guys know any other way to create a correlation matrix of all these variables, please let me know, this is the best way I found out but I still need to create this new table anyway.

And an extra to make it more difficult and interesting! Sometimes some of the columns will be empty, if that happens then it should stop and not continue.

Thanks in advance

CodePudding user response:

from itertools import combinations
data = pd.DataFrame(df.apply(lambda x: [x.Identifier, list(combinations(x.drop('Identifier'), 2))], axis=1).tolist()).explode(1)
df2 = pd.DataFrame(data[1].tolist(), index=data[0]).dropna(thresh=2)
df2.columns = ["Var1","Var2"]
df2.reset_index(inplace=True)
df2.rename(columns={0:"Identifier"}, inplace=True)
df2

output:

    Identifier  Var1    Var2
0   1   Dog Cow
1   1   Dog Sheep
2   1   Dog Dinosaur
3   1   Cow Sheep
4   1   Cow Dinosaur
5   1   Sheep   Dinosaur
6   2   Dog Pig
7   3   Bull    Elephant
8   3   Bull    Boar
9   3   Elephant    Boar
  • Related