Creating correlation of multiple non numeric columns in Python-CodePudding

I have a very long data frame, I want to know the correlations of each and every element, the actual data frame is about 40 columns wide, so I am writing this example here.

Identifier	Column1	Column2	Column3	Column4
1	Dog	Cow	Sheep	Dinosaur
2	Dog	Pig
3	Bull		Elephant	Boar

What I want is 3 new columns that would look like this

Identifier	Var1	Var2
1	Dog	Cow
1	Dog	Sheep
1	Dog	Dinosaur
1	Cow	Sheep
1	Cow	Dinosaur
1	Sheep	Dinosaur
2	Dog	Pig
3	Bull	Elephant
3	Bull	Boar
3	Elephant	Boar

Eventually the reason I need them like this is to create a correlation matrix in tableau (It cannot be done in any other software)

If you guys know any other way to create a correlation matrix of all these variables, please let me know, this is the best way I found out but I still need to create this new table anyway.

And an extra to make it more difficult and interesting! Sometimes some of the columns will be empty, if that happens then it should stop and not continue.

Thanks in advance

CodePudding user response：

from itertools import combinations
data = pd.DataFrame(df.apply(lambda x: [x.Identifier, list(combinations(x.drop('Identifier'), 2))], axis=1).tolist()).explode(1)
df2 = pd.DataFrame(data[1].tolist(), index=data[0]).dropna(thresh=2)
df2.columns = ["Var1","Var2"]
df2.reset_index(inplace=True)
df2.rename(columns={0:"Identifier"}, inplace=True)
df2

output:

    Identifier  Var1    Var2
0   1   Dog Cow
1   1   Dog Sheep
2   1   Dog Dinosaur
3   1   Cow Sheep
4   1   Cow Dinosaur
5   1   Sheep   Dinosaur
6   2   Dog Pig
7   3   Bull    Elephant
8   3   Bull    Boar
9   3   Elephant    Boar