I have a very long data frame, I want to know the correlations of each and every element, the actual data frame is about 40 columns wide, so I am writing this example here.
Identifier | Column1 | Column2 | Column3 | Column4 |
---|---|---|---|---|
1 | Dog | Cow | Sheep | Dinosaur |
2 | Dog | Pig | ||
3 | Bull | Elephant | Boar |
What I want is 3 new columns that would look like this
Identifier | Var1 | Var2 |
---|---|---|
1 | Dog | Cow |
1 | Dog | Sheep |
1 | Dog | Dinosaur |
1 | Cow | Sheep |
1 | Cow | Dinosaur |
1 | Sheep | Dinosaur |
2 | Dog | Pig |
3 | Bull | Elephant |
3 | Bull | Boar |
3 | Elephant | Boar |
Eventually the reason I need them like this is to create a correlation matrix in tableau (It cannot be done in any other software)
If you guys know any other way to create a correlation matrix of all these variables, please let me know, this is the best way I found out but I still need to create this new table anyway.
And an extra to make it more difficult and interesting! Sometimes some of the columns will be empty, if that happens then it should stop and not continue.
Thanks in advance
CodePudding user response:
from itertools import combinations
data = pd.DataFrame(df.apply(lambda x: [x.Identifier, list(combinations(x.drop('Identifier'), 2))], axis=1).tolist()).explode(1)
df2 = pd.DataFrame(data[1].tolist(), index=data[0]).dropna(thresh=2)
df2.columns = ["Var1","Var2"]
df2.reset_index(inplace=True)
df2.rename(columns={0:"Identifier"}, inplace=True)
df2
output:
Identifier Var1 Var2
0 1 Dog Cow
1 1 Dog Sheep
2 1 Dog Dinosaur
3 1 Cow Sheep
4 1 Cow Dinosaur
5 1 Sheep Dinosaur
6 2 Dog Pig
7 3 Bull Elephant
8 3 Bull Boar
9 3 Elephant Boar