My question is similar to this one, but still different. I have a list of triplets like the following, representing rows and columns of a matrix with their cell value:
a = [("g1","g2",7),("g1","g3",5)]
The matrix is symmetrical, so the elements can be provided in any order - meaning that ("g1","g2",7)
would imply ("g2","g1",7)
.
I would like to obtain a pandas df from this list, representing a matrix that has element names on the rows and columns, with missing values if a triplet is not listed in a
:
g1 g2 g3
g1 NaN 7 5
g2 7 NaN Nan
g3 5 NaN Nan
can you help me achieve this task in the most efficient way for huge lists?
CodePudding user response:
Maybe not the most elegant of all solution, but it does the job. For your list, identify all the "tags" you have (e.g., g1,g2,....gn).
import numpy as np
import pandas as pd
a = (("g1","g2",7),("g1","g3",5))
tags = []
for t1, t2, _ in a:
tags = [t1, t2]
tags = index = columns = sorted(list(set(tags)))
Once you,ve done that, put them in a list
tags = dict((t, i) for i, t in enumerate(tags))
Create an empty table of the dimension of that list, replace all the 0 values with nan
and then replace the nan
values corresponding to the right tag and make that into a dataframe
Table = np.empty((len(tags),len(tags)))
Table[:] = np.nan
for t1, t2, corr in a:
Table[tags[t1]][tags[t2]] = corr
Table[tags[t2]][tags[t1]] = corr
df = pd.DataFrame(Table, index=index, columns=columns)
which returns
g1 g2 g3
g1 NaN 7.0 5.0
g2 7.0 NaN NaN
g3 5.0 NaN NaN
CodePudding user response:
You're going to need to define an order for your matrix labels as well
import pandas as pd
a = [("g1","g2",7),("g1","g3",5)]
order = ['g1', 'g2', 'g3']
Pivot your data
a_df = pd.DataFrame(a).pivot(index=0, columns=1, values=2)
a_df.index.name = None
a_df.columns.name = None
Reindex it so that it's got the correct order on both axes
a_df = a_df.reindex(order, axis=1).reindex(order, axis=0)
Add the array to it's transpose, only where the array is NaN (in case you have any values on the diagonal)
a_df = a_df.add(a_df.T[pd.isna(a_df)], fill_value=0)
>>> a_df
1 g1 g2 g3
0
g1 NaN 7.0 5.0
g2 7.0 NaN NaN
g3 5.0 NaN NaN