I have a large collection of CSVs, each containing
- an "id" column of values, and
- a "neighbors" column with lists of strings.
Values from id do not occur in any neighbors lists. Values in id are unique per CSV, no CSV contains all the id values, and if two CSVs share an id value, then the rows refer to the same object.
I would like to create something akin to a bipartite adjacency matrix from these CSVs, with a row for each id value i, a column for each neighbors string j, and a 1 in cell (i, j) if any of the CSVs contains a row with id value i where string j occurs in the neighbors list.
The code below does what it's supposed to, but it takes a very long time. Is there a more efficient way of getting the job done?
import pandas as pd

adjacency_matrix = pd.DataFrame()
list_of_csv_files = [csv_file1, csv_file2, ...]

for file in list_of_csv_files:
    df = pd.read_csv(file)
    df.set_index('id', inplace=True)
    for i in df.index:
        for j in df.at[i, 'neighbors']:
            adjacency_matrix.at[i, j] = 1
Example:
Given that the CSVs are loaded into the list list_of_dataframes = [df1, df2], with

df1 = pd.DataFrame(data={'id': ['11', '12'], 'neighbors': [['a'], ['a', 'b']]})

   id neighbors
0  11       [a]
1  12    [a, b]

df2 = pd.DataFrame(data={'id': ['11', '13'], 'neighbors': [['c'], ['d']]})

   id neighbors
0  11       [c]
1  13       [d]
I seek the dataframe
      a    b    c    d
11    1  NaN    1  NaN
12    1    1  NaN  NaN
13  NaN  NaN  NaN    1
CodePudding user response:
Yes, there is a cleaner and more efficient way using concat, explode and get_dummies. You can try this:
import pandas as pd

list_of_csv_files = [csv_file1, csv_file2, ...]
list_of_dfs = [pd.read_csv(file) for file in list_of_csv_files]

# Stack all CSVs and expand each neighbors list into one row per (id, neighbor) pair
out = pd.concat(list_of_dfs)
out = out.explode('neighbors').drop_duplicates(ignore_index=True)
out.set_index('id', inplace=True)

# One-hot encode the neighbor strings, then merge the rows of ids that appear in several CSVs
out = pd.get_dummies(out, prefix='', prefix_sep='')
out = out.groupby('id').sum()
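For instance, on the two example frames from your question this gives roughly the following (a minimal sketch, with 0 rather than NaN for absent pairs):

import pandas as pd

df1 = pd.DataFrame(data={'id': ['11', '12'], 'neighbors': [['a'], ['a', 'b']]})
df2 = pd.DataFrame(data={'id': ['11', '13'], 'neighbors': [['c'], ['d']]})

# Same pipeline as above, applied to the in-memory example frames
out = pd.concat([df1, df2])
out = out.explode('neighbors').drop_duplicates(ignore_index=True)
out.set_index('id', inplace=True)
out = pd.get_dummies(out, prefix='', prefix_sep='')
out = out.groupby('id').sum()

print(out)
#     a  b  c  d
# id
# 11  1  0  1  0
# 12  1  1  0  0
# 13  0  0  0  1

If you need the NaN form from your expected output, out.where(out > 0) (or out.replace(0, float('nan'))) converts the zeros afterwards. One caveat: if the neighbors lists are stored as text in the CSV files, pd.read_csv will return them as plain strings, and they may need to be parsed (e.g. with ast.literal_eval) before explode will split them into separate rows.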