Generate a sparse dataframe from a list of index positions and values

New learner here. I have a list of data values, each labeled by a comma-delimited string that gives its position in a dataframe: the string encodes the row (say 1-20) and column (say A-L) where the corresponding value should go. The populated dataframe would be sparse, with many empty cells. I am working with pandas for the first time on this project and am still learning the ropes.

position  value
1,A       32
1,F       16
2,B       234
2,C       1345
2,E       13
2,G       999
3,D       5332
4,B       12
etc.

I have been trying various approaches, but am not satisfied. I created dummy entries for empty cells in the completed dataframe, then iterated over the list to write the value to the correct cell. It works but it is not elegant and it seems like a brittle solution.
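Roughly, this is what I have been doing; the hard-coded sample list below is just a stand-in for my real input:

import pandas as pd

# Pre-fill every cell with a dummy value, then walk the list and
# overwrite one cell at a time.
rows = range(1, 21)             # row labels 1-20
columns = list('ABCDEFGHIJKL')  # column labels A-L
df = pd.DataFrame('', index=rows, columns=columns)

data = [('1,A', 32), ('1,F', 16), ('2,B', 234)]   # (position, value) pairs
for position, value in data:
    row, col = position.split(',')
    df.loc[int(row), col] = value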

I can pre-generate a dataframe and populate it, or generate a new dataframe as part of the population process: either solution would be fine. It seems like this should be a simple task. Maybe even a one-liner! But I am stumped. I would appreciate any pointers.

CodePudding user response:

As you suggest, the simplest method might be a for-loop to initialize the non-empty values. You can also use numpy advanced indexing to overwrite the underlying array data. Both are shown below.

text = """\
row,column,value
1,A 32
1,F 16
2,B 234
2,C 1345
2,E 13
2,G 999
3,D 5332
4,B 12
"""
from io import StringIO
import numpy as np
import pandas as pd

# Load your data and convert the column letters to integers.
# Note: Your example data is delimited with both spaces and commas,
#       which is why we need a regex 'sep' argument here.
entries = pd.read_csv(StringIO(text), sep='[ ,]+', engine='python')
entries['icol'] = entries['column'].map(lambda c: ord(c) - ord('A'))

# Construct an empty DataFrame with the appropriate index and columns.
rows = range(1, 1 + entries['row'].max())
columns = [chr(ord('A') + i) for i in range(1 + entries['icol'].max())]
df = pd.DataFrame(index=rows, columns=columns)

# Populate the DataFrame.
for e in entries.itertuples():
    df.loc[e.row, e.column] = e.value

# Alternatively, use numpy advanced indexing
# to overwrite the underlying array data:
irows = entries['row'].values - 1
icols = entries['icol'].values
df.values[irows, icols] = entries['value'].values

Result:

     A    B     C     D    E    F    G
1   32  NaN   NaN   NaN  NaN   16  NaN
2  NaN  234  1345   NaN   13  NaN  999
3  NaN  NaN   NaN  5332  NaN  NaN  NaN
4  NaN   12   NaN   NaN  NaN  NaN  NaN
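One caveat on the advanced-indexing trick: df.values is only a writable view of the underlying data in some cases (it depends on the DataFrame's dtypes), so a safer variant is to build the array first and wrap it in a DataFrame afterwards. A minimal sketch, reusing rows and columns from above; the np.full call and the float dtype are my choices here, not something your data requires:

# Build a float array pre-filled with NaN, scatter the known values
# into it with advanced indexing, then construct the DataFrame once.
arr = np.full((entries['row'].max(), entries['icol'].max() + 1), np.nan)
arr[entries['row'].values - 1, entries['icol'].values] = entries['value'].values
df2 = pd.DataFrame(arr, index=rows, columns=columns)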

CodePudding user response:

This is a standard unstack:

entries.set_index(['row','column']).unstack()

where entries is defined as in @StuartBerg's answer:

entries = pd.read_csv(StringIO(text), sep='[ ,]+', engine='python')

output:

        value
column      A      B       C       D     E     F      G
row
1        32.0    NaN     NaN     NaN   NaN  16.0    NaN
2         NaN  234.0  1345.0     NaN  13.0   NaN  999.0
3         NaN    NaN     NaN  5332.0   NaN   NaN    NaN
4         NaN   12.0     NaN     NaN   NaN   NaN    NaN
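A self-contained sketch of the same idea: selecting the 'value' column before unstacking avoids the extra column level, and the reindex call (assuming the full 1-20 / A-L grid from the question) restores the all-empty rows and columns. Both tweaks are additions for illustration, not part of the one-liner above:

wide = entries.set_index(['row', 'column'])['value'].unstack()

# Optionally restore the full grid, including all-empty rows/columns.
wide = wide.reindex(index=range(1, 21), columns=list('ABCDEFGHIJKL'))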