New learner here. I have a list of data values, each labeled by a comma-delimited string that gives its position in a dataframe; think of the string as the row (say 1-20) and column (say A-L) index of the cell where the corresponding value should go. The populated dataframe would be sparse, with many empty cells. I am working with pandas for the first time on this project and am still learning the ropes.
position value
1,A 32
1,F 16
2,B 234
2,C 1345
2,E 13
2,G 999
3,D 5332
4,B 12
etc.
I have been trying various approaches, but am not satisfied. I created dummy entries for the empty cells in the completed dataframe, then iterated over the list to write each value into the correct cell (roughly sketched below). It works, but it is not elegant and it feels like a brittle solution.
I can pre-generate a dataframe and populate it, or generate a new dataframe as part of the population process: either solution would be fine. It seems like this should be a simple task, maybe even a one-liner! But I am stumped and would appreciate any pointers.
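Here is roughly what my current approach looks like (the data and names are just illustrative):
import pandas as pd
# Pre-build an all-empty frame, then write each value into its cell.
data = [("1,A", 32), ("1,F", 16), ("2,B", 234)]   # sample (position, value) pairs
rows = range(1, 21)                                       # rows 1-20
cols = [chr(c) for c in range(ord('A'), ord('L') + 1)]    # columns A-L
df = pd.DataFrame(index=rows, columns=cols)
for position, value in data:
    r, c = position.split(',')
    df.loc[int(r), c] = value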
CodePudding user response:
As you suggest, the simplest method might be a for-loop to initialize the non-empty values. You can also use numpy advanced indexing to overwrite the underlying array data. Both are shown below.
text = """\
row,column,value
1,A 32
1,F 16
2,B 234
2,C 1345
2,E 13
2,G 999
3,D 5332
4,B 12
"""
from io import StringIO
import numpy as np
import pandas as pd
# Load your data and convert the column letters to integers.
# Note: Your example data is delimited with both spaces and commas,
# which is why we need a custom regex 'sep' argument here.
entries = pd.read_csv(StringIO(text), sep='[ ,]+')
entries['icol'] = entries['column'].map(lambda c: ord(c) - ord('A'))
# Construct an empty DataFrame with the appropriate index and columns.
rows = range(1, 1 + entries['row'].max())
columns = [chr(ord('A') + i) for i in range(1 + entries['icol'].max())]
df = pd.DataFrame(index=rows, columns=columns)
# Populate the DataFrame.
for e in entries.itertuples():
    df.loc[e.row, e.column] = e.value
# Alternatively, use numpy advanced indexing
# to overwrite the underlying array data:
irows = entries['row'].values - 1
icols = entries['icol'].values
df.values[irows, icols] = entries['value'].values
Result:
A B C D E F G
1 32 NaN NaN NaN NaN 16 NaN
2 NaN 234 1345 NaN 13 NaN 999
3 NaN NaN NaN 5332 NaN NaN NaN
4 NaN 12 NaN NaN NaN NaN NaN
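One caveat: because the frame was created empty, its columns have object dtype, so the values you write in stay as Python objects. If you want numeric columns afterwards, a conversion step along these lines should work (an optional follow-up, not part of the code above):
# Convert the populated object-dtype frame to numeric columns;
# the empty cells become float NaN.
df = df.apply(pd.to_numeric)
print(df.dtypes)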
CodePudding user response:
This is a standard unstack:
entries.set_index(['row', 'column']).unstack()
where entries is defined as in @StuartBerg's answer:
entries = pd.read_csv(StringIO(text), sep='[ ,]+')
output:
value
column A B C D E F G
row
1 32.0 NaN NaN NaN NaN 16.0 NaN
2 NaN 234.0 1345.0 NaN 13.0 NaN 999.0
3 NaN NaN NaN 5332.0 NaN NaN NaN
4 NaN 12.0 NaN NaN NaN NaN NaN
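Note that unstack only produces the rows and columns that actually occur in the data. If you need the full 1-20 by A-L grid, and a flat column index instead of the extra 'value' level, something along these lines should do it (a sketch building on the entries frame above):
# Select the 'value' column before unstacking to avoid the extra
# column level, then reindex to the full 1-20 / A-L grid.
full = (
    entries.set_index(['row', 'column'])['value']
           .unstack()
           .reindex(index=range(1, 21),
                    columns=[chr(c) for c in range(ord('A'), ord('L') + 1)])
)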