Iterating over a dataframe, appending values to a dictionary until a value is NaN and then start a n

I have a dataframe with three columns. The first holds a value "x" that repeatedly rises and falls in a cycle, like [1, 2, 3, 4, 5, 4, 3, 2, 1, 2, 3, ...]. The second holds the "time". The third holds the "measurement": for most rows this is a float, but near the start and end of each cycle it is NaN for a few rows, and not always at the same "x" values. Example data (sorry for the long code, I figured an example would make this easier):

import numpy as np
import pandas as pd

nan = np.nan  # marks rows without a measurement
df = pd.DataFrame([[1, 'time1', nan], [2, 'time2', nan], [3, 'time3', 3], [4, 'time4', 5],
                   [5, 'time5', 4], [6, 'time6', nan], [5, 'time7', nan], [4, 'time8', 3],
                   [3, 'time9', 2], [2, 'time10', nan], [1, 'time11', nan], [2, 'time12', 4],
                   [3, 'time13', 5], [4, 'time14', 6], [5, 'time15', 3], [6, 'time16', nan],
                   [5, 'time17', nan], [4, 'time18', 3], [3, 'time19', 3]],
                  columns=['x', 'time', 'measurement'])

I would like to iterate over this dataframe and build a dictionary in which each key is the first "time" value of a run of rows whose "measurement" is not NaN, and each value is a pair of lists: the "x" values and the corresponding "measurement" values of that run. Each time a NaN "measurement" is encountered, the current entry ends, and a new entry starts at the next row where the "measurement" is no longer NaN. For my example data the end result would look something like this:

nice_dict = {'time3':([3, 4, 5],[3, 5, 4]), 'time8':([4, 3],[3, 2]), 'time12':([2, 3, 4, 5],[4, 5, 6, 3]) .......}

Can anyone help me out with this problem?

CodePudding user response：

You can use a groupby operation on groups of successive non-NA values:

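# mask of the rows that have a measurement
m = df['measurement'].notna()

# (~m).cumsum() increases at every NaN row, so each run of successive
# non-NA rows shares one group id; keep only the non-NA rows, group by that id,
# then use the first time as the key and the x / measurement lists as the value
nice_dict = {g['time'].iloc[0]: tuple(g[['x', 'measurement']]
                                      .T.convert_dtypes()
                                      .values.tolist())
             for _, g in df[m].groupby((~m).cumsum())}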

output:

{'time3': ([3, 4, 5], [3, 5, 4]),
 'time8': ([4, 3], [3, 2]),
 'time12': ([2, 3, 4, 5], [4, 5, 6, 3]),
 'time18': ([4, 3], [3, 3])}
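
To see why the grouping key works, it can help to print the intermediate columns. The following is only a small sketch on a shortened, made-up version of the data (the frame below and the names `group_id`, `debug` and `kept_ids` are assumptions for illustration, not part of the question):

```python
import numpy as np
import pandas as pd

# Shortened, made-up sample in the same shape as the question's data
df = pd.DataFrame({
    'x': [1, 2, 3, 4, 5, 4],
    'time': ['time1', 'time2', 'time3', 'time4', 'time5', 'time6'],
    'measurement': [np.nan, 3.0, 5.0, np.nan, np.nan, 2.0],
})

m = df['measurement'].notna()

# Every NaN row adds 1 to the running count, so the valid rows
# between two gaps all end up with the same id.
group_id = (~m).cumsum()

# Side-by-side view of the mask and the resulting ids
debug = df.assign(valid=m, group_id=group_id)
print(debug)

# Ids of the rows that survive the df[m] filter: one id per run
kept_ids = group_id[m].tolist()
```

Here 'time2' and 'time3' share one id and 'time6' gets another, which is exactly the split the groupby relies on. The ids of the NaN rows themselves never matter, because those rows are filtered out before grouping.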

CodePudding user response：

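Alternatively, drop the NaN rows, group the remaining rows by the running count of NaNs, aggregate each column into a list, and zip the first "time" of each group with the remaining lists:

# mask of the rows without a measurement
m = df['measurement'].isna()
# drop the NaN rows and collect every column of each run into a list
s = df[~m].groupby(m.cumsum()).agg(list)
# key: first time of each run; value: [x list, measurement list]
d = dict(zip(s.time.str[0], s.drop(['time'], axis=1).values.tolist()))

output:
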
{'time3': [[3, 4, 5], [3.0, 5.0, 4.0]],
 'time8': [[4, 3], [3.0, 2.0]],
 'time12': [[2, 3, 4, 5], [4.0, 5.0, 6.0, 3.0]],
 'time18': [[4, 3], [3.0, 3.0]]}
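
If an explicit loop is easier to follow, as the question originally asked for, the same result can be built row by row. This is a minimal sketch, not a replacement for the groupby versions: `runs_to_dict` is a hypothetical helper name, the sample frame is made up, and it assumes the "time" labels are unique (a repeated label would overwrite an earlier entry).

```python
import numpy as np
import pandas as pd


def runs_to_dict(df):
    """Hypothetical helper: collect runs of consecutive non-NaN measurements."""
    result = {}
    current_key = None  # time label of the run being filled; None while in a gap
    for row in df.itertuples(index=False):
        if pd.isna(row.measurement):
            current_key = None  # a NaN closes the current run
            continue
        if current_key is None:  # first valid row after a gap starts a new run
            current_key = row.time
            result[current_key] = ([], [])
        xs, values = result[current_key]
        xs.append(row.x)
        values.append(row.measurement)
    return result


# Shortened, made-up sample in the same shape as the question's data
df = pd.DataFrame({
    'x': [1, 2, 3, 4, 5, 4],
    'time': ['time1', 'time2', 'time3', 'time4', 'time5', 'time6'],
    'measurement': [np.nan, 3.0, 5.0, np.nan, np.nan, 2.0],
})

loop_dict = runs_to_dict(df)
print(loop_dict)
```

The loop is likely slower than the groupby approaches on large frames, but it makes the "start a new entry after each NaN" rule explicit.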