I have measurement data from two different sources which I'd like to convert to a DataFrame. However, the two sources report different kinds of values (and a different number of them per row):
import numpy as np
import pandas as pd
data_in = [
[1.1, 'A', 1,2,3],
[1.2, 'B', 10,20,30,40],
[2.1, 'A', 1.1,2.1,3.1],
[2.1, 'B', 11,21,31,41],
[3.1, 'A', 1.2,2.2,3.2],
[3.2, 'B', 12,22,32,42],
]
pd.DataFrame(data_in)
This puts the values from both sources into the same columns. Instead, the resulting DataFrame should look like this:
data_out = [
[1.1, 'A', 1,2,3],
[1.2, 'B', np.nan,np.nan,np.nan,10,20,30,40],
[2.1, 'A', 1.1,2.1,3.1],
[2.1, 'B', np.nan,np.nan,np.nan,11,21,31,41],
[3.1, 'A', 1.2,2.2,3.2],
[3.2, 'B', np.nan,np.nan,np.nan,12,22,32,42],
]
pd.DataFrame(data_out, columns=['timestamp', 'source', 'val1', 'val2', 'val3', 'par1', 'par2', 'par3', 'par4'])
Of course, I could loop over the data and manually sort each row into a dedicated DataFrame and then merge them, but I wonder if there is a more efficient or at least "nicer" way to do this using pandas.
Thanks.
Answer:
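You can blank out the value columns per source with mask/where on two copies of the frame and then merge the copies back together:
df = pd.DataFrame(data_in)
df1 = df.copy()
# df keeps the values of the 'A' rows only, df1 those of the 'B' rows
df.iloc[:,2:] = df.iloc[:,2:].mask(df[1].eq('B'))
df1.iloc[:,2:] = df1.iloc[:,2:].where(df[1].eq('B'))
# merge on timestamp and source, then drop the columns that are entirely NaN
out = df.merge(df1, on = [0,1]).dropna(axis = 1, thresh = 1)
out
Out[298]: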
     0  1  2_x  3_x  4_x   2_y   3_y   4_y   5_y
0  1.1  A  1.0  2.0  3.0   NaN   NaN   NaN   NaN
1  1.2  B  NaN  NaN  NaN  10.0  20.0  30.0  40.0
2  2.1  A  1.1  2.1  3.1   NaN   NaN   NaN   NaN
3  2.1  B  NaN  NaN  NaN  11.0  21.0  31.0  41.0
4  3.1  A  1.2  2.2  3.2   NaN   NaN   NaN   NaN
5  3.2  B  NaN  NaN  NaN  12.0  22.0  32.0  42.0
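If you also want the named columns from the question, another option is to turn each row into a dict keyed by column name and let pandas align the keys. This is only a sketch of an alternative, not the approach above: the `value_columns` mapping and the names `records` and `named` are made up for illustration, and it assumes every row starts with a timestamp and a source label that appears in the mapping.

```python
import pandas as pd

data_in = [
    [1.1, 'A', 1, 2, 3],
    [1.2, 'B', 10, 20, 30, 40],
    [2.1, 'A', 1.1, 2.1, 3.1],
    [2.1, 'B', 11, 21, 31, 41],
    [3.1, 'A', 1.2, 2.2, 3.2],
    [3.2, 'B', 12, 22, 32, 42],
]

# Hypothetical mapping: which column names belong to which source.
value_columns = {
    'A': ['val1', 'val2', 'val3'],
    'B': ['par1', 'par2', 'par3', 'par4'],
}

# One dict per row; keys a row does not have become NaN in the DataFrame.
records = [
    {'timestamp': ts, 'source': src, **dict(zip(value_columns[src], values))}
    for ts, src, *values in data_in
]

# Fix the column order explicitly: timestamp, source, then each source's values.
all_columns = ['timestamp', 'source'] + [
    name for names in value_columns.values() for name in names
]
named = pd.DataFrame(records, columns=all_columns)
print(named)
```

This still loops over the rows once in Python, but it avoids the merge and extends to a third source by adding one entry to the mapping.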