So i have data in the following dataframe:
id | hours1 | hours2 | id_2 | status | timeweeks | code1 | code2 | code3 | |
---|---|---|---|---|---|---|---|---|---|
858 | 2.087e 06 | 105 | 14 | 1.057e 06 | 1 | 96.7143 | nan | nan | 0.5 |
505 | 2.08697e 06 | 0 | 28 | 1.057e 06 | 1 | 111.143 | nan | 0.5 | nan |
245 | 2.08695e 06 | 32 | 431.5 | 1.057e 06 | 1 | 178.571 | nan | nan | 5 |
620 | 2.08698e 06 | 10 | 0 | 1.057e 06 | 1 | 103.571 | nan | 0.5 | nan |
177 | 1.90024e 06 | 60 | 57 | 1.057e 06 | 1 | 37.7143 | nan | nan | 0.5 |
828 | 2.08699e 06 | 112 | 0 | 1.057e 06 | 1 | 129.714 | nan | nan | 0.5 |
63 | 1.58152e 06 | 1252 | 65.25 | 1.057e 06 | 1 | 94 | nan | nan | 3 |
479 | 2.08697e 06 | 0 | 56 | 1.057e 06 | 1 | 62.4286 | nan | nan | 0 |
251 | 2.08695e 06 | 32 | 431.5 | 1.057e 06 | 1 | 178.571 | nan | 4 | nan |
673 | 2.08698e 06 | 0 | 7 | 1.057e 06 | 1 | 103.571 | nan | nan | 0.5 |
310 | 2.08695e 06 | 105 | 53 | 1.057e 06 | 1 | 58 | nan | nan | 0.5 |
336 | 2.08696e 06 | 77 | 77 | 1.057e 06 | 1 | 113.286 | nan | nan | 0.5 |
731 | 2.08699e 06 | 229.25 | 105.75 | 1.057e 06 | 1 | 116.286 | nan | 5 | nan |
72 | 1.58152e 06 | 1252 | 65.25 | 1.057e 06 | 1 | 94 | nan | nan | 0.5 |
800 | 2.08699e 06 | 112 | 0 | 1.057e 06 | 1 | 129.714 | nan | nan | 0.5 |
674 | 2.08698e 06 | 0 | 7 | 1.057e 06 | 1 | 103.571 | nan | nan | 0.5 |
402 | 2.08696e 06 | 0 | 7 | 1.057e 06 | 1 | 103.571 | nan | nan | 0.5 |
606 | 2.08698e 06 | 10 | 0 | 1.057e 06 | 1 | 103.571 | nan | nan | 0.5 |
804 | 2.08699e 06 | 112 | 0 | 1.057e 06 | 1 | 129.714 | nan | nan | 0.5 |
513 | 2.08697e 06 | 0 | 28 | 1.057e 06 | 1 | 111.143 | nan | 0.5 | nan |
and I basically want it to be in shape that in one row is data of one id. So in one row there would be only one value of the next columns: id, hours1, hours2, id_2, status and timeweeks. And then every code value of one id would be its own column. Or if its somehow possible, only the notnull value of each rows three code columns would be column. So the final dataframe should look like this:
id | hours1 | hours2 | id_2 | status | timeweeks | code1_1 | code2_1 | code3_1 | code3_2 | and so on.. |
---|---|---|---|---|---|---|---|---|---|---|
1 | 105 | 200 | 1 | 1 | 50 | 1 | 2 | 1 | 5 | |
2 | 300 | 40 | 1 | 1 | 33 | 3 | 4 | 1 | 0 | |
3 | 20 | 30 | 2 | 5 | 20 | 0 | 0.5 | 2 | 2 |
Don't really know if it's even possible this way, but I want to think it is.
So what I tried was turning them into dict and then after that back to dataframe.
I tested this:
df_test2 = df_2.groupby(['id','id2','hours1','hours2', 'status','timeweeks'])[['code1','code2','code3']].apply(lambda g: g.values.tolist()).to_dict()
and got result (one item):
{(1564719, 1057033.0, 407.5, 123.5, 99.71428406084657, 1.0): [[nan, nan, 0.5], [nan, nan, 1.0], [nan, nan, 4.0], [nan, nan, 2.0], [nan, nan, 2.0], [nan, nan, 2.0], [nan, nan, 4.0], [nan, nan, 2.0], [nan, nan, 3.0], [nan, nan, 2.0], [nan, nan, 1.0], [nan, nan, 4.0], [nan, nan, 5.0], [nan, nan, 2.0], [nan, nan, 4.0], [nan, nan, 2.0], [nan, nan, 2.0], [nan, nan, 2.0], [nan, nan, 2.0], [nan, nan, 2.0], [nan, 1.0, nan], [nan, nan, 0.5], [nan, nan, 0.5], [nan, nan, 0.5], [nan, nan, 0.5], [nan, nan, 0.5], [nan, nan, 0.5], [nan, nan, 0.5], [nan, 4.0, nan], [nan, nan, 0.5], [nan, nan, 0.5], [nan, nan, 0.5], [nan, nan, 0.5], [nan, nan, 0.5], [nan, nan, 0.5], [nan, nan, 0.5], [nan, 1.0, nan], [nan, nan, 0.5], [nan, nan, 0.5], [nan, nan, 0.5], [nan, nan, 0.5], [nan, nan, 0.5], [nan, nan, 0.5], [nan, nan, 0.5], [nan, 4.0, nan], [nan, nan, 0.5], [nan, nan, 0.5], [nan, nan, 0.5], [nan, nan, 0.5], [nan, nan, 0.5], [nan, nan, 0.5], [nan, nan, 0.5], [nan, nan, 0.5]]
after that put it to dataframe like this:
testframe = pd.DataFrame.from_dict(df_test3,orient='index')
It looks like this:
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 | 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | 128 | 129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 | 148 | 149 | 150 | 151 | 152 | 153 | 154 | 155 | 156 | 157 | 158 | 159 | 160 | 161 | 162 | 163 | 164 | 165 | 166 | 167 | 168 | 169 | 170 | 171 | 172 | 173 | 174 | 175 | 176 | 177 | 178 | 179 | 180 | 181 | 182 | 183 | 184 | 185 | 186 | 187 | 188 | 189 | 190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 | 200 | 201 | 202 | 203 | 204 | 205 | 206 | 207 | 208 | 209 | 210 | 211 | 212 | 213 | 214 | 215 | 216 | 217 | 218 | 219 | 220 | 221 | 222 | 223 | 224 | 225 | 226 | 227 | 228 | 229 | 230 | 231 | 232 | 233 | 234 | 235 | 236 | 237 | 238 | 239 | 240 | 241 | 242 | 243 | 244 | 245 | 246 | 247 | 248 | 249 | 250 | 251 | 252 | 253 | 254 | 255 | 256 | 257 | 258 | 259 | 260 | 261 | 262 | 263 | 264 | 265 | 266 | 267 | 268 | 269 | 270 | 271 | 272 | 273 | 274 | 275 | 276 | 277 | 278 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
(1564719, 1057033.0, 407.5, 123.5, 99.71428406084657, 1.0) | [nan, nan, 0.5] | [nan, nan, 1.0] | [nan, nan, 4.0] | [nan, nan, 2.0] | [nan, nan, 2.0] | [nan, nan, 2.0] | [nan, nan, 4.0] | [nan, nan, 2.0] | [nan, nan, 3.0] | [nan, nan, 2.0] | [nan, nan, 1.0] | [nan, nan, 4.0] | [nan, nan, 5.0] | [nan, nan, 2.0] | [nan, nan, 4.0] | [nan, nan, 2.0] | [nan, nan, 2.0] | [nan, nan, 2.0] | [nan, nan, 2.0] | [nan, nan, 2.0] | [nan, 1.0, nan] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, 4.0, nan] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, 1.0, nan] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, 4.0, nan] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
(1581517, 1057003.0, 1252.0, 65.25, 93.99999834656084, 1.0) | [nan, nan, nan] | [nan, nan, 3.0] | [nan, nan, 3.0] | [nan, nan, 5.0] | [nan, nan, 3.0] | [nan, nan, 5.0] | [nan, nan, 5.0] | [nan, nan, 5.0] | [nan, nan, 3.0] | [nan, nan, 3.0] | [nan, nan, 3.0] | [nan, nan, 5.0] | [nan, 3.0, nan] | [nan, 3.0, nan] | [nan, 3.0, nan] | [nan, 3.0, nan] | [nan, 3.0, nan] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 0.5] | [nan, nan, 1.0] | [nan, 3.0, nan] | [nan, nan, 5.0] | [nan, nan, 3.0] | [nan, nan, 5.0] |
which is not exactly what I was aiming for. So if there is way to make dataframe look like in the example, that would be my goal.
CodePudding user response:
Use DataFrame.stack
for remove misisng values after DataFrame.set_index
, then create helper columns and level in MultiIndex
by GroupBy.cumcount
and last reshape by Series.unstack
:
cols = ['id','id_2','hours1','hours2', 'status','timeweeks']
df1 = df.set_index(cols).stack().to_frame('code')
df1 = df1.set_index(df1.groupby(df1.index).cumcount().add(1), append=True)['code'].unstack([-1,-2])
df1.columns = df1.columns.map(lambda x: f'{x[1]}_{x[0]}')
df1 = df1.reset_index()
print (df1)
id id_2 hours1 hours2 status timeweeks code3_1 \
0 1581520.0 1057000.0 1252.00 65.25 1 94.0000 3.0
1 1900240.0 1057000.0 60.00 57.00 1 37.7143 0.5
2 2086950.0 1057000.0 32.00 431.50 1 178.5710 5.0
3 2086950.0 1057000.0 105.00 53.00 1 58.0000 0.5
4 2086960.0 1057000.0 0.00 7.00 1 103.5710 0.5
5 2086960.0 1057000.0 77.00 77.00 1 113.2860 0.5
6 2086970.0 1057000.0 0.00 28.00 1 111.1430 NaN
7 2086970.0 1057000.0 0.00 56.00 1 62.4286 0.0
8 2086980.0 1057000.0 0.00 7.00 1 103.5710 0.5
9 2086980.0 1057000.0 10.00 0.00 1 103.5710 0.5
10 2086990.0 1057000.0 112.00 0.00 1 129.7140 0.5
11 2086990.0 1057000.0 229.25 105.75 1 116.2860 NaN
12 2087000.0 1057000.0 105.00 14.00 1 96.7143 0.5
code2_1 code3_2 code3_3 code2_2
0 NaN 0.5 NaN NaN
1 NaN NaN NaN NaN
2 4.0 NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 0.5 NaN NaN 0.5
7 NaN NaN NaN NaN
8 NaN 0.5 NaN NaN
9 0.5 NaN NaN NaN
10 NaN 0.5 0.5 NaN
11 5.0 NaN NaN NaN
12 NaN NaN NaN NaN