I have a dataframe as follows
Cycle | A_0 | A_1 | A_2 | A_3 | B_0 | B_1 | B_2 | B_3 |
---|---|---|---|---|---|---|---|---|
1 | 3 | 4 | 5 | 6 | 1 | 4 | 5 | |
1 | 8 | 5 | 3 | 1 | 0 | 8 | 6 | 4 |
2 | 7 | 9 | 1 | 6 | 1 | 0 | 2 | 3 |
3 | 5 | 9 | 1 | 0 | 3 | 8 | 3 |
this dataframe has to combined to two column A and B
Expected output
Cycle | A | B |
---|---|---|
1 | 3 | |
1 | 4 | 1 |
1 | 5 | 4 |
1 | 6 | 5 |
1 | 8 | 0 |
1 | 5 | 8 |
1 | 3 | 6 |
1 | 1 | 4 |
2 | 7 | 1 |
2 | 9 | 0 |
2 | 1 | 2 |
2 | 6 | 3 |
3 | 5 | 0 |
3 | 3 | |
3 | 9 | 8 |
3 | 1 | 3 |
What i did?
A = [f"A_{i}" for i in range(20)]
B = [f"B_{i}" for i in range(20)]
df2['A'] = df[A].bfill(axis=1).iloc[:, 0]
df2['B'] = df[B].bfill(axis=1).iloc[:, 0]
This line of code is givng me an output datframe by avoiding the nan. How can i get the expected output?
ADDON
added a new colum to the initial data and expected outcome
CodePudding user response:
You could use a dictionary comprehension after splitting the column names on the _
:
In [1]: import pandas as pd
...:
...: d = {
...: 'A_0': ['3', '8', '7', '5'],
...: 'A_1': ['4', '5', '9', ''],
...: 'A_2': ['5', '3', '1', '9'],
...: 'A_3': ['6', '1', '6', '1'],
...: 'B_0': ['', '0', '1', '0'],
...: 'B_1': ['1', '8', '0', '3'],
...: 'B_2': ['4', '6', '2', '8'],
...: 'B_3': ['5', '4', '3', '3'],
...: }
...:
...: df = pd.DataFrame(data=d)
...: df
Out[1]:
A_0 A_1 A_2 A_3 B_0 B_1 B_2 B_3
0 3 4 5 6 1 4 5
1 8 5 3 1 0 8 6 4
2 7 9 1 6 1 0 2 3
3 5 9 1 0 3 8 3
In [2]: df = pd.DataFrame({
...: key: pd.Series(vals.to_numpy().ravel())
...: for key, vals in df.groupby(lambda col: col.split('_')[0], axis=1)
...: })
...: df
Out[2]:
A B
0 3
1 4 1
2 5 4
3 6 5
4 8 0
5 5 8
6 3 6
7 1 4
8 7 1
9 9 0
10 1 2
11 6 3
12 5 0
13 3
14 9 8
15 1 3
CodePudding user response:
code part
columns = pd.Index(['A_0', 'A_1', 'A_2', 'A_3', 'B_0', 'B_1', 'B_2', 'B_3'], dtype='string')
values = np.array([[ 3., 4., 5., 6., np.nan, 1., 4., 5.],
[ 8., 5., 3., 1., 0., 8., 6., 4.],
[ 7., 9., 1., 6., 1., 0., 2., 3.],
[ 5., np.nan, 9., 1., 0., 3., 8., 3.]],
dtype=float)
## Or retrive from raw DataFrame if already exists
# columns = df_raw.columns
# values = df_raw.values
## Construct MultiIndex
mi = pd.MultiIndex.from_tuples((s.split("_") for s in columns))
## Construct DataFrame
df = pd.DataFrame(values, columns=mi)
## reshape: stack level=1 (2nd row) of columns to index
df_result = df.stack(level=1)
>>> df_result
A B
0 0 3.0 NaN
1 4.0 1.0
2 5.0 4.0
3 6.0 5.0
1 0 8.0 0.0
1 5.0 8.0
2 3.0 6.0
3 1.0 4.0
2 0 7.0 1.0
1 9.0 0.0
2 1.0 2.0
3 6.0 3.0
3 0 5.0 0.0
1 NaN 3.0
2 9.0 8.0
3 1.0 3.0
Explain
Steps:
Construct MultiIndex from flat Index
Pandas provides 4 builtin method to construct MultiIndex; Here use from_tuples form doc: https://pandas.pydata.org/docs/reference/api/pandas.MultiIndex.from_tuples.html
from_arrays :: input
[[x1, x2, ...], [y1, y2, ...]]
output[[x1, y1], [x2, y2], ...]
from_tuples :: input
[[x1, y1], [x2, y2], ...]
output samefrom_frame :: Transfer
DataFrames.values
toMultiIndex
from_product :: input like arrays, but zip them to output. e.g. input
[[x1, x2], [y1, y2, y3]]
output
MultiIndex([('x1', 'y1'), ('x1', 'y2'), ('x1', 'y3'), ('x2', 'y1'), ('x2', 'y2'), ('x2', 'y3')], )
Construct new DataFrame and reshape by stack
See User Guide on reshape/pivot topic: doc: https://pandas.pydata.org/docs/user_guide/reshaping.html
CodePudding user response:
If you do
as_ = df[[c for c in df.columns if c.startswith('A_')]]
bs_ = df[[c for c in df.columns if c.startswith('B_')]]
then you get the dataframes of the As and Bs
Now you can use
pd.DataFrame({
'A': as_.values.flatten(),
'B': bs_.values.flatten(),
'Cycle': df.Cycle.repeat(len(as_.columns))
})
which uses the flattened values, as well as repeat
.
There is nothing special about nan values here.