Home > Back-end >  Pandas DataFrame split numbered columns into rows by column title
Pandas DataFrame split numbered columns into rows by column title

Time:10-27

I have a DataFrame that looks something like this:

df = pd.DataFrame({
    'A': [0, 1, 2, 3, 4],
    'B': ['a', 'b', 'c', 'd', 'e'],
    'V1': [0.0, 0.1, 0.2, 0.3, 0.4],
    'V2': [1.0, 1.1, 1.2, 1.3, 1.4],
    'V3': [2.0, 2.1, 2.2, 2.3, 2.4],
    'X': ['alpha', 'beta', 'gamma', 'delta', 'epsilon'],
})
   A  B   V1   V2   V3        X
0  0  a  0.0  1.0  2.0    alpha
1  1  b  0.1  1.1  2.1     beta
2  2  c  0.2  1.2  2.2    gamma
3  3  d  0.3  1.3  2.3    delta
4  4  e  0.4  1.4  2.4  epsilon

I'd like to use the number in the V columns to spread this out into a long-form table. The number in the column label (1 for V1, 2 for V2 etc) would become a new column value in a column named "V Number" or whatever and the value would be the only "V" in that row. Something like this (I've hidden the index here as I don't care about that):

A  B  V Number    V        X
0  a      1     0.0    alpha     # Old first row, V1 value
0  a      2     1.0    alpha     # Old first row, V2 value
0  a      3     2.0    alpha     # Old first row, V3 value
1  b      1     0.1     beta     # Old second row, V1 value
1  b      2     1.1     beta     # etc...
1  b      3     2.1     beta
2  c      1     0.2    gamma
2  c      2     1.2    gamma
2  c      3     2.2    gamma
3  d      1     0.3    delta
3  d      2     1.3    delta
3  d      3     2.3    delta
4  e      1     0.4  epsilon
4  e      2     1.4  epsilon
4  e      3     2.4  epsilon

In the real DataFrame there are over 40 "V" columns, over 100 other columns and several thousand rows, so a reasonable simple and fast method would be nice! In case it helps, the column names are easy to isolate (they're actually called e.g. Test Voltage (3)', but I shortened them for the purposes of the example) with something like [i for i in df.columns if 'Test Voltage' in i].

Has anyone got any ideas of a straightforward way to do this? I've tried searching for lots of methods, but keep just finding ways to split columns with lists in the cells.

CodePudding user response:

Try with wide_to_long

out = pd.wide_to_long(df,['V'],i=['A','B','X'],j='number').reset_index()
Out[23]: 
    A  B        X  number    V
0   0  a    alpha       1  0.0
1   0  a    alpha       2  1.0
2   0  a    alpha       3  2.0
3   1  b     beta       1  0.1
4   1  b     beta       2  1.1
5   1  b     beta       3  2.1
6   2  c    gamma       1  0.2
7   2  c    gamma       2  1.2
8   2  c    gamma       3  2.2
9   3  d    delta       1  0.3
10  3  d    delta       2  1.3
11  3  d    delta       3  2.3
12  4  e  epsilon       1  0.4
13  4  e  epsilon       2  1.4
14  4  e  epsilon       3  2.4

CodePudding user response:

Use melt:

>>> df.melt(id_vars=['A', 'B', 'X'], var_name='V Number', value_name='V')
    A  B        X V Number    V
0   0  a    alpha       V1  0.0
1   1  b     beta       V1  0.1
2   2  c    gamma       V1  0.2
3   3  d    delta       V1  0.3
4   4  e  epsilon       V1  0.4
5   0  a    alpha       V2  1.0
6   1  b     beta       V2  1.1
7   2  c    gamma       V2  1.2
8   3  d    delta       V2  1.3
9   4  e  epsilon       V2  1.4
10  0  a    alpha       V3  2.0
11  1  b     beta       V3  2.1
12  2  c    gamma       V3  2.2
13  3  d    delta       V3  2.3
14  4  e  epsilon       V3  2.4

CodePudding user response:

You can also use .stack(), as follows:

(df.set_index(['A', 'B', 'X'])
   .rename_axis(columns='V Number')
   .stack()
   .reset_index(name='V')
)

Result:

    A  B        X V Number    V
0   0  a    alpha       V1  0.0
1   0  a    alpha       V2  1.0
2   0  a    alpha       V3  2.0
3   1  b     beta       V1  0.1
4   1  b     beta       V2  1.1
5   1  b     beta       V3  2.1
6   2  c    gamma       V1  0.2
7   2  c    gamma       V2  1.2
8   2  c    gamma       V3  2.2
9   3  d    delta       V1  0.3
10  3  d    delta       V2  1.3
11  3  d    delta       V3  2.3
12  4  e  epsilon       V1  0.4
13  4  e  epsilon       V2  1.4
14  4  e  epsilon       V3  2.4

If you want the V Number column to have only the number, you can use:

df2 = (df.set_index(['A', 'B', 'X'])
         .rename_axis(columns='V Number')
         .stack()
         .reset_index(name='V')
      )

df2['V Number'] = df2['V Number'].str[1:]

Result:

print(df2)

    A  B        X V Number    V
0   0  a    alpha        1  0.0
1   0  a    alpha        2  1.0
2   0  a    alpha        3  2.0
3   1  b     beta        1  0.1
4   1  b     beta        2  1.1
5   1  b     beta        3  2.1
6   2  c    gamma        1  0.2
7   2  c    gamma        2  1.2
8   2  c    gamma        3  2.2
9   3  d    delta        1  0.3
10  3  d    delta        2  1.3
11  3  d    delta        3  2.3
12  4  e  epsilon        1  0.4
13  4  e  epsilon        2  1.4
14  4  e  epsilon        3  2.4
  • Related