Home > Enterprise >  pandas: import csv & pivot on duplicate column names
pandas: import csv & pivot on duplicate column names

Time:10-16

I have a csv file representing measurements done to the matrix:

 A B C
1
2
3 

in the csv:

  • The first 2 columns ("Col" & "row") represent the coordinates of the matrix
  • The rest of the columns represent measurement setup (0, 1 or 2)

Since measurements are repeated the setup is repeated: (0,1,2,1,0) columns repeat.

Here is the datasets


Col row 0   1     2    1    0 #these represent measurement setup
1   A   1   0.7   0.5  0.4  0.5
2   B   1   0.4   0.2  0.1  0.1
3   C   1   0.3   0.2  0.1  0.1

i want to:

  • import file
  • pivot to create columns "col - row - setup - iteration"

where setup represents the measurement setup (0,1,2,1,0)

and iteration counts the measurements done for a single pixel (col,row pair) (1,2,3,4,5)

The final dateset should look like this:

       
Col row setup iteration  value   
1   A   0     1          1
1   A   1     2          0.7
1   A   2     3          0.5
1   A   1     4          0.4
1   A   0     5          0.5
2   B   0     1          1
2   B   1     2          0.4
2   B   2     3          0.2
2   B   1     4          0.1
2   B   0     5          0.1
.....etc

CodePudding user response:

You want to unpivot or melt your dataframe:

df_orig = pd.read_csv('file.csv')
df = df_orig.melt(['Col', 'row'], var_name='setup')

It gives:

    Col row setup  value
0     1   A     0    1.0
1     2   B     0    1.0
2     3   C     0    1.0
3     1   A     1    0.7
4     2   B     1    0.4
5     3   C     1    0.3
6     1   A     2    0.5
7     2   B     2    0.2
8     3   C     2    0.2
9     1   A   1.1    0.4
10    2   B   1.1    0.1
11    3   C   1.1    0.1
12    1   A   0.1    0.5
13    2   B   0.1    0.1
14    3   C   0.1    0.1

Because by default (or for older version) read_csv has mangled duplicate column name, we have to demangle the setup column:

df['setup'] = df['setup'].replace('\\..*', '', regex=True)

Just add an iteration column:

df['iteration'] = 1   df.index // len(df_orig)

and sort the dataframe:

df.sort_values(['Col', 'row', 'iteration']).reset_index(drop=True)

to get:

    Col row setup  value  iteration
0     1   A     0    1.0          0
1     1   A     1    0.7          1
2     1   A     2    0.5          2
3     1   A     1    0.4          3
4     1   A     0    0.5          4
5     2   B     0    1.0          0
6     2   B     1    0.4          1
7     2   B     2    0.2          2
8     2   B     1    0.1          3
9     2   B     0    0.1          4
10    3   C     0    1.0          0
11    3   C     1    0.3          1
12    3   C     2    0.2          2
13    3   C     1    0.1          3
14    3   C     0    0.1          4
  • Related