I have data that is in long form. It looks like this
Analysis Mode SID Run ID Accession ID Analyte Concentration
QC - 111522 QC L Be 0.01
QC - 111522 QC M Be 0.04
SAMPLE EM123 111522 E2214172001 Be 1.20
SAMPLE EM124 111522 E2214172002 Be 1.40
............
QC - 111522 QC L Be 0.02
QC - 111522 QC M Be 0.05
............
QC - 111522 QC L V 0.1
QC - 111522 QC M V 0.2
SAMPLE EM123 111522 E2214172001 V 1.0
SAMPLE EM124 111522 E2214172002 V 1.3
............
QC - 111522 QC L V 0.1
QC - 111522 QC M V 0.3
In total the df is size (12879,8)
. There are 23
different analytes. I want to pivot the data so that the analytes are columns
and the values are concentration
columns= [Analysis Mode , SID , Run ID ,Accession ID, Be, V]
This issue is there are duplicate IDs- In the example above QC L was analyzed for Be in run 111522 twice. I need to retain these numbers exactly meaning I do not want to average or do anything to them. It needs to be its own row
The output df I want would look something like this
Analysis Mode SID Run ID Accession ID Be V
QC - 111522 QC L 0.01 0.1
QC - 111522 QC M 0.04 0.2
SAMPLE EM123 111522 E2214172001 1.20 1.0
SAMPLE EM124 111522 E2214172002 1.40 1.3
............
QC - 111522 QC L 0.02 0.1
QC - 111522 QC M 0.05 0.3
I tried using pivot and piviot table but I keep getting duplicate index errors. I also trying using groupby and cumsum but it also gives me duplicate index errors
CodePudding user response:
Input df:
AnalysisMode SID RunID AccessionID Analyte Concentration
0 QC - 111522 QCL Be 0.01
1 QC - 111522 QCM Be 0.04
2 SAMPLE EM123 111522 E2214172001 Be 1.20
3 SAMPLE EM124 111522 E2214172002 Be 1.40
4 QC - 111522 QCL Be 0.02
5 QC - 111522 QCM Be 0.05
6 QC - 111522 QCL V 0.10
7 QC - 111522 QCM V 0.20
8 SAMPLE EM123 111522 E2214172001 V 1.00
9 SAMPLE EM124 111522 E2214172002 V 1.30
10 QC - 111522 QCL V 0.10
11 QC - 111522 QCM V 0.30
The following worked for me, but I feel like there should be a better way (I couldn't figure out how though :) ):
import pandas as pd
df = pd.read_clipboard() # Put your df here
# Note: I edited the columns and values a bit to not have spaces
columns = ["AnalysisMode", "SID", "RunID", "AccessionID"]
dupes = df.groupby(columns ["Analyte"]).cumcount()
out = df.pivot_table(values="Concentration", columns="Analyte", index=columns [dupes]).droplevel(level=4).reset_index()
output df:
Analyte AnalysisMode SID RunID AccessionID Be V
0 QC - 111522 QCL 0.01 0.1
1 QC - 111522 QCL 0.02 0.1
2 QC - 111522 QCM 0.04 0.2
3 QC - 111522 QCM 0.05 0.3
4 SAMPLE EM123 111522 E2214172001 1.20 1.0
5 SAMPLE EM124 111522 E2214172002 1.40 1.3