Home > database >  How to accumulate single cells from repeated rows in pandas
How to accumulate single cells from repeated rows in pandas

Time:04-13

This is my first post, and after some problems I fixed them. Point was, I had a CSV which I wanted to modify, relocating a cell only when the next row is repeated, in order to accumulate the values of the repeated values in only one row. If you want to work it several times repeated, you'll have to execute it that same amount of times. If someone could improve it, it would be amazing as well.

import pandas as pd
#df = pd.read_csv('MyCSV.csv', sep='\t', index_col=False, header=0)
#In my case, csv didn't work because of settings from read_csv
#Which looks like next serie:

data = [['LINF_01000000', 'LINJ.1:11111:222222:-', 'N/A', 'N/A', 'N/A'], ['LINF_01000000', 'LINJ.1:122222:333333:-', 'N/A', 'N/A', 'N/A'], ['LINF_01000000', 'N/A', 'N/A', 'N/A', 'N/A'], ['LINF_02000000', 'LINJ.1:10011:2211122:-', 'N/A', 'N/A', 'N/A'], ['LINF_03000000', 'LINJ.1:14441:666222:-', 'N/A', 'N/A', 'N/A'], ['LINF_03000000', 'LINJ.1:77435:2543522:-', 'N/A', 'N/A', 'N/A'], ['LINF_03000000', 'LINJ.1:544351:22543542:-', 'N/A', 'N/A', 'N/A']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['LINF_MIX', 'Accession', 'Accession2', 'Accession3', 'Accession4'])
 #and then start my script here to execute:
column0 = df.iloc[:, 0]
index_Linf_Rep = 2
for LINF in column0: 
    index_Linf = index_Linf_Rep - 1
    LINF = df.iloc[index_Linf, 0]
    LINF_Rep = df.iloc[index_Linf_Rep, 0]
    if LINF == LINF_Rep:
        coordenada_repet = df.iloc[index_Linf_Rep, 1]
        coordenada_repet2 = df.iloc[index_Linf_Rep, 2]
        coordenada_repet3 = df.iloc[index_Linf_Rep, 3]
        coordenada_repet4 = df.iloc[index_Linf_Rep, 4]
        coordenada_repet5 = df.iloc[index_Linf_Rep, 5]
        coordenada_intr = df.iloc[index_Linf, 2]
        coordenada_intr2 = df.iloc[index_Linf, 3]
        coordenada_intr3 = df.iloc[index_Linf, 4]
        coordenada_intr4 = df.iloc[index_Linf, 5]
        coordenada_intr5 = df.iloc[index_Linf, 6]
        df.iat[index_Linf, 2] = coordenada_repet
        print("Hay matches")
        if coordenada_intr == coordenada_repet:
            df.iat[index_Linf, 3] = coordenada_repet2
        elif coordenada_intr2 == coordenada_repet2: 
            df.iat[index_Linf, 4] = coordenada_repet3
        elif coordenada_intr3 == coordenada_repet3: 
            df.iat[index_Linf, 5] = coordenada_repet4
            print(LINF   ' se ha repetido 4 veces')
        elif coordenada_intr4 == coordenada_repet4: 
            df.iat[index_Linf, 6] = coordenada_repet5
            print(LINF   ' se ha repetido 5 veces')
    else: 
        print('No hay match en estos dos LINF: ')
        print(LINF   ' '   LINF_Rep)
    index_Linf_Rep  = 1
        if index_Linf_Rep == 9427:
            break
#
df.to_csv('MyCSV.csv', index=False)

Thank you.

CodePudding user response:

I don't understand what you are trying to accomplish. Do you want to drop repeated rows or do you want to add them together/ concatenate the strings.

It's also a bit confusing because here:

coordenada_repet = df.iloc[index_ni, 1]
coordenada_intr = df.iloc[index_i, 2]

You are getting the value of the second row and second column and then the value of the first row and third column, I don't know if this is right for you. I'd encourage you to look into pandas.Series.fill (method="ffill"), pandas.DataFrame.where (or np.where) and pandas.Series.shift

You can use a combination of creating a column that is shifted one row and loop over original and shifted to get the repeated values.

In whatever case, I believe the df.at function takes square brackets like this:

df.at[index_name, column_name]

And the df.iat function looks like this:

df.iat[index_index,column_index]

At reference here and iat reference here

In your code you are using at() so maybe try adjusting that. Do let me know about my question above, as there is likely a simpler solution

EDIT: Vectorized code

df_shifted = df.shift(-1)
new_df = df.copy()
new_df["Accession2"] = np.where(df["LINF_MIX"] == 
df_shifted["LINF_MIX"], df_shifted["Accession"], df["Accession2"])
new_df["Accession4"] = np.where((df["Accession2"] == "N/A") & 
(["Accession3"] != "N/A"), df_shifted["Accession3"], df["Accession4"])
new_df["Accession4"] = new_df["Accession4"].fillna("N/A")
new_df["Accession3"] = np.where(df["Accession2"] != "N/A", df_shifted["Accession2"], df["Accession2"])
df.equals(new_df)

You can try the vectorized code. From the error message you posted I believe the error is coming from one of the columns having the wrong data type. That or you're trying to access an iloc using a string somewhere.

  • Related