Dividing one dataframe by another in Python using pandas with float values

I have two separate data frames named df1 and df2 as shown below:

    Scaffold  Position  Ref_Allele_Count  Alt_Allele_Count  Coverage_Depth  Alt_Allele_Frequency
0          1        11                 7                51              58              0.879310
1          1        16                20                95             115              0.826087
2          2         9                 9                33              42              0.785714
3          2        12                86                51             137              0.372263
4          2        67                41                98             139              0.705036
5          3         8                 0                 0               0              0.000000
6          4        99                32                26              58              0.448276
7          4       101               100                24             124              0.193548
8          4       115                69                26              95              0.273684
9          5         6                40                57              97              0.587629
10         5        19                53                87             140              0.621429

    Scaffold  Position  Ref_Allele_Count  Alt_Allele_Count  Coverage_Depth  Alt_Allele_Frequency
0          1        11                 7                64              71              0.901408
1          1        16                10                90             100              0.900000
2          2         9                79                86             165              0.521212
3          2        12                12                73              85              0.858824
4          2        67                54                96             150              0.640000
5          3         8                 0                 0               0              0.000000
6          4        99                86                28             114              0.245614
7          4       101                32                25              57              0.438596
8          4       115                97                16             113              0.141593
9          5         6                86                43             129              0.333333
10         5        19                59                27              86              0.313953

I have already summed the allele count and Coverage_Depth columns of df1 and df2, but I need to divide the resulting Alt_Allele_Count by the resulting Coverage_Depth to find the total allele frequency (AF). When I tried to convert the two variables to floats before dividing, I got the error message: TypeError: float() argument must be a string or a number, not 'DataFrame'. When I left them as DataFrames, I got this table:

    Alt_Allele_Count  Coverage_Depth
0                NaN             NaN
1                NaN             NaN
2                NaN             NaN
3                NaN             NaN
4                NaN             NaN
5                NaN             NaN
6                NaN             NaN
7                NaN             NaN
8                NaN             NaN
9                NaN             NaN
10               NaN             NaN

My code so far:

import csv
import pandas as pd
import numpy as np

df1 = pd.read_csv('C:/Users/Tom/Python_CW/file_pairA_1.csv')
df2 = pd.read_csv('C:/Users/Tom/Python_CW/file_pairA_2.csv')
print(df1)
print(df2)


Ref_Allele_Count = (df1[['Ref_Allele_Count']] + df2[['Ref_Allele_Count']])
print(Ref_Allele_Count)

Alt_Allele_Count = (df1[['Alt_Allele_Count']] + df2[['Alt_Allele_Count']])
print(Alt_Allele_Count)

Coverage_Depth = (df1[['Coverage_Depth']] + df2[['Coverage_Depth']]).astype(float)
print(Coverage_Depth)

AF = Alt_Allele_Count / Coverage_Depth

print(AF)

CodePudding user response：

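The problem comes from the difference between a pandas Series and a DataFrame. A Series is a one-dimensional structure, like a single column, while a DataFrame is a two-dimensional object, like a table. Arithmetic between two Series matches values by row index and returns a new Series. Arithmetic between two DataFrames also matches on column labels, so dividing a DataFrame whose only column is Alt_Allele_Count by one whose only column is Coverage_Depth finds no columns in common and returns NaN everywhere, which is the table you got.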
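As a rough illustration of the all-NaN table in the question, here is a minimal sketch. The two small frames are made-up stand-ins for the CSV files, reusing the question's column names and its first two rows:

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files (rows 0 and 1 from the question).
df1 = pd.DataFrame({"Alt_Allele_Count": [51, 95], "Coverage_Depth": [58, 115]})
df2 = pd.DataFrame({"Alt_Allele_Count": [64, 90], "Coverage_Depth": [71, 100]})

# Double brackets keep each piece as a single-column DataFrame.
alt_df = df1[["Alt_Allele_Count"]] + df2[["Alt_Allele_Count"]]
cov_df = df1[["Coverage_Depth"]] + df2[["Coverage_Depth"]]

# The sums themselves are fine, because both sides share a column label.
print(alt_df)

# The division aligns on column labels; the two frames share none,
# so every cell of the result is NaN.
nan_result = alt_df / cov_df
print(nan_result)
```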

Taking slices of a dataframe can either result in a series or dataframe object depending on how you do it:

df['column_name'] -> Series
df[['column_name', 'column_2']] -> Dataframe
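
A quick way to check which one you have is to look at the type and shape. This is only a sketch on a made-up two-row frame, not the data from the CSV files:

```python
import pandas as pd

# Hypothetical two-row frame using column names from the question.
df = pd.DataFrame({"Alt_Allele_Count": [51, 95], "Coverage_Depth": [58, 115]})

single = df["Alt_Allele_Count"]    # one pair of brackets: a column label
double = df[["Alt_Allele_Count"]]  # two pairs: a list of column labels

print(type(single).__name__, single.shape)  # Series (2,)
print(type(double).__name__, double.shape)  # DataFrame (2, 1)
```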

So in the line:

Ref_Allele_Count = (df1[['Ref_Allele_Count']] + df2[['Ref_Allele_Count']])

df1[['Ref_Allele_Count']] is a single-column DataFrame rather than a Series. Using single brackets instead:

Ref_Allele_Count = (df1['Ref_Allele_Count'] + df2['Ref_Allele_Count'])

gives a Series. Do the same for the other columns you're adding together, and Alt_Allele_Count / Coverage_Depth will then divide row by row.
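
Putting that together for the allele frequency, a minimal sketch could look like the following. The frames are built inline from rows 0, 1 and 5 of the question instead of being read from the CSV files, and the names alt_total, depth_total and af are just illustrative:

```python
import pandas as pd

# Hypothetical stand-ins for the CSV files: rows 0, 1 and 5 from the question.
df1 = pd.DataFrame({"Alt_Allele_Count": [51, 95, 0], "Coverage_Depth": [58, 115, 0]})
df2 = pd.DataFrame({"Alt_Allele_Count": [64, 90, 0], "Coverage_Depth": [71, 100, 0]})

# Single brackets give Series, so the sums and the division work row by row.
alt_total = df1["Alt_Allele_Count"] + df2["Alt_Allele_Count"]
depth_total = df1["Coverage_Depth"] + df2["Coverage_Depth"]
af = alt_total / depth_total

print(af)
```

The last row has zero coverage in both files, so its frequency is 0 / 0 and comes out as NaN. Whether to leave it as NaN or replace it (for example with af.fillna(0)) depends on how you want to treat uncovered positions.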

CodePudding user response：

This can be fixed by using only one set of brackets '[]' when referring to a column in a pandas DataFrame, rather than two.