Select only rows that contain specific values, in relation to another column, and group them


So I have this kind of dataframe and I want to select only those websites that have TLSv1.2 and TLSv1.3 ONLY. There is already a similar question: Pandas select rows that contains both values (inclusive)

      websiteName TLSVersion strength     kex_info
1     website1.lt    TLSv1.2        A      dh 2048
2     website1.lt    TLSv1.3        A     rsa 2048
3     website2.lt    TLSv1.0        C     rsa 2048
4     website2.lt    TLSv1.1        A     rsa 2048
5     website2.lt    TLSv1.2        A     rsa 2048
6     website2.lt    TLSv1.3        A  ecdh_x25519
7     website3.lt    TLSv1.0        A      dh 4096
8     website3.lt    TLSv1.1        A      dh 4096
9     website3.lt    TLSv1.2        A      dh 4096
10    website3.lt    TLSv1.3        A  ecdh_x25519
11    website4.lt    TLSv1.0        A      dh 4096
12    website4.lt    TLSv1.1        A      dh 4096
13    website4.lt    TLSv1.2        A      dh 4096
14    website4.lt    TLSv1.3        A  ecdh_x25519

Desired output: website1.lt : True

CodePudding user response:

You can do this using isin():

import pandas as pd
df = pd.DataFrame({
    'websiteName' : ['website1.lt']*2 + ['website2.lt']*4 + ['website3.lt']*4 + ['website4.lt']*4,
    'TLSVersion' : ['TLSv1.2', 'TLSv1.3', 'TLSv1.0', 'TLSv1.1', 'TLSv1.2', 'TLSv1.3', 'TLSv1.0', 'TLSv1.1', 'TLSv1.2', 'TLSv1.3', 'TLSv1.0', 'TLSv1.1', 'TLSv1.2', 'TLSv1.3'],
    'strength' : ['A', 'A', 'C', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A'],
    'kex_info' : ['dh 2048', 'rsa 2048', 'rsa 2048', 'rsa 2048', 'rsa 2048', 'ecdh_x25519', 'dh 4096', 'dh 4096', 'dh 4096', 'ecdh_x25519', 'dh 4096', 'dh 4096', 'dh 4096', 'ecdh_x25519']
})
print(f"Sample input dataframe:\n{df}")
df2 = df[df.TLSVersion.isin(['TLSv1.2', 'TLSv1.3'])]
print(f"\nOutput with rows containing 1.2 and 1.3 only:\n{df2}")

Output

Sample input dataframe:
    websiteName TLSVersion strength     kex_info
0   website1.lt    TLSv1.2        A      dh 2048
1   website1.lt    TLSv1.3        A     rsa 2048
2   website2.lt    TLSv1.0        C     rsa 2048
3   website2.lt    TLSv1.1        A     rsa 2048
4   website2.lt    TLSv1.2        A     rsa 2048
5   website2.lt    TLSv1.3        A  ecdh_x25519
6   website3.lt    TLSv1.0        A      dh 4096
7   website3.lt    TLSv1.1        A      dh 4096
8   website3.lt    TLSv1.2        A      dh 4096
9   website3.lt    TLSv1.3        A  ecdh_x25519
10  website4.lt    TLSv1.0        A      dh 4096
11  website4.lt    TLSv1.1        A      dh 4096
12  website4.lt    TLSv1.2        A      dh 4096
13  website4.lt    TLSv1.3        A  ecdh_x25519

Output with rows containing 1.2 and 1.3 only:
    websiteName TLSVersion strength     kex_info
0   website1.lt    TLSv1.2        A      dh 2048
1   website1.lt    TLSv1.3        A     rsa 2048
4   website2.lt    TLSv1.2        A     rsa 2048
5   website2.lt    TLSv1.3        A  ecdh_x25519
8   website3.lt    TLSv1.2        A      dh 4096
9   website3.lt    TLSv1.3        A  ecdh_x25519
12  website4.lt    TLSv1.2        A      dh 4096
13  website4.lt    TLSv1.3        A  ecdh_x25519
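
If the goal is specifically to flag the websites whose full set of versions is exactly TLSv1.2 and TLSv1.3 (the desired "website1.lt : True" output), one possible follow-up, sketched here against the df built above, is to group by websiteName and compare each site's set of versions. The names wanted and only_12_13 are introduced for illustration only:

wanted = {'TLSv1.2', 'TLSv1.3'}
# True only for websites whose version set is exactly {TLSv1.2, TLSv1.3}
only_12_13 = df.groupby('websiteName')['TLSVersion'].apply(lambda s: set(s) == wanted)
print(only_12_13)
# websiteName
# website1.lt     True
# website2.lt    False
# website3.lt    False
# website4.lt    False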

CodePudding user response:

I managed to solve this by dividing it into parts. First, I merge all TLSVersion values into one row per website:

allVersionInCollumn = noDuplicates.groupby(['websiteName'])['TLSVersion'].apply(', '.join).reset_index()

allVersionInCollumn result:

    websiteName     TLSVersion
0   website1        TLSv1.0, TLSv1.1, TLSv1.2
1   website2        TLSv1.2
2   website3        TLSv1.2, TLSv1.3
3   website4        TLSv1.0, TLSv1.1, TLSv1.2, TLSv1.3

Then I used these commands to split that column back into separate columns:

splitedTLS = allVersionInCollumn["TLSVersion"].str.split(",", n=0, expand=True) 
allVersionSplited = allVersionInCollumn.join(splitedTLS)
allVersionSplited = allVersionSplited.drop(columns="TLSVersion")

allVersionSplited.head(10)

    websiteName        0        1        2     3
0   website1     TLSv1.0  TLSv1.1  TLSv1.2  None
1   website2     TLSv1.2     None     None  None
2   website3     TLSv1.2  TLSv1.3     None  None
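
One caveat worth noting: because the versions were joined with ', ' but split on ',' alone, every value after the first keeps a leading space, which can later make column labels such as ' TLSv1.1' fail exact-name lookups. A possible variant of the split step (same names as above) that also strips the whitespace:

# Hypothetical tweak: strip surrounding whitespace from every split value
# so later column labels come out as 'TLSv1.1' rather than ' TLSv1.1'.
splitedTLS = (allVersionInCollumn["TLSVersion"]
              .str.split(",", expand=True)
              .apply(lambda col: col.str.strip()))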

Then I used the following commands to sort, based on Python pandas sort multiple columns based on their values and other rows:

allVersionSplited = (allVersionSplited
    .replace({'None': pd.NA})   # assuming 'None' strings
    .melt(id_vars='websiteName')
    .assign(TLS=lambda d: d['value'])
    .pivot_table(index='websiteName', columns='TLS', values='value', aggfunc='first')
    .reset_index()
)

Dropping the previous split columns:

allVersionSplited.drop(allVersionSplited.iloc[:, 1:5], inplace=True, axis=1)

allVersionSplited.head()

TLS websiteName  TLSv1.0  TLSv1.1  TLSv1.2  TLSv1.3
0   website1     TLSv1.0      NaN      NaN      NaN
1   website2         NaN      NaN  TLSv1.2      NaN
2   website3         NaN      NaN  TLSv1.2  TLSv1.3

Then I removed the rows that still had TLSv1.0 or TLSv1.1 support, keeping only those where both columns are null:

refined = allVersionSplited[allVersionSplited["TLSv1.0"].isnull()]
refined = refined[refined["TLSv1.1"].isnull()]
refined.head()

TLS websiteName  TLSv1.0  TLSv1.1  TLSv1.2  TLSv1.3
1   website2         NaN      NaN  TLSv1.2      NaN
2   website3         NaN      NaN  TLSv1.2  TLSv1.3

Lastly, refined.shape[0] gives: 2
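
For comparison, a more compact sketch of the same idea (an alternative, not the approach above): build a website-by-version presence table with pd.crosstab and keep only the sites that have TLSv1.2 and TLSv1.3 but neither TLSv1.0 nor TLSv1.1. Here df is assumed to be the original long-format dataframe, and presence and mask are illustrative names:

import pandas as pd

# One row per website, one bool column per TLS version seen in the data.
presence = pd.crosstab(df['websiteName'], df['TLSVersion']).astype(bool)
presence = presence.reindex(columns=['TLSv1.0', 'TLSv1.1', 'TLSv1.2', 'TLSv1.3'],
                            fill_value=False)
mask = (presence['TLSv1.2'] & presence['TLSv1.3']
        & ~presence['TLSv1.0'] & ~presence['TLSv1.1'])
print(presence[mask].index.tolist())   # e.g. ['website1.lt'] for the question's sample
print(mask.sum())                      # number of such websites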
