Home > Software design >  using python I would like to remove duplicate rows based on first column but would like to keep all
using python I would like to remove duplicate rows based on first column but would like to keep all

Time:01-04

I have below dataframe

              CVE ID             Product Versions
0      CVE-2022-46689                   Mac OS 12
1      CVE-2022-42856                      Safari
2      CVE-2022-46689             Windows 10 21h1
3      CVE-2022-41121             Windows 10 21h2
4      CVE-2022-42856                      Safari

I would like to remove duplicates based on the column CVE ID but also want to make sure that I store the value present in the 2nd column Product Versions (but remove the value if already present)

Something like this below:

              CVE ID             Product Versions
0      CVE-2022-46689            Mac OS 12, Windows 10 21h1
1      CVE-2022-42856            Safari
2      CVE-2022-41121            Windows 10 21h2

How should I do it?

Any help is appreciated

CodePudding user response:

here is one way to do it

# drop duplicates (in memory)
# groupby CVE ID and join the resulting list of product version

out=(df.drop_duplicates(subset=['CVE ID','Product Versions'])
 .groupby(['CVE ID'],as_index=False)['Product Versions']
 .agg(','.join ))

out
            CVE ID  Product Versions
0   CVE-2022-41121  Windows 10 21h2
1   CVE-2022-42856  Safari
2   CVE-2022-46689  Mac OS 12, Windows 10 21h1

CodePudding user response:

This could work:

import pandas as pd

# Create the DataFrame
df = pd.DataFrame({'CVE ID': ['CVE-2022-46689', 'CVE-2022-42856', 'CVE-2022-46689', 'CVE-2022-41121', 'CVE-2022-42856'],
                'Product Versions': ['Mac OS 12', 'Safari', 'Windows 10 21h1', 'Windows 10 21h2', 'Safari']})

# Group the rows by the 'CVE ID' column
grouped = df.groupby('CVE ID')

# Initialize an empty list to store the results
result = []

# Iterate over the groups
for name, group in grouped:
    # Concatenate the values in the 'Product Versions' column
    product_versions = ', '.join(list(set(group['Product Versions'])))
    
    # Append the name and product_versions to the result list
    result.append({'CVE ID': name, 'Product Versions': product_versions})

# Convert the result list to a DataFrame
result = pd.DataFrame(result)

print(result)
  • Related