I have below dataframe
CVE ID Product Versions
0 CVE-2022-46689 Mac OS 12
1 CVE-2022-42856 Safari
2 CVE-2022-46689 Windows 10 21h1
3 CVE-2022-41121 Windows 10 21h2
4 CVE-2022-42856 Safari
I would like to remove duplicates based on the column CVE ID
but also want to make sure that I store the value present in the 2nd column Product Versions
(but remove the value if already present)
Something like this below:
CVE ID Product Versions
0 CVE-2022-46689 Mac OS 12, Windows 10 21h1
1 CVE-2022-42856 Safari
2 CVE-2022-41121 Windows 10 21h2
How should I do it?
Any help is appreciated
CodePudding user response:
here is one way to do it
# drop duplicates (in memory)
# groupby CVE ID and join the resulting list of product version
out=(df.drop_duplicates(subset=['CVE ID','Product Versions'])
.groupby(['CVE ID'],as_index=False)['Product Versions']
.agg(','.join ))
out
CVE ID Product Versions
0 CVE-2022-41121 Windows 10 21h2
1 CVE-2022-42856 Safari
2 CVE-2022-46689 Mac OS 12, Windows 10 21h1
CodePudding user response:
This could work:
import pandas as pd
# Create the DataFrame
df = pd.DataFrame({'CVE ID': ['CVE-2022-46689', 'CVE-2022-42856', 'CVE-2022-46689', 'CVE-2022-41121', 'CVE-2022-42856'],
'Product Versions': ['Mac OS 12', 'Safari', 'Windows 10 21h1', 'Windows 10 21h2', 'Safari']})
# Group the rows by the 'CVE ID' column
grouped = df.groupby('CVE ID')
# Initialize an empty list to store the results
result = []
# Iterate over the groups
for name, group in grouped:
# Concatenate the values in the 'Product Versions' column
product_versions = ', '.join(list(set(group['Product Versions'])))
# Append the name and product_versions to the result list
result.append({'CVE ID': name, 'Product Versions': product_versions})
# Convert the result list to a DataFrame
result = pd.DataFrame(result)
print(result)