I've came across a bit of a challenge where I need to sanitize data in a CSV file based on the following criteria:
- If the data exists with a date, remove the one with an NA value from the file;
- If it is a duplicate, remove it; and
- If the data exists only own its own, leave it alone.
I am currently able to do both 2 and 3, however I am struggling to make a condition to capture 1 of the criteria.
Sample CSV File
Name,Environment,Available,Date
Server_A,Test,NA,NA
Server_A,Test,Yes,20/08/2022
Server_A,Test,Yes,20/09/2022
Server_A,Test,Yes,20/09/2022
Server_B,Test,NA,NA
Server_B,Test,NA,NA
Current Code So Far
import csv
input_file = 'sample.csv'
output_file = 'completed_output.csv'
with open(input_file, 'r') as inputFile, open(output_file, 'w') as outputFile:
seen = set()
for line in inputFile:
if line in seen:
continue
seen.add(line)
outputFile.write(line)
Currently, this helps with duplicates and capturing the unique values. However, I cannot work the best way to remove the row that has a repeating server. However, this may not work well because the set type is unordered, so I wasn't sure the best way to compare based on column, then filter down from there.
Any suggestions or solutions that could help me would be greatly appreciated.
Current Output So Far
Name,Environment,Available,Date
Server_A,Test,NA,NA
Server_A,Test,Yes,20/08/2022
Server_A,Test,Yes,20/09/2022
Server_B,Test,NA,NA
Expected Output
Name,Environment,Available,Date
Server_A,Test,Yes,20/08/2022
Server_A,Test,Yes,20/09/2022
Server_B,Test,NA,NA
CodePudding user response:
You can use pandas instead of manually doing all of that. I have written a short function called custom filter which takes into consideration the criteria. One area of potential bugs can be the use of pd.NA, use other np.nan or None if this doesn't work accordingly.
import pandas as pd
df = pd.read_csv('sample.csv')
df = df.drop_duplicates()
data_present = []
def custom_filter(x):
global data_present
if x[3] == pd.NA:
data_present.append(x[0])
return True
elif x[3] == pd.NA and x[0] not in data_present:
return True
else:
return False
df = df.sort_values('Date')
df = df[df.apply(custom_filter, axis = 1)]
df.to_csv('completed_output.csv')