Home > Back-end >  Data cleanup in Python, removing CSV rows based on a condition
Data cleanup in Python, removing CSV rows based on a condition

Time:10-13

I've came across a bit of a challenge where I need to sanitize data in a CSV file based on the following criteria:

  1. If the data exists with a date, remove the one with an NA value from the file;
  2. If it is a duplicate, remove it; and
  3. If the data exists only own its own, leave it alone.

I am currently able to do both 2 and 3, however I am struggling to make a condition to capture 1 of the criteria.

Sample CSV File

Name,Environment,Available,Date
Server_A,Test,NA,NA
Server_A,Test,Yes,20/08/2022
Server_A,Test,Yes,20/09/2022
Server_A,Test,Yes,20/09/2022
Server_B,Test,NA,NA
Server_B,Test,NA,NA

Current Code So Far

import csv

input_file = 'sample.csv'
output_file = 'completed_output.csv'

with open(input_file, 'r') as inputFile, open(output_file, 'w') as outputFile:
  seen = set()
  for line in inputFile: 
    if line in seen:
      continue
    seen.add(line)
    outputFile.write(line)

Currently, this helps with duplicates and capturing the unique values. However, I cannot work the best way to remove the row that has a repeating server. However, this may not work well because the set type is unordered, so I wasn't sure the best way to compare based on column, then filter down from there.

Any suggestions or solutions that could help me would be greatly appreciated.

Current Output So Far

Name,Environment,Available,Date
Server_A,Test,NA,NA
Server_A,Test,Yes,20/08/2022
Server_A,Test,Yes,20/09/2022
Server_B,Test,NA,NA

Expected Output

Name,Environment,Available,Date
Server_A,Test,Yes,20/08/2022
Server_A,Test,Yes,20/09/2022
Server_B,Test,NA,NA

CodePudding user response:

You can use pandas instead of manually doing all of that. I have written a short function called custom filter which takes into consideration the criteria. One area of potential bugs can be the use of pd.NA, use other np.nan or None if this doesn't work accordingly.

import pandas as pd
df = pd.read_csv('sample.csv')
df = df.drop_duplicates()
data_present = []
def custom_filter(x):
  global data_present
  if x[3] == pd.NA:
    data_present.append(x[0])
    return True
  elif x[3] == pd.NA and x[0] not in data_present:
    return True
  else:
    return False
df = df.sort_values('Date')
df = df[df.apply(custom_filter, axis = 1)]
df.to_csv('completed_output.csv')
  
  • Related