I have code that goes through a CSV, finds the meaningful columns for me, and then drops every column that is not in that list. It works, but I now want it to drop all columns not in found except one called "MATNR". What can I add to the drop statement so that the undesired columns are still dropped but "MATNR" is kept?
# Import Data Quality Rules (useful attributes)
import re

# Capture the run of uppercase letters that follows a dot
rexp = re.compile(r'\.([A-Z]+)')
found = []
with open('DataRules.csv') as f:
    for line in f:
        found.extend(rexp.findall(line))
# Get rid of columns that are not mentioned in rules (except MATNR)
df.drop(columns=[col for col in df if col not in found], inplace=True)
# Get rid of duplicated rows
df = df.drop_duplicates()
CodePudding user response:
Add a second condition, col != 'MATNR', to the list comprehension so that MATNR is never selected for dropping:

# Import Data Quality Rules (useful attributes)
import re

# Capture the run of uppercase letters that follows a dot
rexp = re.compile(r'\.([A-Z]+)')
found = []
with open('DataRules.csv') as f:
    for line in f:
        found.extend(rexp.findall(line))
# Get rid of columns that are not mentioned in rules (except MATNR)
df.drop(columns=[col for col in df if col not in found and col != 'MATNR'], inplace=True)
# Get rid of duplicated rows
df = df.drop_duplicates()
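
An alternative that may read more clearly is to build the set of columns to keep and select those, rather than listing the ones to drop. The following is only a sketch: the rule lines and the DataFrame contents are made up for illustration (they stand in for DataRules.csv and the real df), and the pattern assumes column names are runs of uppercase letters following a dot.

```python
import re

import pandas as pd

# Hypothetical rule lines standing in for the contents of DataRules.csv.
rule_lines = ["MARA.MTART is not null", "MARA.MEINS in ('EA', 'KG')"]

# Assumed pattern: a dot followed by one or more uppercase letters.
rexp = re.compile(r"\.([A-Z]+)")
found = []
for line in rule_lines:
    found.extend(rexp.findall(line))

# Hypothetical frame; only MTART and MEINS are mentioned in the rules.
df = pd.DataFrame(
    {
        "MATNR": [1, 1, 2],
        "MTART": ["A", "A", "B"],
        "MEINS": ["EA", "EA", "KG"],
        "ERNAM": ["x", "y", "z"],
    }
)

# Columns to keep: everything found in the rules, plus MATNR.
keep = set(found) | {"MATNR"}

# Select the kept columns in their original order, then drop duplicate rows.
df = df.loc[:, [col for col in df.columns if col in keep]]
df = df.drop_duplicates()
```

Using a set for keep also makes the membership test cheaper than scanning the found list for every column, and more always-kept columns can be added in one place.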