I have a CSV file that looks something like this:
file.csv
name,apptype
AppABC,python
appabc,python
AppABB,python
AppABA,python
Appaba,python
I need a way to determine whether any "name" appears more than once when case is ignored, and to report those rows.
In this case I should know that the following are duplicates:
AppABC,python
appabc,python
AppABA,python
Appaba,python
This is what I tried, but it does not work:
with open(appcsv_path) as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    for name in csv_reader:
        re.findall(name, csv_reader, flags=re.IGNORECASE)
This results in an error:
TypeError: unhashable type: 'list'
I also tried the pandas method from the answer below, changing the column from "name" to "Name":
df = pd.read_csv(appcsv_path)
out = df[df.Name.str.strip().str.lower().duplicated(keep=False)].loc[0:0]
print(out.to_string(index=False))
Results in:
Empty DataFrame
Columns: [Name, Type]
Index: []
CodePudding user response:
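Here is a pandas solution using duplicated on the stripped, lowercased names. Passing keep=False marks every member of a duplicate group, not only the later occurrences: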
import pandas as pd
df = pd.read_csv(appcsv_path)
out = df[df.name.str.strip().str.lower().duplicated(keep=False)].loc[:,'name']
Printing it gives the expected output:
print(out.to_string(index=False))
AppABC
appabc
AppABA
Appaba
To keep both columns, drop the column selection:
out = df[df.name.str.strip().str.lower().duplicated(keep=False)]
print(out.to_string(index=False))
which gives:
name apptype
AppABC python
appabc python
AppABA python
Appaba python
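If you would rather not depend on pandas, the same idea can be sketched with the standard library alone. The original loop failed because each row yielded by csv.reader is a list, while re.findall expects a string pattern. The version below skips regular expressions and groups rows by their normalized name instead. The function name find_case_insensitive_duplicates and the in-memory sample are assumptions made for illustration; in a real script you would pass open(appcsv_path, newline="") instead.

```python
import csv
import io
from collections import defaultdict


def find_case_insensitive_duplicates(csv_file, column="name"):
    """Return rows whose `column` value repeats once case and
    surrounding whitespace are ignored (hypothetical helper)."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        # Normalize the same way as the pandas answer: strip, then lowercase
        key = row[column].strip().lower()
        groups[key].append(row)
    # Keep only names seen more than once; rows come out grouped by name
    return [row for rows in groups.values() if len(rows) > 1 for row in rows]


# Sample data copied from the question, held in memory for the demo
sample = io.StringIO(
    "name,apptype\n"
    "AppABC,python\n"
    "appabc,python\n"
    "AppABB,python\n"
    "AppABA,python\n"
    "Appaba,python\n"
)

duplicates = find_case_insensitive_duplicates(sample)
for row in duplicates:
    print(f"{row['name']},{row['apptype']}")
```

Note that the rows are returned grouped by normalized name rather than strictly in file order. For the sample data the two orders happen to coincide.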