Import csv/xlsx file into python as a list and find a template match-CodePudding

I have two simplified csv files below (uploaded as images and link in google drive) and I would like to ask how can I import both files in a python list format where I can iterate over the "templates.csv" and find the best match (if no perfect match exist).

https://drive.google.com/drive/folders/1dOv-8fQgeGEsaPMXYU8kbo2mY4B2KkDH?usp=sharing

test.csv = enter image description here

templates.csv = enter image description here

My current issue is how can I import the csv files in such a way that it will be read as a list where: test.csv file becomes:

test = ['AC_CURRENT', 'AC_POWER', 'AC_POWER.MAX', 'DC_CURRENT', 'DC_VOLTAGE', ....]

templates.csv file becomes a list of templates:

template1 = ['AC_CURRENT', 'AC_CURRENT_15MIN', 'AC_CURRENT_A', 'AC_CURRENT_B', 'AC_CURRENT_C',...]

template2 = ['AC_CURRENT', 'AC_POWER', 'AC_POWER.MAX', 'DC_CURRENT', 'DC_VOLTAGE', ....]

template3 = ['AC_CURRENT', 'AC_POWER', 'AC_POWER.MAX', 'AC_VOLTAGE', 'DC_CURRENT', ....]

templateN = ....

The first issue I have is reading both files as a dataframe where it will be converted as a list above then I can perform the iteration, finding a good match, and if no perfect match exists, it will give me a recommendation in percentage which template gives the most attributes match and will also list the missing attributes from my test.csv file that didnt have any match.

Take note that I cannot manipulate the excel file above in a easy format to pull as a list as we natively pull this from a software where these templates sits.

Thanks a lot for the help

I only have this line of codes for now which just reads the files as dataframe using pandas:

import pandas as pd

df1 = pd.read_csv('test.csv', sep=',')
df2 = pd.read_csv('templates.csv', sep=',')
print(df1)
print(df2)

enter image description here

CodePudding user response：

Take this example:

import pandas as pd

df = pd.DataFrame({"Parent": ["x", "x", "y", "y", "z"], "Name": [1, 2, 3, 4, 5]})
>   Parent  Name
  0      x     1
  1      x     2
  2      y     3
  3      y     4
  4      z     5

# group by 'Parent', aggregate group values to lists for each column, then get column 'Name'
grouped_as_series = df.groupby('Parent').agg(list)['Name']
> Parent
  x    [1, 2]
  y    [3, 4]
  z       [5]

# to access single values
print(grouped_as_series["x"])
> [1, 2]

So for your case:

test = df1.groupyby('Parent').agg(list)['Name']['test']
series_of_templates = df2.groupyby('Parent').agg(list)['Name']

Note that I'm not sure if I understand what you plan to do with the data afterwards so a series might not be the best way to store the extracted data. E.g. a dictionary or list might be better.

CodePudding user response：

I have the below code to make the check

import pandas as pd

df1 = pd.read_csv('test.csv', sep=',')
df2 = pd.read_csv('templates.csv', sep=',')

test = df1.groupby('Parent').agg(list)['Name']['test']
series_of_templates = df2.groupby('Parent').agg(list)['Name']

#print(test)
#print(series_of_templates)

same_values = set(test) & set(series_of_templates['template1'])
print (same_values)