I have two simplified csv files below (uploaded as images and link in google drive) and I would like to ask how can I import both files in a python list format where I can iterate over the "templates.csv" and find the best match (if no perfect match exist).
https://drive.google.com/drive/folders/1dOv-8fQgeGEsaPMXYU8kbo2mY4B2KkDH?usp=sharing
test.csv = enter image description here
templates.csv = enter image description here
My current issue is how can I import the csv files in such a way that it will be read as a list where: test.csv file becomes:
test = ['AC_CURRENT', 'AC_POWER', 'AC_POWER.MAX', 'DC_CURRENT', 'DC_VOLTAGE', ....]
templates.csv file becomes a list of templates:
template1 = ['AC_CURRENT', 'AC_CURRENT_15MIN', 'AC_CURRENT_A', 'AC_CURRENT_B', 'AC_CURRENT_C',...]
template2 = ['AC_CURRENT', 'AC_POWER', 'AC_POWER.MAX', 'DC_CURRENT', 'DC_VOLTAGE', ....]
template3 = ['AC_CURRENT', 'AC_POWER', 'AC_POWER.MAX', 'AC_VOLTAGE', 'DC_CURRENT', ....]
templateN = ....
The first issue I have is reading both files as a dataframe where it will be converted as a list above then I can perform the iteration, finding a good match, and if no perfect match exists, it will give me a recommendation in percentage which template gives the most attributes match and will also list the missing attributes from my test.csv file that didnt have any match.
Take note that I cannot manipulate the excel file above in a easy format to pull as a list as we natively pull this from a software where these templates sits.
Thanks a lot for the help
I only have this line of codes for now which just reads the files as dataframe using pandas:
import pandas as pd
df1 = pd.read_csv('test.csv', sep=',')
df2 = pd.read_csv('templates.csv', sep=',')
print(df1)
print(df2)
CodePudding user response:
Take this example:
import pandas as pd
df = pd.DataFrame({"Parent": ["x", "x", "y", "y", "z"], "Name": [1, 2, 3, 4, 5]})
> Parent Name
0 x 1
1 x 2
2 y 3
3 y 4
4 z 5
# group by 'Parent', aggregate group values to lists for each column, then get column 'Name'
grouped_as_series = df.groupby('Parent').agg(list)['Name']
> Parent
x [1, 2]
y [3, 4]
z [5]
# to access single values
print(grouped_as_series["x"])
> [1, 2]
So for your case:
test = df1.groupyby('Parent').agg(list)['Name']['test']
series_of_templates = df2.groupyby('Parent').agg(list)['Name']
Note that I'm not sure if I understand what you plan to do with the data afterwards so a series might not be the best way to store the extracted data. E.g. a dictionary or list might be better.
CodePudding user response:
I have the below code to make the check
import pandas as pd
df1 = pd.read_csv('test.csv', sep=',')
df2 = pd.read_csv('templates.csv', sep=',')
test = df1.groupby('Parent').agg(list)['Name']['test']
series_of_templates = df2.groupby('Parent').agg(list)['Name']
#print(test)
#print(series_of_templates)
same_values = set(test) & set(series_of_templates['template1'])
print (same_values)