I am fairly new to this. I am experimenting with data frames in Python and am stuck on something. I need to find the columns of a data frame whose sorted unique elements are evenly spaced, i.e. the difference between consecutive sorted unique values is the same throughout. I can do this in stand-alone code, but I want to do it dynamically on a data frame loaded from a file.
import numpy as np
import pandas as pd
first = [20, 10, 40, 30, 10]
sec = [94, 74, 34, 80]
df = pd.DataFrame([(first,sec) for first,sec in zip(first,sec)])
print(df)
cols = list(df.columns)
sorted_df = df.sort_values(by = cols, ascending = True)
print("sorted - \n", sorted_df)
all_unique = [sorted_df[col].unique() for col in cols]
print("UNIQUE:\n", all_unique)
diff = [np.diff(lst) for lst in all_unique]
print("DIFF - \n", diff)
This gives me a list of arrays holding the differences. Now I need to check whether all the elements in each array are the same and, if so, return the name of the corresponding column, be it first or sec. The output I got is:
0 1
0 20 94
1 10 74
2 20 34
3 30 80
sorted -
0 1
1 10 74
2 20 34
0 20 94
3 30 80
UNIQUE:
[array([10, 20, 30]), array([74, 34, 94, 80])]
DIFF -
[array([10, 10]), array([-40, 60, -14])]
After this, I need to return the names of the columns (or lists) whose differences are all equal. The desired output is a list of the names of the columns whose sorted unique elements have a constant difference. So here it should be:
output - ['first']
CodePudding user response:
Use a list comprehension that tests whether the differences between the sorted values are all the same:
# without removing duplicates first
output = [col for col in cols if df[col].sort_values().diff().nunique() == 1]
print("OUT - \n", output)
[0]
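Note that this first variant keeps duplicates, so a repeated value contributes a difference of 0 and the column is rejected. A minimal sketch of the effect, using a made-up Series `s` that is only for illustration:

```python
import pandas as pd

# Hypothetical column containing a repeated value (20 appears twice).
s = pd.Series([20, 10, 20, 30])

# Keeping duplicates: sorted values are 10, 20, 20, 30,
# so the differences are 10, 0, 10 (two distinct values).
with_dupes = s.sort_values().diff().nunique()

# Dropping duplicates first: sorted values are 10, 20, 30,
# so the differences are 10, 10 (one distinct value).
without_dupes = s.drop_duplicates().sort_values().diff().nunique()

print(with_dupes, without_dupes)
```

If the columns can contain repeated values, the variants that remove duplicates are probably the ones you want.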
# after removing duplicates
output = [col for col in cols
          if df[col].drop_duplicates().sort_values().diff().nunique() == 1]
Or:
output = [col for col in cols if np.unique(np.diff(np.unique(df[col]))).shape[0] == 1]
print("OUT - \n", output)
[0]
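The result is `[0]` rather than `['first']` because the data frame was built without column labels. Below is a rough sketch that wraps the check in a function and reads a labelled frame from CSV text. The helper name `evenly_spaced_columns` and the CSV content are assumptions for illustration, and `io.StringIO` stands in for a real file path passed to `pd.read_csv`.

```python
import io

import numpy as np
import pandas as pd


def evenly_spaced_columns(df):
    """Return the columns whose sorted unique values are evenly spaced."""
    result = []
    for col in df.columns:
        # np.unique returns the sorted unique values of the column.
        steps = np.diff(np.unique(df[col]))
        # Require at least one difference, and all differences equal.
        if steps.size > 0 and np.all(steps == steps[0]):
            result.append(col)
    return result


# Stand-in for a file on disk; the values mirror the question's printed output.
csv_text = "first,sec\n20,94\n10,74\n20,34\n30,80\n"
df = pd.read_csv(io.StringIO(csv_text))

output = evenly_spaced_columns(df)
print("OUT - \n", output)
```

With these values only `first` qualifies, since its sorted unique values 10, 20, 30 are 10 apart, while those of `sec` (34, 74, 80, 94) are not evenly spaced.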