I am trying to write a function that takes in a dataframe, that has some columns that are within the same genre, and some columns are not. An example of the columns would be:
['id', 't_dur0', 't_dur1', 't_dur2', 't_dance0', 't_dance1', 't_dance2', 't_energy0',
't_energy1', 't_energy2']
I am tryiong to generate two new dataframes, one with the columns without duplicates, and one with only the duplicate columns with the code below:
# Function that takes in a dataframe and returns new dataframes with all the sub-dataframes
def sub_dataframes(dataframe):
copy = dataframe.copy() # To avoid SettingWithCopyWarning
# Iterate through all the columns of the df
for (col_name, col_data) in copy.iteritems():
temp = str(col_name)
rest = copy.iloc[:, 1:]
new_df = [[]]
# If it's not a duplicate, we just add it to the new df
if len(temp) < 6:
new_df[temp] = copy[col_data]
# If the length of the column name is greater than or equal to 6, we know it's a duplicate
if len(temp) >= 6:
stripped = temp.rstrip(temp[2:])
# Second for-loop to check the next column
for (col_name2, col_data2) in rest.iteritems():
temp2 = str(col_name2)
rest2 = rest.iloc[:, 1:]
only_dups = [[]]
if len(temp2) >= 6:
stripped2 = temp2.rstrip(temp2[2:])
# Compare the two column names (without the integer 0,1, or 2)
if stripped[:-1] == stripped2[:-1]:
# Create new df of the two columns
only_dups[stripped] = col_data
only_dups[stripped2] = col_data2
# Third for-loop to check the remaining columns
for (col_name3, col_data3) in rest2.iteritems():
temp3 = str(col_name3)
if len(temp3) >= 6:
stripped3 = temp3.rstrip(temp3[2:])
# Compare the two column names (without the integer 0,1, or 2)
if stripped2[:-1] == stripped3[:-1]:
only_dups[stripped3] = col_data3
print("Original:\n{}\nWithout duplicates:\n{}\nDuplicates:\n{}".format(copy, new_df, only_dups))
sub_dataframes(df)
When I run this code, I get this error:
KeyError: "None of [Int64Index([ 22352, 106534, 23608, 8655, 49670, 101988, 9136,
141284,\n 28564, 14262,\n ...\n 76690, 150965,
143106, 142370, 68004, 33980, 110832, 14491,\n 123511, 6207],\n
dtype='int64', length=2833)] are in the [columns]"
I tried looking at other questions here on StackOverflow, to see if I can fix the problem, but all I've understood so far, is that I cannot add columns the way I have now, with new_df[temp] = copy[col_data]
or only_dups[stripped] = col_data
, but I can't seem to figure out how to properly create new columns. How do I add a new column based on the variables I have now? Is it possible, or do I have to re-write the code so that it doesn't have so many for-loops?
EDIT
An example of the output I want would be:
Original:
id t_dur0 t_dur1 t_dur2 ...
0 22352 292720 293760.0 292733.0
1 106534 213760 181000.0 245973.0
2 23608 157124 130446.0 152450.0
3 8655 127896 176351.0 166968.0
4 49670 210320 226253.0 211880.0
... ... ... ... ...
Without duplicates:
id
0 22352
1 106534
2 23608
3 8655
4 49670
... ..
Duplicates:
t_dur0 t_dur1 t_dur2
0 292720 293760.0 292733.0
1 213760 181000.0 245973.0
2 157124 130446.0 152450.0
3 127896 176351.0 166968.0
4 210320 226253.0 211880.0
... ... ... ...
CodePudding user response:
IIUC:
def sub_dataframes(dataframe):
# extract common prefix -> remove trailing digits
cols = dataframe.columns.str.replace(r'\d*$', '', regex=True) \
.to_series().value_counts()
# split columns
unq_cols = cols[cols == 1].index
dup_cols = dataframe.columns[~dataframe.columns.isin(unq_cols)]
return (dataframe[unq_cols], dataframe[dup_cols])
df1, df2 = sub_dataframes(df)
Output:
>>> df1
id
0 22352
1 106534
2 23608
3 8655
4 49670
>>> df2
t_dur0 t_dur1 t_dur2
0 292720 293760.0 292733.0
1 213760 181000.0 245973.0
2 157124 130446.0 152450.0
3 127896 176351.0 166968.0
4 210320 226253.0 211880.0
CodePudding user response:
You could remove the digits and determine whether the columns become duplicated
:
mask = df.columns.str.replace(r'\d ', '', regex=True).duplicated(keep=False)
# duplicated columns
df1 = df.loc[:, mask]
# unique columns
df2 = df.loc[:, ~mask]