KeyError: "None of [['', '']] are in the [columns]" (Pandas Dataframe)-CodePudding

I am trying to write a function that takes in a dataframe, that has some columns that are within the same genre, and some columns are not. An example of the columns would be:

['id', 't_dur0', 't_dur1', 't_dur2', 't_dance0', 't_dance1', 't_dance2', 't_energy0', 
 't_energy1', 't_energy2']

I am tryiong to generate two new dataframes, one with the columns without duplicates, and one with only the duplicate columns with the code below:

# Function that takes in a dataframe and returns new dataframes with all the sub-dataframes

def sub_dataframes(dataframe):

    copy = dataframe.copy()                  # To avoid SettingWithCopyWarning

    # Iterate through all the columns of the df
    for (col_name, col_data) in copy.iteritems():

        temp = str(col_name)
        rest = copy.iloc[:, 1:]
        new_df = [[]]

        # If it's not a duplicate, we just add it to the new df
        if len(temp) < 6:
            new_df[temp] = copy[col_data]

        # If the length of the column name is greater than or equal to 6, we know it's a duplicate
        if len(temp) >= 6:
            stripped = temp.rstrip(temp[2:])

            # Second for-loop to check the next column
            for (col_name2, col_data2) in rest.iteritems():
                temp2 = str(col_name2)
                rest2 = rest.iloc[:, 1:]
                only_dups = [[]]

                if len(temp2) >= 6:
                    stripped2 = temp2.rstrip(temp2[2:])

                    # Compare the two column names (without the integer 0,1, or 2)
                    if stripped[:-1] == stripped2[:-1]:

                        # Create new df of the two columns
                        only_dups[stripped] = col_data
                        only_dups[stripped2] = col_data2

                        # Third for-loop to check the remaining columns
                        for (col_name3, col_data3) in rest2.iteritems():
                            temp3 = str(col_name3)

                            if len(temp3) >= 6:
                                stripped3 = temp3.rstrip(temp3[2:])

                                # Compare the two column names (without the integer 0,1, or 2)
                                if stripped2[:-1] == stripped3[:-1]:
                                    only_dups[stripped3] = col_data3

    print("Original:\n{}\nWithout duplicates:\n{}\nDuplicates:\n{}".format(copy, new_df, only_dups))


sub_dataframes(df)

When I run this code, I get this error:

KeyError: "None of [Int64Index([ 22352, 106534,  23608,   8655,  49670, 101988,   9136, 
141284,\n             28564,  14262,\n            ...\n             76690, 150965, 
143106, 142370,  68004,  33980, 110832,  14491,\n            123511,   6207],\n           
dtype='int64', length=2833)] are in the [columns]"

I tried looking at other questions here on StackOverflow, to see if I can fix the problem, but all I've understood so far, is that I cannot add columns the way I have now, with new_df[temp] = copy[col_data] or only_dups[stripped] = col_data, but I can't seem to figure out how to properly create new columns. How do I add a new column based on the variables I have now? Is it possible, or do I have to re-write the code so that it doesn't have so many for-loops?

EDIT

An example of the output I want would be:

Original:
        id    t_dur0    t_dur1    t_dur2    ... 
0      22352  292720  293760.0  292733.0  
1     106534  213760  181000.0  245973.0 
2      23608  157124  130446.0  152450.0  
3       8655  127896  176351.0  166968.0  
4      49670  210320  226253.0  211880.0  
...      ...     ...       ...       ...

Without duplicates:
        id  
0      22352
1     106534
2      23608
3       8655
4      49670
...      ..

Duplicates: 
      t_dur0  t_dur1    t_dur2   
0     292720  293760.0  292733.0  
1     213760  181000.0  245973.0 
2     157124  130446.0  152450.0  
3     127896  176351.0  166968.0  
4     210320  226253.0  211880.0  
...      ...     ...       ...

CodePudding user response：

IIUC:

def sub_dataframes(dataframe):
  # extract common prefix -> remove trailing digits
  cols = dataframe.columns.str.replace(r'\d*$', '', regex=True) \
                  .to_series().value_counts()

  # split columns
  unq_cols = cols[cols == 1].index
  dup_cols = dataframe.columns[~dataframe.columns.isin(unq_cols)]

  return (dataframe[unq_cols], dataframe[dup_cols])

df1, df2 = sub_dataframes(df)

Output:

>>> df1
       id
0   22352
1  106534
2   23608
3    8655
4   49670

>>> df2
   t_dur0    t_dur1    t_dur2
0  292720  293760.0  292733.0
1  213760  181000.0  245973.0
2  157124  130446.0  152450.0
3  127896  176351.0  166968.0
4  210320  226253.0  211880.0

CodePudding user response：

You could remove the digits and determine whether the columns become duplicated:

mask = df.columns.str.replace(r'\d ', '', regex=True).duplicated(keep=False)

# duplicated columns
df1 = df.loc[:, mask]

# unique columns
df2 = df.loc[:, ~mask]