Home > Software engineering >  Adding dataframe columns from shorter lists
Adding dataframe columns from shorter lists

Time:07-01

I have a dataframe with three columns. The first column specifies a group into which each row is classified. Each group normally consists of 3 data points (rows), but it is possible for the last group to be "cut off," and contain fewer than three data points. In the real world, this could be due to the experiment or data collection process being cut off prematurely. In the below example, group 3 is cut off and only contains one data point.

import pandas as pd

data = {
    "group_id": [0, 0, 0, 1, 1, 1, 2, 2, 2, 3], 
    "valueA": [420, 380, 390, 500, 270, 220, 150, 400, 330, 170],
    "valueB": [50, 40, 45, 22, 20, 50, 10, 60, 90, 10]
}

#load data into a DataFrame object:
df = pd.DataFrame(data)

print(df)

I also have two lists with additional values.

x_list = [1, 3, 5]
y_list = [2, 4, 6]

I want to add these lists to my dataframe as new columns, and have the values repeat for each group. In other words, I want my output to look like this.

   group_id  valueA  valueB  x  y
0         0     420      50  1  2
1         0     380      40  3  4
2         0     390      45  5  6
3         1     500      22  1  2
4         1     270      20  3  4
5         1     220      50  5  6
6         2     150      10  1  2
7         2     400      60  3  4
8         2     330      90  5  6
9         3     170      10  1  2

Notice that even though the length of a column is not divisible by the length of the shorter lists, the number of rows in the dataframe does not change.

How do I achieve this without losing dataframe rows or adding new rows with NaN values?

CodePudding user response:

You can use GroupBy.cumcount to generate a indexer, then use this to duplicate the values in order of the groups:

new = pd.DataFrame({'x': x_list, 'y': y_list})
idx = df.groupby('group_id').cumcount()
df[['x', 'y']] = new.reindex(idx).to_numpy()

Output:

   group_id  valueA  valueB  x  y
0         0     420      50  1  2
1         0     380      40  3  4
2         0     390      45  5  6
3         1     500      22  1  2
4         1     270      20  3  4
5         1     220      50  5  6
6         2     150      10  1  2
7         2     400      60  3  4
8         2     330      90  5  6
9         3     170      10  1  2

CodePudding user response:

As your lists have the same length, you can use:

df[['x', 'y']] = (pd.DataFrame({'x': x_list, 'y': y_list})
                    .reindex(df.groupby('group_id').cumcount().mod(3)).values)
print(df)

# Output
   group_id  valueA  valueB  x  y
0         0     420      50  1  2
1         0     380      40  3  4
2         0     390      45  5  6
3         1     500      22  1  2
4         1     270      20  3  4
5         1     220      50  5  6
6         2     150      10  1  2
7         2     400      60  3  4
8         2     330      90  5  6
9         3     170      10  1  2

CodePudding user response:

Let's use np.resize:

import pandas as pd
import numpy as np

data = {
    "group_id": [0, 0, 0, 1, 1, 1, 2, 2, 2, 3], 
    "valueA": [420, 380, 390, 500, 270, 220, 150, 400, 330, 170],
    "valueB": [50, 40, 45, 22, 20, 50, 10, 60, 90, 10]
}

#load data into a DataFrame object:
df = pd.DataFrame(data)
df['x'] = np.resize(x_list, len(df))
df['y'] = np.resize(y_list, len(df))
df

Output:

   group_id  valueA  valueB  x  y
0         0     420      50  1  2
1         0     380      40  3  4
2         0     390      45  5  6
3         1     500      22  1  2
4         1     270      20  3  4
5         1     220      50  5  6
6         2     150      10  1  2
7         2     400      60  3  4
8         2     330      90  5  6
9         3     170      10  1  2

CodePudding user response:

An alternative in case of having lists of different sizes:

lambda_duplicator = lambda lista, lenn, shape : (lista*int(1   shape/lenn))[:shape]

df['x'] = lambda_duplicator(x_list, len(x_list), df.shape[0])
df['y'] = lambda_duplicator(y_list, len(y_list), df.shape[0])
  • Related