I have a dataframe with three columns. The first column specifies a group into which each row is classified. Each group normally consists of 3 data points (rows), but it is possible for the last group to be "cut off," and contain fewer than three data points. In the real world, this could be due to the experiment or data collection process being cut off prematurely. In the below example, group 3 is cut off and only contains one data point.
import pandas as pd
data = {
"group_id": [0, 0, 0, 1, 1, 1, 2, 2, 2, 3],
"valueA": [420, 380, 390, 500, 270, 220, 150, 400, 330, 170],
"valueB": [50, 40, 45, 22, 20, 50, 10, 60, 90, 10]
}
#load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
I also have two lists with additional values.
x_list = [1, 3, 5]
y_list = [2, 4, 6]
I want to add these lists to my dataframe as new columns, and have the values repeat for each group. In other words, I want my output to look like this.
group_id valueA valueB x y
0 0 420 50 1 2
1 0 380 40 3 4
2 0 390 45 5 6
3 1 500 22 1 2
4 1 270 20 3 4
5 1 220 50 5 6
6 2 150 10 1 2
7 2 400 60 3 4
8 2 330 90 5 6
9 3 170 10 1 2
Notice that even though the length of a column is not divisible by the length of the shorter lists, the number of rows in the dataframe does not change.
How do I achieve this without losing dataframe rows or adding new rows with NaN values?
CodePudding user response:
You can use GroupBy.cumcount
to generate a indexer, then use this to duplicate the values in order of the groups:
new = pd.DataFrame({'x': x_list, 'y': y_list})
idx = df.groupby('group_id').cumcount()
df[['x', 'y']] = new.reindex(idx).to_numpy()
Output:
group_id valueA valueB x y
0 0 420 50 1 2
1 0 380 40 3 4
2 0 390 45 5 6
3 1 500 22 1 2
4 1 270 20 3 4
5 1 220 50 5 6
6 2 150 10 1 2
7 2 400 60 3 4
8 2 330 90 5 6
9 3 170 10 1 2
CodePudding user response:
As your lists have the same length, you can use:
df[['x', 'y']] = (pd.DataFrame({'x': x_list, 'y': y_list})
.reindex(df.groupby('group_id').cumcount().mod(3)).values)
print(df)
# Output
group_id valueA valueB x y
0 0 420 50 1 2
1 0 380 40 3 4
2 0 390 45 5 6
3 1 500 22 1 2
4 1 270 20 3 4
5 1 220 50 5 6
6 2 150 10 1 2
7 2 400 60 3 4
8 2 330 90 5 6
9 3 170 10 1 2
CodePudding user response:
Let's use np.resize:
import pandas as pd
import numpy as np
data = {
"group_id": [0, 0, 0, 1, 1, 1, 2, 2, 2, 3],
"valueA": [420, 380, 390, 500, 270, 220, 150, 400, 330, 170],
"valueB": [50, 40, 45, 22, 20, 50, 10, 60, 90, 10]
}
#load data into a DataFrame object:
df = pd.DataFrame(data)
df['x'] = np.resize(x_list, len(df))
df['y'] = np.resize(y_list, len(df))
df
Output:
group_id valueA valueB x y
0 0 420 50 1 2
1 0 380 40 3 4
2 0 390 45 5 6
3 1 500 22 1 2
4 1 270 20 3 4
5 1 220 50 5 6
6 2 150 10 1 2
7 2 400 60 3 4
8 2 330 90 5 6
9 3 170 10 1 2
CodePudding user response:
An alternative in case of having lists of different sizes:
lambda_duplicator = lambda lista, lenn, shape : (lista*int(1 shape/lenn))[:shape]
df['x'] = lambda_duplicator(x_list, len(x_list), df.shape[0])
df['y'] = lambda_duplicator(y_list, len(y_list), df.shape[0])