Home > Blockchain >  How to insert list of values into a "cell" of a Pandas DataFrame only if condition is met
How to insert list of values into a "cell" of a Pandas DataFrame only if condition is met

Time:10-01

I would like to extract all values that meet a particular condition from a preexisting DataFrame and form them into a list. Is there a way that I can insert all of these values into each cell that meets another condition?

I began with a DataFrame similar to as follows:

Table_1

| ID    | Team Assigned | Date Team Start | Date Team End | Service Date |
__________________________________________________________________________
| 23    |  Red          |   2022-09-01    |  2022-09-29   | 2022-08-15
| 23    |  Blue         |   2022-08-01    |  2022-09-15   | 2022-09-12
| 23    |  Green        |   2022-09-27    |  2022-09-30   | 2022-09-29
| 23    |  Black        |   2022-08-01    |  2022-08-17   | 2022-05-04
| 23    |  Purple       |   2022-08-15    |  2022-09-30   | 2022-08-17
| 07    |  Blue         |   2022-07-03    |  2022-09-30   | 2022-07-05
| 07    |  Red          |   2022-07-03    |  2022-07-05   | 2022-01-01
| 07    |  Purple       |   2022-05-01    |  2022-06-24   | 2022-05-15

I want to pull teams where the ID's service Date is within the team start / end date. From there I want to paste a comma separated list into the column where the ID and service date exist.

I was able to create an object that at least gathers the information using this:

for id in df['ID']:
   a = df[(df['Service Date'] <= df['Date Team End']) &
          (df['Service Date'] >= df['Date Team Start'])
   a = a.groupby(['case no', 'Service Date'])['Team Assigned'].apply(lambda 
   x: list(np.unique(x)))

This gives me an object, but I'm not sure how I can paste those values into each cell with the respective ID and Service Date

I am aiming for something similar to the following:

Table_1

| ID    | Team Assigned | Date Team Start | Date Team End | Service Date | Teams Involved
_________________________________________________________________________|
| 23    |  Red          |   2022-09-01    |  2022-09-29   | 2022-08-15   | Blue, Black, Purple
| 23    |  Blue         |   2022-08-01    |  2022-09-15   | 2022-09-12   | Red, Blue, Purple
| 23    |  Green        |   2022-09-27    |  2022-09-30   | 2022-09-29   | Red, Green, Purple
| 23    |  Black        |   2022-08-01    |  2022-08-17   | 2022-08-01   | Blue, Black 
| 23    |  Purple       |   2022-08-15    |  2022-09-30   | 2022-08-17   | Blue, Black, Purple
| 07    |  Blue         |   2022-07-03    |  2022-09-30   | 2022-07-05   | Blue, Red
| 07    |  Red          |   2022-07-03    |  2022-07-05   | 2022-08-01   | Blue
| 07    |  Purple       |   2022-05-01    |  2022-06-24   | 2022-05-15   | Purple

EDIT:

So I think I found an (albeit unorthodox) solution that appears to work for me thus for. I used the code chunk above with some slight edits:

for id in df['ID']:
   a = df[(df['Service Date'] <= df['Date Team End']) &
          (df['Service Date'] >= df['Date Team Start'])
   a = pd.DataFrame(a.groupby(['case no', 'Service Date'])['Team 
   Assigned'].apply(lambda x: 
   list(np.unique(x)))).reset_index().rename(columns = {'Team 
   Assigned':'List of Teams'}

Then merged this dataframe with my old one:

df2 = pd.merge(df, a, on = ['ID', 'Service Date'], how = 'left)

It sure does feel messy, and I need to verify that everything is working correctly, but this at first glance seems to work.

CodePudding user response:

Considering that OP's dataframe is as follows

df = pd.DataFrame({'ID': [23, 23, 23, 23, 23, 7, 7, 7],
                     'Team Assigned': ['Red', 'Blue', 'Green', 'Black', 'Purple', 'Blue', 'Red', 'Purple'],
                        'Date Team Start': ['2022-09-01', '2022-08-01', '2022-09-27', '2022-08-01', '2022-08-15', '2022-07-03', '2022-07-03', '2022-05-01'],
                        'Date Team End': ['2022-09-29', '2022-09-15', '2022-09-30', '2022-08-17', '2022-09-30', '2022-09-30', '2022-07-05', '2022-06-24'],
                        'Service Date': ['2022-08-15', '2022-09-12', '2022-09-29', '2022-05-04', '2022-08-17', '2022-07-05', '2022-01-01', '2022-05-15']})

# Convert date columns to datetime
df['Date Team Start'] = pd.to_datetime(df['Date Team Start'])
df['Date Team End'] = pd.to_datetime(df['Date Team End'])
df['Service Date'] = pd.to_datetime(df['Service Date'])

[Out]:
   ID Team Assigned Date Team Start Date Team End Service Date
0  23           Red      2022-09-01    2022-09-29   2022-08-15
1  23          Blue      2022-08-01    2022-09-15   2022-09-12
2  23         Green      2022-09-27    2022-09-30   2022-09-29
3  23         Black      2022-08-01    2022-08-17   2022-05-04
4  23        Purple      2022-08-15    2022-09-30   2022-08-17
5   7          Blue      2022-07-03    2022-09-30   2022-07-05
6   7           Red      2022-07-03    2022-07-05   2022-01-01
7   7        Purple      2022-05-01    2022-06-24   2022-05-15

Considering that OP wants to

pull teams where the ID's service Date is within the team start / end date. From there I want to paste a comma separated list into the column where the ID and service date exist.

If I understood correctly, one can write a custom lambda function, such as

df['Teams Involved'] = df.apply(lambda x: ', '.join(df[(df['ID'] == x['ID']) & (df['Service Date'] >= x['Date Team Start']) & (df['Service Date'] <= x['Date Team End'])]['Team Assigned'].unique()), axis=1)

[Out]:
   ID Team Assigned  ... Service Date            Teams Involved
0  23           Red  ...   2022-08-15               Blue, Green
1  23          Blue  ...   2022-09-12         Red, Blue, Purple
2  23         Green  ...   2022-09-29                     Green
3  23         Black  ...   2022-05-04               Red, Purple
4  23        Purple  ...   2022-08-17  Red, Blue, Green, Purple
5   7          Blue  ...   2022-07-05                      Blue
6   7           Red  ...   2022-01-01                      Blue
7   7        Purple  ...   2022-05-15                    Purple

Notes:

  • df['Teams Involved'] is a way to create a new column

  • ', '.join means that one will be joining the list of teams with a comma and space between each team.

  • df[(df['ID'] == x['ID']) means that the ID is the same as the ID in the row.

  • (df['Service Date'] >= x['Date Team Start']) indicates that the Service Date is greater than or equal to the Date Team Start.

  • (df['Service Date'] <= x['Date Team End']) means that the Service Date is less than or equal to the Date Team End.

  • df[(df['ID'] == x['ID']) & (df['Service Date'] >= x['Date Team Start']) & (df['Service Date'] <= x['Date Team End'])] indicates that the dataframe df is filtered to only include the conditions that we just mentioned in the previous points.

  • df[...]['Team Assigned'].unique() is a way to return the unique Team Assigned that satisfy the conditions indicated before.

  • axis=1 is to indicate that the function will be applied to each row.

  • Even though the output in is not equal to the one in the question, in both it is relatively similar to it. And, as OP is "aiming for something similar", I believe that this satisfies OP's requirement.

  • Related