How do I create a column with a list of only elements from other groups-CodePudding

I have a data frame that has a list of names by company and subgroups of studios.

year	company_id	studio_id	employees
2000	1111	1	James wan,Tessa Belle
2000	1111	2	Hanako Nana,Julian Elijah
2000	1111	3	Jerome Venmo,Mandarin Chicken
2000	6711	1	Hainz Chin,Jimmy Kim
2000	6711	2	Gana Cho,Jesus Christ
2001	1111	2	Hanako Nana,Julian Elijah,James Wan
2001	1111	3	Jerome Venmo,Mandarin Chicken

I want to add another column that has other studio's employees for each year.

year	company_id	studio_id	employees	other_studio_emp
2000	1111	1	James wan,Tessa Belle	Hanako Nana,Julian Elijah,Jerome Venmo,Mandarin Chicken
2000	1111	2	Hanako Nana,Julian Elijah	James wan,Tessa Belle,Jerome Venmo,Mandarin Chicken
2000	1111	3	Jerome Venmo,Mandarin Chicken	James wan,Tessa Belle,Hanako Nana,Julian Elijah
2000	6711	1	Hainz Chin,Jimmy Kim	Gana Cho,Jesus Christ
2000	6711	2	Gana Cho,Jesus Christ	Hainz Chin,Jimmy Kim
2001	1111	2	Hanako Nana,Julian Elijah,James Wan	Jerome Venmo,Mandarin Chicken
2001	1111	3	Jerome Venmo,Mandarin Chicken	Hanako Nana,Julian Elijah,James Wan

How should I apporach this?

CodePudding user response：

So, given the following dataframe:

from copy import deepcopy

import pandas as pd

df = pd.DataFrame(
    {
        "year": {0: 2000, 1: 2000, 2: 2000, 3: 2000, 4: 2000, 5: 2001, 6: 2001},
        "company_id": {0: 1111, 1: 1111, 2: 1111, 3: 6711, 4: 6711, 5: 1111, 6: 1111},
        "studio_id": {0: 1, 1: 2, 2: 3, 3: 1, 4: 2, 5: 2, 6: 3},
        "employees": {
            0: "James wan,Tessa Belle",
            1: "Hanako Nana,Julian Elijah",
            2: "Jerome Venmo,Mandarin Chicken",
            3: "Hainz Chin,Jimmy Kim",
            4: "Gana Cho,Jesus Christ",
            5: "Hanako Nana,Julian Elijah,James Wan",
            6: "Jerome Venmo,Mandarin Chicken",
        },
    }
)

You could collect all employees for each year/company_id pair, like this:

employees = {
    (year, company_id): list(
        df.loc[
            (df["year"] == year) & (df["company_id"] == company_id), "employees"
        ].values
    )
    for year in df["year"].unique()
    for company_id in df["company_id"].unique()
}
employees = {key: value for key, value in employees.items() if value}  # remove empty values

And then, in a new column named "other_studio_emp", update each row with employees that are not already listed in "employees" column, like this:

for i, row in df.iterrows():
    other_employees = deepcopy(employees)[row["year"], row["company_id"]]
    other_employees.remove(row["employees"])
    df.loc[i, "other_studio_emp"] = ",".join(other_employees)

And here is the expected result:

print(df)
# Outputs
   year  company_id  studio_id                            employees  \
0  2000        1111          1                James wan,Tessa Belle   
1  2000        1111          2            Hanako Nana,Julian Elijah   
2  2000        1111          3        Jerome Venmo,Mandarin Chicken   
3  2000        6711          1                 Hainz Chin,Jimmy Kim   
4  2000        6711          2                Gana Cho,Jesus Christ   
5  2001        1111          2  Hanako Nana,Julian Elijah,James Wan   
6  2001        1111          3        Jerome Venmo,Mandarin Chicken   

                                          other_studio_emp  
0  Hanako Nana,Julian Elijah,Jerome Venmo,Mandarin Chicken  
1      James wan,Tessa Belle,Jerome Venmo,Mandarin Chicken  
2          James wan,Tessa Belle,Hanako Nana,Julian Elijah  
3                                    Gana Cho,Jesus Christ  
4                                     Hainz Chin,Jimmy Kim  
5                            Jerome Venmo,Mandarin Chicken  
6                      Hanako Nana,Julian Elijah,James Wan

CodePudding user response：

import pandas as pd
import csv
import copy

filepath = "C:/Users/Untitled Folder/studio_level_dev.csv"
df = pd.read_csv(filepath,encoding='utf-8')
studio_dev_emp_year = {
    (year, dev_parent): list(
        df.loc[
            (df["year"] == year) & (df["dev_parent"] == dev_parent), "studio_dev_emp_year"
        ].values
    )
    for year in df["year"].unique()
    for dev_parent in df["dev_parent"].unique()
}
studio_dev_emp_year = {key: value for key, value in studio_dev_emp_year.items() if value}

for i, row in df.iterrows():
    other_employees = copy.deepcopy(studio_dev_emp_year)[row["year"], row["dev_parent"]]
    other_employees.remove(row["studio_dev_emp_year"])
    df[i]= df.loc[df[i].notna(), "other_studio_emp"] = ",".join(other_employees)
    
df.to_csv('studio_level.csv',mode='a',index=False, encoding="utf-8")

I think I figure out the deepcopy issue. now I'm getting an error saying

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-15-570528ccc15c> in <module>
     19     other_employees = copy.deepcopy(studio_dev_emp_year)[row["year"], row["dev_parent"]]
     20     other_employees.remove(row["studio_dev_emp_year"])
---> 21     df[i]= df.loc[df[i].notna(), "other_studio_emp"] = ",".join(other_employees)
     22 
     23 df.to_csv('studio_level.csv',mode='a',index=False, encoding="utf-8")

TypeError: sequence item 0: expected str instance, float found

I think they find float because there are NaN or empty cases. I tried to mitigate this by filtering but it doesn't seem to work... any ideas?