I have a data frame that has a list of names by company and subgroups of studios.
year | company_id | studio_id | employees |
---|---|---|---|
2000 | 1111 | 1 | James wan,Tessa Belle |
2000 | 1111 | 2 | Hanako Nana,Julian Elijah |
2000 | 1111 | 3 | Jerome Venmo,Mandarin Chicken |
2000 | 6711 | 1 | Hainz Chin,Jimmy Kim |
2000 | 6711 | 2 | Gana Cho,Jesus Christ |
2001 | 1111 | 2 | Hanako Nana,Julian Elijah,James Wan |
2001 | 1111 | 3 | Jerome Venmo,Mandarin Chicken |
I want to add another column that has other studio's employees for each year.
year | company_id | studio_id | employees | other_studio_emp |
---|---|---|---|---|
2000 | 1111 | 1 | James wan,Tessa Belle | Hanako Nana,Julian Elijah,Jerome Venmo,Mandarin Chicken |
2000 | 1111 | 2 | Hanako Nana,Julian Elijah | James wan,Tessa Belle,Jerome Venmo,Mandarin Chicken |
2000 | 1111 | 3 | Jerome Venmo,Mandarin Chicken | James wan,Tessa Belle,Hanako Nana,Julian Elijah |
2000 | 6711 | 1 | Hainz Chin,Jimmy Kim | Gana Cho,Jesus Christ |
2000 | 6711 | 2 | Gana Cho,Jesus Christ | Hainz Chin,Jimmy Kim |
2001 | 1111 | 2 | Hanako Nana,Julian Elijah,James Wan | Jerome Venmo,Mandarin Chicken |
2001 | 1111 | 3 | Jerome Venmo,Mandarin Chicken | Hanako Nana,Julian Elijah,James Wan |
How should I apporach this?
CodePudding user response:
So, given the following dataframe:
from copy import deepcopy
import pandas as pd
df = pd.DataFrame(
{
"year": {0: 2000, 1: 2000, 2: 2000, 3: 2000, 4: 2000, 5: 2001, 6: 2001},
"company_id": {0: 1111, 1: 1111, 2: 1111, 3: 6711, 4: 6711, 5: 1111, 6: 1111},
"studio_id": {0: 1, 1: 2, 2: 3, 3: 1, 4: 2, 5: 2, 6: 3},
"employees": {
0: "James wan,Tessa Belle",
1: "Hanako Nana,Julian Elijah",
2: "Jerome Venmo,Mandarin Chicken",
3: "Hainz Chin,Jimmy Kim",
4: "Gana Cho,Jesus Christ",
5: "Hanako Nana,Julian Elijah,James Wan",
6: "Jerome Venmo,Mandarin Chicken",
},
}
)
You could collect all employees for each year/company_id pair, like this:
employees = {
(year, company_id): list(
df.loc[
(df["year"] == year) & (df["company_id"] == company_id), "employees"
].values
)
for year in df["year"].unique()
for company_id in df["company_id"].unique()
}
employees = {key: value for key, value in employees.items() if value} # remove empty values
And then, in a new column named "other_studio_emp", update each row with employees that are not already listed in "employees" column, like this:
for i, row in df.iterrows():
other_employees = deepcopy(employees)[row["year"], row["company_id"]]
other_employees.remove(row["employees"])
df.loc[i, "other_studio_emp"] = ",".join(other_employees)
And here is the expected result:
print(df)
# Outputs
year company_id studio_id employees \
0 2000 1111 1 James wan,Tessa Belle
1 2000 1111 2 Hanako Nana,Julian Elijah
2 2000 1111 3 Jerome Venmo,Mandarin Chicken
3 2000 6711 1 Hainz Chin,Jimmy Kim
4 2000 6711 2 Gana Cho,Jesus Christ
5 2001 1111 2 Hanako Nana,Julian Elijah,James Wan
6 2001 1111 3 Jerome Venmo,Mandarin Chicken
other_studio_emp
0 Hanako Nana,Julian Elijah,Jerome Venmo,Mandarin Chicken
1 James wan,Tessa Belle,Jerome Venmo,Mandarin Chicken
2 James wan,Tessa Belle,Hanako Nana,Julian Elijah
3 Gana Cho,Jesus Christ
4 Hainz Chin,Jimmy Kim
5 Jerome Venmo,Mandarin Chicken
6 Hanako Nana,Julian Elijah,James Wan
CodePudding user response:
import pandas as pd
import csv
import copy
filepath = "C:/Users/Untitled Folder/studio_level_dev.csv"
df = pd.read_csv(filepath,encoding='utf-8')
studio_dev_emp_year = {
(year, dev_parent): list(
df.loc[
(df["year"] == year) & (df["dev_parent"] == dev_parent), "studio_dev_emp_year"
].values
)
for year in df["year"].unique()
for dev_parent in df["dev_parent"].unique()
}
studio_dev_emp_year = {key: value for key, value in studio_dev_emp_year.items() if value}
for i, row in df.iterrows():
other_employees = copy.deepcopy(studio_dev_emp_year)[row["year"], row["dev_parent"]]
other_employees.remove(row["studio_dev_emp_year"])
df[i]= df.loc[df[i].notna(), "other_studio_emp"] = ",".join(other_employees)
df.to_csv('studio_level.csv',mode='a',index=False, encoding="utf-8")
I think I figure out the deepcopy issue. now I'm getting an error saying
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-15-570528ccc15c> in <module>
19 other_employees = copy.deepcopy(studio_dev_emp_year)[row["year"], row["dev_parent"]]
20 other_employees.remove(row["studio_dev_emp_year"])
---> 21 df[i]= df.loc[df[i].notna(), "other_studio_emp"] = ",".join(other_employees)
22
23 df.to_csv('studio_level.csv',mode='a',index=False, encoding="utf-8")
TypeError: sequence item 0: expected str instance, float found
I think they find float because there are NaN or empty cases. I tried to mitigate this by filtering but it doesn't seem to work... any ideas?