Imagine I have a dataframe that contains a candidate and his skills in varrious languages both written and spoken:
df = pd.DataFrame({'candidate': ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd', 'd'],
'type': ['spoken', 'written', 'spoken', 'written', 'spoken', 'written', 'spoken', 'written', 'written', 'written'],
'language': ['English', 'German', 'French', 'English', 'English', 'English', 'French', 'English', 'German', 'French'],
'skill': [5, 4, 4, 6, 8, 1, 3, 5, 2, 2]})
result:
candidate type language skill
a spoken English 5
a written German 4
a spoken French 4
b written English 6
b spoken English 8
c written English 1
c spoken French 3
d written English 5
d written German 2
d written French 2
and another df with languages:
languages = pd.DataFrame({'language': ['English', 'English', 'French', 'French', 'German', 'German'],
'type': ['spoken', 'written', 'spoken', 'written', 'spoken', 'written']})
result:
language type
0 English spoken
1 English written
2 French spoken
3 French written
4 German spoken
5 German written
What I need to get is a dataframe that combines df and all possible combinations of its merge with languages, so:
candidate type language skill
a spoken English 5
a written English NA
a spoken German NA
a written German 4
a spoken French 4
a written French NA
b spoken English 8
b written English 6
b spoken French NA
b written French NA
...
d spoken English NA
d written English 5
d spoken French NA
d written French 2
d spoken German NA
d written German 2
and so on. I was trying to add a 'valid' column filled with 'valid' value and use all kind of merges on these dataframes but it always returns only df. Is there any swift way to cope with it in pandas?
CodePudding user response:
Try:
def fn(x):
x = x.merge(languages, how="outer")
x["candidate"] = x["candidate"].ffill().bfill()
return x
df = (
df.groupby("candidate")
.apply(fn)
.reset_index(drop=True)
.sort_values(["candidate", "language", "type"])
)
print(df)
Prints:
candidate type language skill
0 a spoken English 5.0
3 a written English NaN
2 a spoken French 4.0
4 a written French NaN
5 a spoken German NaN
1 a written German 4.0
7 b spoken English 8.0
6 b written English 6.0
8 b spoken French NaN
9 b written French NaN
10 b spoken German NaN
11 b written German NaN
14 c spoken English NaN
12 c written English 1.0
13 c spoken French 3.0
15 c written French NaN
16 c spoken German NaN
17 c written German NaN
21 d spoken English NaN
18 d written English 5.0
22 d spoken French NaN
20 d written French 2.0
23 d spoken German NaN
19 d written German 2.0
CodePudding user response:
IIUC, you can use complete from pyjanitor to expose the missing combinations:
# pip install pyjanitor
import janitor
import pandas as pd
df.complete('candidate', 'type', 'language')
candidate type language skill
0 a spoken English 5.0
1 a spoken German NaN
2 a spoken French 4.0
3 a written English NaN
4 a written German 4.0
5 a written French NaN
6 b spoken English 8.0
7 b spoken German NaN
8 b spoken French NaN
9 b written English 6.0
10 b written German NaN
11 b written French NaN
12 c spoken English NaN
13 c spoken German NaN
14 c spoken French 3.0
15 c written English 1.0
16 c written German NaN
17 c written French NaN
18 d spoken English NaN
19 d spoken German NaN
20 d spoken French NaN
21 d written English 5.0
22 d written German 2.0
23 d written French 2.0
for your use case, it is not necessary - you can use the languages dataframe - pass it as a dictionary instead:
languages = {'language': ['English', 'English', 'French',
'French', 'German', 'German'],
'type': ['spoken', 'written', 'spoken',
'written', 'spoken', 'written']}
df.complete('candidate', languages)
candidate type language skill
0 a spoken English 5.0
1 a written English NaN
2 a spoken French 4.0
3 a written French NaN
4 a spoken German NaN
5 a written German 4.0
6 b spoken English 8.0
7 b written English 6.0
8 b spoken French NaN
9 b written French NaN
10 b spoken German NaN
11 b written German NaN
12 c spoken English NaN
13 c written English 1.0
14 c spoken French 3.0
15 c written French NaN
16 c spoken German NaN
17 c written German NaN
18 d spoken English NaN
19 d written English 5.0
20 d spoken French NaN
21 d written French 2.0
22 d spoken German NaN
23 d written German 2.0
If you are not keen on importing another library, you can do this within Pandas, in an equally efficient way :
index = (pd.MultiIndex
.from_product(
[df.candidate.unique(),
df['type'].unique(),
df['language'].unique()],
names = ['candidate', 'type', 'language']
))
index = pd.DataFrame([], index = index)
index.merge(df, how = 'outer', on = index.index.names)
candidate type language skill
0 a spoken English 5.0
1 a spoken German NaN
2 a spoken French 4.0
3 a written English NaN
4 a written German 4.0
5 a written French NaN
6 b spoken English 8.0
7 b spoken German NaN
8 b spoken French NaN
9 b written English 6.0
10 b written German NaN
11 b written French NaN
12 c spoken English NaN
13 c spoken German NaN
14 c spoken French 3.0
15 c written English 1.0
16 c written German NaN
17 c written French NaN
18 d spoken English NaN
19 d spoken German NaN
20 d spoken French NaN
21 d written English 5.0
22 d written German 2.0
23 d written French 2.0