Home > Mobile >  Extending a pandas df to a version that contains all possible combinations of other df columns by gr
Extending a pandas df to a version that contains all possible combinations of other df columns by gr

Time:09-18

Imagine I have a dataframe that contains a candidate and his skills in varrious languages both written and spoken:

df = pd.DataFrame({'candidate': ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd', 'd'],
                   'type': ['spoken', 'written', 'spoken', 'written', 'spoken', 'written', 'spoken', 'written', 'written', 'written'],
                   'language': ['English', 'German', 'French', 'English', 'English', 'English', 'French', 'English', 'German', 'French'],
                   'skill': [5, 4, 4, 6, 8, 1, 3, 5, 2, 2]})

result:

candidate   type    language    skill
a           spoken  English     5
a           written German      4
a           spoken  French      4
b           written English     6
b           spoken  English     8
c           written English     1
c           spoken  French      3
d           written English     5
d           written German      2
d           written French      2

and another df with languages:

languages = pd.DataFrame({'language': ['English', 'English', 'French', 'French', 'German', 'German'],
                          'type': ['spoken', 'written', 'spoken', 'written', 'spoken', 'written']})

result:

    language    type
0   English spoken
1   English written
2   French  spoken
3   French  written
4   German  spoken
5   German  written

What I need to get is a dataframe that combines df and all possible combinations of its merge with languages, so:

candidate   type    language    skill
a           spoken  English     5
a           written English     NA
a           spoken  German      NA
a           written German      4
a           spoken  French      4
a           written French      NA
b           spoken  English     8
b           written English     6
b           spoken  French      NA
b           written French      NA
...
d           spoken  English     NA
d           written English     5
d           spoken  French      NA
d           written French      2
d           spoken  German      NA
d           written German      2

and so on. I was trying to add a 'valid' column filled with 'valid' value and use all kind of merges on these dataframes but it always returns only df. Is there any swift way to cope with it in pandas?

CodePudding user response:

Try:

def fn(x):
    x = x.merge(languages, how="outer")
    x["candidate"] = x["candidate"].ffill().bfill()
    return x


df = (
    df.groupby("candidate")
    .apply(fn)
    .reset_index(drop=True)
    .sort_values(["candidate", "language", "type"])
)
print(df)

Prints:

   candidate     type language  skill
0          a   spoken  English    5.0
3          a  written  English    NaN
2          a   spoken   French    4.0
4          a  written   French    NaN
5          a   spoken   German    NaN
1          a  written   German    4.0
7          b   spoken  English    8.0
6          b  written  English    6.0
8          b   spoken   French    NaN
9          b  written   French    NaN
10         b   spoken   German    NaN
11         b  written   German    NaN
14         c   spoken  English    NaN
12         c  written  English    1.0
13         c   spoken   French    3.0
15         c  written   French    NaN
16         c   spoken   German    NaN
17         c  written   German    NaN
21         d   spoken  English    NaN
18         d  written  English    5.0
22         d   spoken   French    NaN
20         d  written   French    2.0
23         d   spoken   German    NaN
19         d  written   German    2.0

CodePudding user response:

IIUC, you can use complete from pyjanitor to expose the missing combinations:

# pip install pyjanitor
import janitor
import pandas as pd

df.complete('candidate', 'type', 'language')
   candidate     type language  skill
0          a   spoken  English    5.0
1          a   spoken   German    NaN
2          a   spoken   French    4.0
3          a  written  English    NaN
4          a  written   German    4.0
5          a  written   French    NaN
6          b   spoken  English    8.0
7          b   spoken   German    NaN
8          b   spoken   French    NaN
9          b  written  English    6.0
10         b  written   German    NaN
11         b  written   French    NaN
12         c   spoken  English    NaN
13         c   spoken   German    NaN
14         c   spoken   French    3.0
15         c  written  English    1.0
16         c  written   German    NaN
17         c  written   French    NaN
18         d   spoken  English    NaN
19         d   spoken   German    NaN
20         d   spoken   French    NaN
21         d  written  English    5.0
22         d  written   German    2.0
23         d  written   French    2.0

for your use case, it is not necessary - you can use the languages dataframe - pass it as a dictionary instead:

languages = {'language': ['English', 'English', 'French', 
                          'French', 'German', 'German'], 
             'type': ['spoken', 'written', 'spoken', 
                      'written', 'spoken', 'written']}
df.complete('candidate', languages)
   candidate     type language  skill
0          a   spoken  English    5.0
1          a  written  English    NaN
2          a   spoken   French    4.0
3          a  written   French    NaN
4          a   spoken   German    NaN
5          a  written   German    4.0
6          b   spoken  English    8.0
7          b  written  English    6.0
8          b   spoken   French    NaN
9          b  written   French    NaN
10         b   spoken   German    NaN
11         b  written   German    NaN
12         c   spoken  English    NaN
13         c  written  English    1.0
14         c   spoken   French    3.0
15         c  written   French    NaN
16         c   spoken   German    NaN
17         c  written   German    NaN
18         d   spoken  English    NaN
19         d  written  English    5.0
20         d   spoken   French    NaN
21         d  written   French    2.0
22         d   spoken   German    NaN
23         d  written   German    2.0

If you are not keen on importing another library, you can do this within Pandas, in an equally efficient way :

index = (pd.MultiIndex
           .from_product(
                [df.candidate.unique(), 
                 df['type'].unique(), 
                 df['language'].unique()], 
                names = ['candidate', 'type', 'language']
         ))
index = pd.DataFrame([], index = index)
index.merge(df, how = 'outer', on = index.index.names)
   candidate     type language  skill
0          a   spoken  English    5.0
1          a   spoken   German    NaN
2          a   spoken   French    4.0
3          a  written  English    NaN
4          a  written   German    4.0
5          a  written   French    NaN
6          b   spoken  English    8.0
7          b   spoken   German    NaN
8          b   spoken   French    NaN
9          b  written  English    6.0
10         b  written   German    NaN
11         b  written   French    NaN
12         c   spoken  English    NaN
13         c   spoken   German    NaN
14         c   spoken   French    3.0
15         c  written  English    1.0
16         c  written   German    NaN
17         c  written   French    NaN
18         d   spoken  English    NaN
19         d   spoken   German    NaN
20         d   spoken   French    NaN
21         d  written  English    5.0
22         d  written   German    2.0
23         d  written   French    2.0
  • Related