Home > other >  Joining pandas dataframes with special condition
Joining pandas dataframes with special condition

Time:10-21

I want to join these two dataframes:

df1 = pd.DataFrame({"text": ["example one", "example word one", "example two", "example sentance one"],
                    "label_1": ["O O", "O W O", "O O", "O S O"]})

df2 = pd.DataFrame({"text": ["example one", "example word one", "total example", "example sentance one"],
                    "label_2": ["O N", "O O N", "O O", "O O N"]})


                   text label_1
0           example one     O O
1      example word one   O W O
2           example two     O O
3  example sentance one   O S O

                   text label_2
0           example one     O N
1      example word one   O O N
2         total example     O O
3  example sentance one   O O N

My desired output is this:

                   text label_1 label_2
0           example one     O O     O N
1      example word one   O W O   O O N
2           example two     O O     NaN
3  example sentance one   O S O   O O N
5         total example     NaN     O O

So text column should have unique values, and final dataframe should have label_1 and label_2 columns with matching values. NaN can be in the result dataframe, but the best thing would be to replace NaN with O values that matches the number of words in text column (for that row). So if that row has 3 words, NaN should be replaced with O O O.

I have tried this, but I haven't got the desired output:

df = pd.concat([df1, df2])

                   text label_1 label_2
0           example one     O O     NaN
1      example word one   O W O     NaN
2           example two     O O     NaN
3  example sentance one   O S O     NaN
0           example one     NaN     O N
1      example word one     NaN   O O N
2         total example     NaN     O O
3  example sentance one     NaN   O O N

CodePudding user response:

Using a combination of DataFrame.merge and DataFrame.fillna

df = df1.merge(df2, how="outer", on="text")

fill_series = df["text"].str.split().str.len().apply(lambda x: " ".join("0" * x))
fill_cols = ["label_1", "label_2"]
fill_dict = {k: fill_series for k in fill_cols}

df = df.fillna(fill_dict)

Outputs:

#                    text label_1 label_2
# 0           example one     O O     O N
# 1      example word one   O W O   O O N
# 2           example two     O O     0 0
# 3  example sentance one   O S O   O O N
# 4         total example     0 0     O O

fill_series creates a pd.Series of strings to use for filling.
fill_cols are the columns that need filling.
fill_dict is the dictionary that applies the strings from the Series to the DataFrame.

  • Related