I want to join these two dataframes:
df1 = pd.DataFrame({"text": ["example one", "example word one", "example two", "example sentance one"],
"label_1": ["O O", "O W O", "O O", "O S O"]})
df2 = pd.DataFrame({"text": ["example one", "example word one", "total example", "example sentance one"],
"label_2": ["O N", "O O N", "O O", "O O N"]})
text label_1
0 example one O O
1 example word one O W O
2 example two O O
3 example sentance one O S O
text label_2
0 example one O N
1 example word one O O N
2 total example O O
3 example sentance one O O N
My desired output is this:
text label_1 label_2
0 example one O O O N
1 example word one O W O O O N
2 example two O O NaN
3 example sentance one O S O O O N
5 total example NaN O O
So text
column should have unique values, and final dataframe should have label_1
and label_2
columns with matching values. NaN can be in the result dataframe, but the best thing would be to replace NaN with O
values that matches the number of words in text
column (for that row). So if that row has 3 words, NaN should be replaced with O O O
.
I have tried this, but I haven't got the desired output:
df = pd.concat([df1, df2])
text label_1 label_2
0 example one O O NaN
1 example word one O W O NaN
2 example two O O NaN
3 example sentance one O S O NaN
0 example one NaN O N
1 example word one NaN O O N
2 total example NaN O O
3 example sentance one NaN O O N
CodePudding user response:
Using a combination of DataFrame.merge
and DataFrame.fillna
df = df1.merge(df2, how="outer", on="text")
fill_series = df["text"].str.split().str.len().apply(lambda x: " ".join("0" * x))
fill_cols = ["label_1", "label_2"]
fill_dict = {k: fill_series for k in fill_cols}
df = df.fillna(fill_dict)
Outputs:
# text label_1 label_2
# 0 example one O O O N
# 1 example word one O W O O O N
# 2 example two O O 0 0
# 3 example sentance one O S O O O N
# 4 total example 0 0 O O
fill_series
creates a pd.Series
of strings to use for filling.
fill_cols
are the columns that need filling.
fill_dict
is the dictionary that applies the strings from the Series
to the DataFrame
.