How to create a multilabel dataset from 3 dataframes

I have 3 dataframes and they have only one column: text

df1

text
I have a car
he has a bus

df1.shape = (10000,1)

df2

text
He likes orange
She ate the banana

df2.shape = (10000,1)

df3

text
Microsoft is a TI company
SpaceX is a Aerospacial company

df3.shape = (10000,1)

I want to create another dataframe, merging df1, df2 and df3 to get this as output:

text                               vehicle      fruits     companys
I have a car                          1           0           0
he has a bus                          1           0           0
He likes orange                       0           1           0
She ate the banana                    0           1           0
Microsoft is a TI company             0           0           1
SpaceX is a Aerospacial company       0           0           1

output.shape = (30000,4)

How can I do this?

CodePudding user response：

There are various ways to achieve what OP desired.

IMO what OP wants is to check whether the strings contain vehicles, fruits, or companies.

To do that, one first needs to define what counts as a vehicle, a fruit, or a company. One can create a list for each (the lists below are only a starting point and can be extended):

vehicles = ["car", "bus", "motorcycle", "airplane", "train", "boat", "ship", "helicopter", "submarine", "rocket", "spaceship"]
fruits = ["banana", "apple", "orange", "grape", "strawberry", "watermelon", "cherry", "peach", "pear", "mango", "pineapple"]
companies = ["Microsoft", "Apple", "Google", "Amazon", "Facebook", "Tesla", "SpaceX", "Boeing", "Airbus", "Lockheed", "NASA"]

Now, with the lists, one can merge the dataframes with pandas.concat

import pandas as pd

df_merge = pd.concat([df1, df2, df3], axis=0, ignore_index=True)

[Out]:
                              text
0                     I have a car
1                     he has a bus
2                  He likes orange
3               She ate the banana
4        Microsoft is a TI company
5  SpaceX is a Aerospacial company

And now, with the merged dataframe, one can check if the values in the lists above are present in the rows.

We start with the vehicles

df_merge['vehicles'] = df_merge['text'].apply(lambda x: sum([x.count(i) for i in vehicles]))

[Out]:
                              text  vehicles
0                     I have a car         1
1                     he has a bus         1
2                  He likes orange         0
3               She ate the banana         0
4        Microsoft is a TI company         0
5  SpaceX is a Aerospacial company         0

Now we move to fruits

df_merge['fruits'] = df_merge['text'].apply(lambda x: sum([x.count(i) for i in fruits]))

[Out]:

                              text  vehicles  fruits
0                     I have a car         1       0
1                     he has a bus         1       0
2                  He likes orange         0       1
3               She ate the banana         0       1
4        Microsoft is a TI company         0       0
5  SpaceX is a Aerospacial company         0       0

Finally, we do it for companies

df_merge['companies'] = df_merge['text'].apply(lambda x: sum([x.count(i) for i in companies]))

[Out]:

                              text  vehicles  fruits  companies
0                     I have a car         1       0          0
1                     he has a bus         1       0          0
2                  He likes orange         0       1          0
3               She ate the banana         0       1          0
4        Microsoft is a TI company         0       0          1
5  SpaceX is a Aerospacial company         0       0          1
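
The three steps above can also be written as a single loop. The following is only a sketch under a few assumptions that are not part of the original answer: the keyword lists are shortened, the `keywords` dict and the extra row "He was scared" are made up for illustration, and matching is done on whole words, ignoring case, via `Series.str.contains`. Unlike `x.count`, which sums occurrences and also matches substrings (e.g. "car" inside "scared"), this yields strict 0/1 flags.

```python
import re
import pandas as pd

# Hypothetical, shortened keyword lists (one entry per label column)
keywords = {
    "vehicles": ["car", "bus", "train"],
    "fruits": ["banana", "orange", "apple"],
    "companies": ["Microsoft", "SpaceX", "Google"],
}

df_merge = pd.DataFrame({"text": [
    "I have a car",
    "he has a bus",
    "He likes orange",
    "She ate the banana",
    "Microsoft is a TI company",
    "SpaceX is a Aerospacial company",
    "He was scared",  # illustrative row: contains "car" only as a substring
]})

for label, words in keywords.items():
    # \b...\b restricts matches to whole words; re.escape guards special characters
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"
    # str.contains returns booleans; cast to int to get 0/1 labels
    df_merge[label] = df_merge["text"].str.contains(pattern, case=False, regex=True).astype(int)

print(df_merge)
```

Whether case-insensitive matching is appropriate depends on the data: with it, "apple" the fruit and "Apple" the company can no longer be told apart.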

Notes:

Although out of scope for this example, this approach has at least one limitation: if a string mentions an orange vehicle, for example "She has an orange bus", it will detect both a vehicle and a fruit. Handling such cases requires additional logic on top of this approach.
Other edge cases can occur as well, but covering them all would require access to the full dataframe.
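
To get a feeling for how often the first limitation occurs, one could list the rows where more than one label fires and inspect them by hand. A minimal sketch, assuming a labelled dataframe with the three label columns from above (the `labelled` frame and its rows are made up for illustration):

```python
import pandas as pd

# Hypothetical labelled frame, as the keyword matching above would produce it
labelled = pd.DataFrame({
    "text": ["he has a bus", "He likes orange", "She has an orange bus"],
    "vehicles": [1, 0, 1],
    "fruits": [0, 1, 1],
    "companies": [0, 0, 0],
})

label_cols = ["vehicles", "fruits", "companies"]

# Rows whose label columns sum to more than 1 matched several categories
ambiguous = labelled[labelled[label_cols].sum(axis=1) > 1]
print(ambiguous)
```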

CodePudding user response：

Try this

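import pandas as pd

# Tag each dataframe with the label of its source
df1 = pd.DataFrame({"text":["I have a car","he has a bus"]})
df1["vehicle"] = 1
df2 = pd.DataFrame({"text":["He likes orange","She ate the banana"]})
df2["fruits"] = 1
df3 = pd.DataFrame({"text":["Microsoft is a TI company","SpaceX is a Aerospacial company"]})
df3["companys"] = 1

# Stack the frames; label columns missing from a frame are filled with NaN
df4 = pd.concat([df1,df2,df3], ignore_index=True)

# Replace NaN with 0 and cast back to int (NaN turns the columns into floats)
label_cols = ["vehicle","fruits","companys"]
df4[label_cols] = df4[label_cols].fillna(0).astype(int)

Output of df4:

                              text  vehicle  fruits  companys
0                     I have a car        1       0         0
1                     he has a bus        1       0         0
2                  He likes orange        0       1         0
3               She ate the banana        0       1         0
4        Microsoft is a TI company        0       0         1
5  SpaceX is a Aerospacial company        0       0         1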
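
For the real dataframes from the question, the same idea (label each row by the dataframe it came from) can be wrapped in a small helper. This is only a sketch: the function name `label_by_source` and the dict layout are assumptions, and the two-row frames stand in for the 10000-row ones.

```python
import pandas as pd

def label_by_source(frames):
    """Stack dataframes and add one 0/1 column per source label.

    `frames` maps a label name to a dataframe that has a "text" column.
    """
    labels = list(frames)
    parts = []
    for label, frame in frames.items():
        part = frame[["text"]].copy()
        for other in labels:
            # 1 in the column of the frame's own label, 0 everywhere else
            part[other] = int(other == label)
        parts.append(part)
    return pd.concat(parts, ignore_index=True)

# Small stand-ins for df1, df2 and df3 from the question
df1 = pd.DataFrame({"text": ["I have a car", "he has a bus"]})
df2 = pd.DataFrame({"text": ["He likes orange", "She ate the banana"]})
df3 = pd.DataFrame({"text": ["Microsoft is a TI company", "SpaceX is a Aerospacial company"]})

output = label_by_source({"vehicle": df1, "fruits": df2, "companys": df3})
print(output)
```

Because the label columns are created as integers in every frame, no NaN values appear and no `fillna` step is needed. With three frames of 10000 rows each, the result has shape (30000, 4).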