I am preparing a dataset for OpenAI model training. As an example, I have data in a CSV file in the format below:
df = DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3], 'new':['apple', 'banana', 'pear']})
df
bar foo new
0 1 a apple
1 2 b banana
2 3 c pear
I want to create a new column called 'prompt' that combines the values of the bar and foo columns, each prefixed with its column name:
bar foo new prompt
0 1 a apple foo: a, bar: 1
1 2 b banana foo: b, bar: 2
2 3 c pear foo: c, bar: 3
There is a similar example here, but it doesn't add the column names inside the combined column.
CodePudding user response:
df.apply is very popular but should be avoided whenever possible.
Instead use vectorized methods. Convert the relevant columns to dict and remove the quotes:
df["prompt"] = df[["foo", "bar"]].to_dict(orient="index")
df["prompt"] = df["prompt"].astype(str).replace(r"'", "", regex=True)
# foo bar new prompt
# 0 a 1 apple {foo: a, bar: 1}
# 1 b 2 banana {foo: b, bar: 2}
# 2 c 3 pear {foo: c, bar: 3}
Note that your comment included braces but your post did not. If you also want to remove the curly braces, add them to the regex:
df["prompt"] = df["prompt"].astype(str).replace(r"[{'}]", "", regex=True)
# foo bar new prompt
# 0 a 1 apple foo: a, bar: 1
# 1 b 2 banana foo: b, bar: 2
# 2 c 3 pear foo: c, bar: 3
Details
First, convert the relevant columns with to_dict, oriented by index:
df["prompt"] = df[["foo", "bar"]].to_dict(orient="index")
# foo bar new prompt
# 0 a 1 apple {'foo': 'a', 'bar': 1}
# 1 b 2 banana {'foo': 'b', 'bar': 2}
# 2 c 3 pear {'foo': 'c', 'bar': 3}
Then use astype to convert the column to str type, and replace to strip the dict symbols:
df["prompt"] = df["prompt"].astype(str).replace(r"[{'}]", "", regex=True)
# foo bar new prompt
# 0 a 1 apple foo: a, bar: 1
# 1 b 2 banana foo: b, bar: 2
# 2 c 3 pear foo: c, bar: 3
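As an alternative to the dict-and-regex route above, plain string concatenation on the columns also avoids df.apply. This is only a minimal sketch, assuming the same three-row frame from the question; note that, unlike the regex approach, it would not strip quotes or braces that happen to appear in the data itself.

```python
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame({
    "foo": ["a", "b", "c"],
    "bar": [1, 2, 3],
    "new": ["apple", "banana", "pear"],
})

# Cast each column to str, then concatenate element-wise with the labels.
df["prompt"] = "foo: " + df["foo"].astype(str) + ", bar: " + df["bar"].astype(str)

print(df["prompt"].tolist())
```

The astype(str) on bar is what allows the integer column to be joined with text.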
CodePudding user response:
import pandas as pd
df = pd.DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3], 'new':['apple', 'banana', 'pear']})
df['prompt'] = df.apply(lambda x: x.to_json(), axis=1)
df
Would this work? You get:
foo bar new prompt
0 a 1 apple {"foo":"a","bar":1,"new":"apple"}
1 b 2 banana {"foo":"b","bar":2,"new":"banana"}
2 c 3 pear {"foo":"c","bar":3,"new":"pear"}
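The output above includes the new column as well. If only foo and bar should appear in the prompt, one option is to select those columns before calling apply. A small sketch, assuming JSON-formatted prompts are acceptable; the json.loads call is only there to show that each prompt parses back into the original values.

```python
import json

import pandas as pd

df = pd.DataFrame({
    "foo": ["a", "b", "c"],
    "bar": [1, 2, 3],
    "new": ["apple", "banana", "pear"],
})

# Restrict to the wanted columns first, then serialise each row to JSON.
df["prompt"] = df[["foo", "bar"]].apply(lambda row: row.to_json(), axis=1)

# Round-trip the first prompt to check its contents.
first = json.loads(df["prompt"].iloc[0])
print(first)
```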
CodePudding user response:
Combine multiple cells in a row using apply and a lambda:
df['prompt']=df.apply(lambda k: 'foo: ' + k['foo'] + ', bar: ' + str(k['bar']), axis=1)
Would this work? You get:
foo bar new prompt
0 a 1 apple foo: a, bar: 1
1 b 2 banana foo: b, bar: 2
2 c 3 pear foo: c, bar: 3
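If the set of columns may change, the same "name: value" format can be built from a list of column names instead of hard-coding foo and bar. A rough sketch; build_prompts is a hypothetical helper name for illustration, not a pandas function.

```python
import pandas as pd

df = pd.DataFrame({
    "foo": ["a", "b", "c"],
    "bar": [1, 2, 3],
    "new": ["apple", "banana", "pear"],
})


def build_prompts(frame, cols):
    """Hypothetical helper: return one 'col: value, ...' string per row."""
    return [
        ", ".join(f"{col}: {val}" for col, val in zip(cols, row))
        for row in frame[cols].itertuples(index=False)
    ]


df["prompt"] = build_prompts(df, ["foo", "bar"])
print(df["prompt"].tolist())
```

Passing a different list, for example ["bar", "new"], changes which columns go into the prompt and in what order.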