I have a data column containg the values as shown in the image and i want to split them into different columns . Data type of column B is object/string
Column A | Column B |
---|---|
12 | ("apple",123,"six",03-10-2020) |
10 | ("banana","two") |
5 | ("mango",156,02-10-2020) |
and i want to split it into different columns
CodePudding user response:
@nik I see one additional challenge here. From you data frame, I see no pattern in you "Column B". So I suggest you to write a function to arrange the tuples in a standard format as show below (For help check this link)
Old Col B | New Col B |
---|---|
("banana","two") | ("banana", None,"two", None) |
("mango",156,02-10-2020) | ("mango",156, None,02-10-2020) |
After your data is in acceptable standard format, you use pandas inbuilt function to split your data. Example is below
df['New Col B'].str.split(',', expand=True)
0 | 1 | 2 | 3 |
---|---|---|---|
"banana" | None | "two" | None |
"mango" | 156 | None | 02-10-2020 |
CodePudding user response:
It would be easy if Column B has fixed four values for 'fruit', 'code', 'quantity', and 'date'. However, the number of values are not fixed here, which makes hard to solve. I assume that there are three patterns as you exampled and solve the problem, as follows:
import pandas as pd
df = pd.DataFrame({
'Column A': [12, 10, 5],
'Column B': ['("apple",123,"six",03-10-2020)', '("banana","two")', '("mango",156,02-10-2020)']
})
fruits = []
codes = []
quantities = []
dates = []
for i, row in df.iterrows():
values = row['Column B'][1:-1].split(',')
if len(values) == 1:
fruit, code, quantity, date = values[0], None, None, None
elif len(values) == 2:
fruit, code, quantity, date = values[0], None, values[1], None
elif len(values) == 3:
fruit, code, quantity, date = values[0], values[1], None, values[2]
elif len(values) == 4:
fruit, code, quantity, date = values[0], values[1], values[2], values[3]
else:
pass
fruits.append(fruit)
codes.append(code)
quantities.append(quantity)
dates.append(date)
df['fruit'] = pd.Series(fruits)
df['codes'] = pd.Series(codes)
df['quantities'] = pd.Series(quantities)
df['dates'] = pd.Series(dates)
del df['Column B']
print(df)
# Column A fruit codes quantities dates
#0 12 "apple" 123 "six" 03-10-2020
#1 10 "banana" None "two" None
#2 5 "mango" 156 None 02-10-2020
I assums that if Column B has three values, it would be 'fruit', 'code', and 'date' (for example, "mango",156,02-10-2020), not 'fruit', 'quantity', and 'date'. You might want to modify my code if your pattern is more varied.