For example - i have smth like this
all info in 1 column
header | Company | price | data | |
---|---|---|---|---|
1 | done | |||
2 | nan | |||
3 | comp1 | |||
4 | nan | |||
5 | price:222 | |||
6 | result:ok, 22.05.19 | |||
7 | nan | |||
8 | done | |||
9 | nan | |||
10 | comp2 | |||
11 | nan | |||
12 | price:2895 | |||
13 | result:no, 15.05.19 | |||
14 | nan | |||
15 | done | |||
16 | and ctr |
i try to find ot how to creat a function that will put info in other columns the right result header 1
num | Company | price | result | date |
---|---|---|---|---|
1 | Comp1 | 222 | ok | 22.05.19 |
2 | Comp2 | 2895 | no | 15.05.19 |
etc |
CodePudding user response:
The best idea is likely to import the data correctly from the beginning.
That said, and mostly for fun, you can fix this format using a regex and reshaping:
# remove NaNs, and "done" rows
s = df['header'].loc[df['header'].ne('done')].dropna()
# extract the Price/Result/Date
pattern = r'price:(?P<Price>\d )|result:(?P<Result>\w ), (?P<Date>[\d.] )'
df2 = s.str.extract(pattern)
# identify potential company names
m = df2.isna().all(axis=1)
# reshape
out = (df2
.assign(Company=s.where(m))
.set_index(m.cumsum(), append=True)
.droplevel(0)
.stack().unstack()
.dropna(subset=['Price', 'Result', 'Date'], how='all')
)
Output:
Price Result Date Company
1 222 ok 22.05.19 comp1
2 2895 no 15.05.19 comp2
CodePudding user response:
This is answer made with mozway's good idea.
Example
data = ['done', None, 'comp1', None, 'price:222', 'result:ok, 22.05.19', None,
'done', None, 'comp2', None, 'price:2895','result:no, 15.05.19', None, 'done']
df = pd.DataFrame(data, columns=['header'])
df
header
0 done
1 None
2 comp1
3 None
4 price:222
5 result:ok, 22.05.19
6 None
7 done
8 None
9 comp2
10 None
11 price:2895
12 result:no, 15.05.19
13 None
14 done
Code
pattern = r'(?P<Company>comp\d)|price:(?P<Price>\d )|result:(?P<Result>\w ), (?P<Date>[\d.] )'
grouper = df['header'].str.contains('^comp').cumsum()
out = (df['header'].str.extract(pattern).groupby(grouper).first()
.dropna(subset=['Company']).set_index('Company'))
out
Price Result Date
Company
comp1 222 ok 22.05.19
comp2 2895 no 15.05.19
CodePudding user response:
I can see the company, status, dates position have pattern with price. If it the same with your actual data, you can do as below:
Input:
df = pd.DataFrame({'header':['done', None, 'comp1', None, 'price:222', 'result:ok, 22.05.19', None, 'Done', None,
'comp2', None, 'price:2895', 'result:no, 15.05.19', None, 'Done', 'and ctr'],'Company':None, 'price':None, 'data':None})
Code:
ls=list(df.header.str.split(':'))
pd.DataFrame([{'Company':ls[i-2][0], 'price': v[1], 'result': ls[i 1][1].split(',')[0], 'date':ls[i 1][1].split(',')[1]}
for i, v in enumerate(ls) if 'price'in str(v)])
Output:
Company price result date
0 comp1 222 ok 22.05.19
1 comp2 2895 no 15.05.19