How i can create a function which analize text in rows in 1 columns and rewrite in others-CodePudding

For example - i have smth like this

all info in 1 column

	header	Company	price	data
1	done
2	nan
3	comp1
4	nan
5	price:222
6	result:ok, 22.05.19
7	nan
8	done
9	nan
10	comp2
11	nan
12	price:2895
13	result:no, 15.05.19
14	nan
15	done
16	and ctr

i try to find ot how to creat a function that will put info in other columns the right result header 1

num	Company	price	result	date
1	Comp1	222	ok	22.05.19
2	Comp2	2895	no	15.05.19
etc

CodePudding user response：

The best idea is likely to import the data correctly from the beginning.

That said, and mostly for fun, you can fix this format using a regex and reshaping:

# remove NaNs, and "done" rows
s = df['header'].loc[df['header'].ne('done')].dropna()

# extract the Price/Result/Date
pattern = r'price:(?P<Price>\d )|result:(?P<Result>\w ), (?P<Date>[\d.] )'
df2 = s.str.extract(pattern)

# identify potential company names
m = df2.isna().all(axis=1)

# reshape
out = (df2
    .assign(Company=s.where(m))
    .set_index(m.cumsum(), append=True)
    .droplevel(0)
    .stack().unstack()
    .dropna(subset=['Price', 'Result', 'Date'], how='all')
)

Output:

  Price Result      Date Company
1   222     ok  22.05.19   comp1
2  2895     no  15.05.19   comp2

CodePudding user response：

This is answer made with mozway's good idea.

Example

data = ['done', None, 'comp1', None, 'price:222', 'result:ok, 22.05.19', None, 
        'done', None, 'comp2', None, 'price:2895','result:no, 15.05.19', None, 'done']
df = pd.DataFrame(data, columns=['header'])

df

    header
0   done
1   None
2   comp1
3   None
4   price:222
5   result:ok, 22.05.19
6   None
7   done
8   None
9   comp2
10  None
11  price:2895
12  result:no, 15.05.19
13  None
14  done

Code

pattern = r'(?P<Company>comp\d)|price:(?P<Price>\d )|result:(?P<Result>\w ), (?P<Date>[\d.] )'
grouper = df['header'].str.contains('^comp').cumsum()
out = (df['header'].str.extract(pattern).groupby(grouper).first()
       .dropna(subset=['Company']).set_index('Company'))

out

        Price   Result  Date
Company         
comp1   222     ok      22.05.19
comp2   2895    no      15.05.19

CodePudding user response：

I can see the company, status, dates position have pattern with price. If it the same with your actual data, you can do as below:

Input:

df = pd.DataFrame({'header':['done', None, 'comp1', None, 'price:222', 'result:ok, 22.05.19', None, 'Done', None,
                           'comp2', None, 'price:2895',  'result:no, 15.05.19', None, 'Done', 'and ctr'],'Company':None, 'price':None, 'data':None})

Code:

ls=list(df.header.str.split(':')) 
pd.DataFrame([{'Company':ls[i-2][0], 'price': v[1], 'result': ls[i 1][1].split(',')[0], 'date':ls[i 1][1].split(',')[1]} 
                                   for i, v in enumerate(ls) if 'price'in str(v)])

Output:

   Company  price   result  date
0   comp1   222     ok      22.05.19
1   comp2   2895    no      15.05.19