Home > other >  How i can create a function which analize text in rows in 1 columns and rewrite in others
How i can create a function which analize text in rows in 1 columns and rewrite in others

Time:12-17

For example - i have smth like this

all info in 1 column

header Company price data
1 done
2 nan
3 comp1
4 nan
5 price:222
6 result:ok, 22.05.19
7 nan
8 done
9 nan
10 comp2
11 nan
12 price:2895
13 result:no, 15.05.19
14 nan
15 done
16 and ctr

i try to find ot how to creat a function that will put info in other columns the right result header 1

num Company price result date
1 Comp1 222 ok 22.05.19
2 Comp2 2895 no 15.05.19
etc

CodePudding user response:

The best idea is likely to import the data correctly from the beginning.

That said, and mostly for fun, you can fix this format using a regex and reshaping:

# remove NaNs, and "done" rows
s = df['header'].loc[df['header'].ne('done')].dropna()

# extract the Price/Result/Date
pattern = r'price:(?P<Price>\d )|result:(?P<Result>\w ), (?P<Date>[\d.] )'
df2 = s.str.extract(pattern)

# identify potential company names
m = df2.isna().all(axis=1)

# reshape
out = (df2
    .assign(Company=s.where(m))
    .set_index(m.cumsum(), append=True)
    .droplevel(0)
    .stack().unstack()
    .dropna(subset=['Price', 'Result', 'Date'], how='all')
)

Output:

  Price Result      Date Company
1   222     ok  22.05.19   comp1
2  2895     no  15.05.19   comp2

CodePudding user response:

This is answer made with mozway's good idea.

Example

data = ['done', None, 'comp1', None, 'price:222', 'result:ok, 22.05.19', None, 
        'done', None, 'comp2', None, 'price:2895','result:no, 15.05.19', None, 'done']
df = pd.DataFrame(data, columns=['header'])

df

    header
0   done
1   None
2   comp1
3   None
4   price:222
5   result:ok, 22.05.19
6   None
7   done
8   None
9   comp2
10  None
11  price:2895
12  result:no, 15.05.19
13  None
14  done

Code

pattern = r'(?P<Company>comp\d)|price:(?P<Price>\d )|result:(?P<Result>\w ), (?P<Date>[\d.] )'
grouper = df['header'].str.contains('^comp').cumsum()
out = (df['header'].str.extract(pattern).groupby(grouper).first()
       .dropna(subset=['Company']).set_index('Company'))

out

        Price   Result  Date
Company         
comp1   222     ok      22.05.19
comp2   2895    no      15.05.19

CodePudding user response:

I can see the company, status, dates position have pattern with price. If it the same with your actual data, you can do as below:

Input:

df = pd.DataFrame({'header':['done', None, 'comp1', None, 'price:222', 'result:ok, 22.05.19', None, 'Done', None,
                           'comp2', None, 'price:2895',  'result:no, 15.05.19', None, 'Done', 'and ctr'],'Company':None, 'price':None, 'data':None})

Code:

ls=list(df.header.str.split(':')) 
pd.DataFrame([{'Company':ls[i-2][0], 'price': v[1], 'result': ls[i 1][1].split(',')[0], 'date':ls[i 1][1].split(',')[1]} 
                                   for i, v in enumerate(ls) if 'price'in str(v)])

Output:

   Company  price   result  date
0   comp1   222     ok      22.05.19
1   comp2   2895    no      15.05.19
  • Related