Home > Back-end >  pandas manipulations with columns
pandas manipulations with columns

Time:05-11

I was trying to parse some important data from html tables from database links using pandas and ran into one problem. Code:

import pandas as pd

df_list = pd.read_html('iss1.html', match='Supplier ID')
df_list2 = pd.read_html('iss2.html', match='Attachments:')
df_list3 = pd.read_html('P.html', match='AWS-HO')

df = pd.concat(df_list, axis=1)
df2 = pd.concat(df_list2, axis=1)
df3 = pd.concat(df_list3, axis=1)
df3 = df3.iloc[:, ::-1]

df_rev = df.iloc[:, ::-1]
df2 = df2.iloc[:, ::-1]
df_rev.columns = df_rev.iloc[0]

lc = df_rev[["Code"]]
lc = pd.DataFrame({"Code": df_rev["Code"].values.T.ravel(),})
lc = lc[lc['Code'] != 'SSB tracking']
lc = lc[lc['Code'] != 'USB'] 
lc = lc[lc['Code'] != 'Review']
lc = lc[lc['Code'] != '( Review )'] 


sup = df_rev[["ID"]]
sup = pd.DataFrame({"ID": df_rev["ID"].values.T.ravel(),})
sup = sup[sup['ID'] != 'SupID']

lc_sup = pd.concat([lc, sup], axis=1) # 'group' column
lc_sup['group'] = lc_sup['Lang Code'].isna().cumsum()
lc_sup = lc_sup.sort_values(['group', 'Lang Code'], ascending=True)  # 'group' column
lc_sup = lc_sup[lc_sup['Lang Code'].notna()]   # 'group' column

ids_cons = pd.concat([ids_cons, lc_sup], axis=1)

This is 'ids_cons' DF. I created the "group" column because of problems with NaNs and to sort the values. (lc_sup DF in code above)

The range of each project is individual and specialized according to the "group" column. Each identical digit refers to separate projects. In my example, there are 4 projects together.

Code    Supplier ID group
1   d   C0003   0
2   e   R9996   0
3   f   O0001   0
4   j   MT0021  0
5   k   DY0001  0
6   p   B0114   0
7   z   J0002   0
57  d   T0096   48
58  e   T0015   48
59  f   R0167   48
60  i   G0004   48
61  j   T0021   48
62  k   A0003   48
63  p   S0035   48
64  z   F0006   48
65  z   C0002   48
113 j   R0009   94
114 z   A0013   94
169 e   O0001   147
170 z   A0013   147
281 d   C0003   254
282 e   O0001   254
283 f   N0183   254
284 i   O0001   254

So what I want to do now is add a project name for each of the 4 projects. I have a separate DF(just a list) with project names that are grouped into one column. The problem is that required project name appears only 1time here and I need to add it to every project by 'group' column.

Previous example added Project names DF:

Code    Supplier ID group  Project name
1   d   C0003   0          E01
2   e   R9996   0          E02
3   f   O0001   0          E03
4   j   MT0021  0          E04
5   k   DY0001  0          E05
6   p   B0114   0
7   z   J0002   0
57  d   T0096   48
58  e   T0015   48
59  f   R0167   48
60  i   G0004   48
61  j   T0021   48
62  k   A0003   48
63  p   S0035   48
64  z   F0006   48
65  z   C0002   48
113 j   R0009   94
114 z   A0013   94
169 e   O0001   147
170 z   A0013   147
281 d   C0003   254
282 e   O0001   254
283 f   N0183   254
284 i   O0001   254

And this the result that I want:

Code    Supplier ID group  Project name
1   d   C0003   0          E01
2   e   R9996   0          E01
3   f   O0001   0          E01
4   j   MT0021  0          E01
5   k   DY0001  0          E01
6   p   B0114   0          E01
7   z   J0002   0          E01 
57  d   T0096   48         E02
58  e   T0015   48         E02
59  f   R0167   48         E02
60  i   G0004   48         E02
61  j   T0021   48         E02
62  k   A0003   48         E02
63  p   S0035   48         E02
64  z   F0006   48         E02
65  z   C0002   48         E02
113 j   R0009   94         E03
114 z   A0013   94         E03
169 e   O0001   147        E04
170 z   A0013   147        E04        
281 d   C0003   254        E05
282 e   O0001   254        E05
283 f   N0183   254        E05
284 i   O0001   254        E05

CodePudding user response:

IIUC, you can simply use groupby to get the group number (ngroup) and map to the project name:

ids_cons["Project name"] = ids_cons.groupby("group").ngroup().map(projects["Project name"])

>>> ids_cons
    Code Supplier      ID  group Project name
0      1        d   C0003      0          E01
1      2        e   R9996      0          E01
2      3        f   O0001      0          E01
3      4        j  MT0021      0          E01
4      5        k  DY0001      0          E01
5      6        p   B0114      0          E01
6      7        z   J0002      0          E01
7     57        d   T0096     48          E02
8     58        e   T0015     48          E02
9     59        f   R0167     48          E02
10    60        i   G0004     48          E02
11    61        j   T0021     48          E02
12    62        k   A0003     48          E02
13    63        p   S0035     48          E02
14    64        z   F0006     48          E02
15    65        z   C0002     48          E02
16   113        j   R0009     94          E03
17   114        z   A0013     94          E03
18   169        e   O0001    147          E04
19   170        z   A0013    147          E04
20   281        d   C0003    254          E05
21   282        e   O0001    254          E05
22   283        f   N0183    254          E05
23   284        i   O0001    254          E05
Inputs:
projects = pd.DataFrame({"Project name": ["E01","E02","E03","E04","E05"]})

>>> projects
  Project name
0          E01
1          E02
2          E03
3          E04
4          E05
  • Related