I was trying to parse some important data from html tables from database links using pandas and ran into one problem. Code:
import pandas as pd
df_list = pd.read_html('iss1.html', match='Supplier ID')
df_list2 = pd.read_html('iss2.html', match='Attachments:')
df_list3 = pd.read_html('P.html', match='AWS-HO')
df = pd.concat(df_list, axis=1)
df2 = pd.concat(df_list2, axis=1)
df3 = pd.concat(df_list3, axis=1)
df3 = df3.iloc[:, ::-1]
df_rev = df.iloc[:, ::-1]
df2 = df2.iloc[:, ::-1]
df_rev.columns = df_rev.iloc[0]
lc = df_rev[["Code"]]
lc = pd.DataFrame({"Code": df_rev["Code"].values.T.ravel(),})
lc = lc[lc['Code'] != 'SSB tracking']
lc = lc[lc['Code'] != 'USB']
lc = lc[lc['Code'] != 'Review']
lc = lc[lc['Code'] != '( Review )']
sup = df_rev[["ID"]]
sup = pd.DataFrame({"ID": df_rev["ID"].values.T.ravel(),})
sup = sup[sup['ID'] != 'SupID']
lc_sup = pd.concat([lc, sup], axis=1) # 'group' column
lc_sup['group'] = lc_sup['Lang Code'].isna().cumsum()
lc_sup = lc_sup.sort_values(['group', 'Lang Code'], ascending=True) # 'group' column
lc_sup = lc_sup[lc_sup['Lang Code'].notna()] # 'group' column
ids_cons = pd.concat([ids_cons, lc_sup], axis=1)
This is 'ids_cons' DF. I created the "group" column because of problems with NaNs and to sort the values. (lc_sup DF in code above)
The range of each project is individual and specialized according to the "group" column. Each identical digit refers to separate projects. In my example, there are 4 projects together.
Code Supplier ID group
1 d C0003 0
2 e R9996 0
3 f O0001 0
4 j MT0021 0
5 k DY0001 0
6 p B0114 0
7 z J0002 0
57 d T0096 48
58 e T0015 48
59 f R0167 48
60 i G0004 48
61 j T0021 48
62 k A0003 48
63 p S0035 48
64 z F0006 48
65 z C0002 48
113 j R0009 94
114 z A0013 94
169 e O0001 147
170 z A0013 147
281 d C0003 254
282 e O0001 254
283 f N0183 254
284 i O0001 254
So what I want to do now is add a project name for each of the 4 projects. I have a separate DF(just a list) with project names that are grouped into one column. The problem is that required project name appears only 1time here and I need to add it to every project by 'group' column.
Previous example added Project names DF:
Code Supplier ID group Project name
1 d C0003 0 E01
2 e R9996 0 E02
3 f O0001 0 E03
4 j MT0021 0 E04
5 k DY0001 0 E05
6 p B0114 0
7 z J0002 0
57 d T0096 48
58 e T0015 48
59 f R0167 48
60 i G0004 48
61 j T0021 48
62 k A0003 48
63 p S0035 48
64 z F0006 48
65 z C0002 48
113 j R0009 94
114 z A0013 94
169 e O0001 147
170 z A0013 147
281 d C0003 254
282 e O0001 254
283 f N0183 254
284 i O0001 254
And this the result that I want:
Code Supplier ID group Project name
1 d C0003 0 E01
2 e R9996 0 E01
3 f O0001 0 E01
4 j MT0021 0 E01
5 k DY0001 0 E01
6 p B0114 0 E01
7 z J0002 0 E01
57 d T0096 48 E02
58 e T0015 48 E02
59 f R0167 48 E02
60 i G0004 48 E02
61 j T0021 48 E02
62 k A0003 48 E02
63 p S0035 48 E02
64 z F0006 48 E02
65 z C0002 48 E02
113 j R0009 94 E03
114 z A0013 94 E03
169 e O0001 147 E04
170 z A0013 147 E04
281 d C0003 254 E05
282 e O0001 254 E05
283 f N0183 254 E05
284 i O0001 254 E05
CodePudding user response:
IIUC, you can simply use groupby
to get the group number (ngroup
) and map
to the project name:
ids_cons["Project name"] = ids_cons.groupby("group").ngroup().map(projects["Project name"])
>>> ids_cons
Code Supplier ID group Project name
0 1 d C0003 0 E01
1 2 e R9996 0 E01
2 3 f O0001 0 E01
3 4 j MT0021 0 E01
4 5 k DY0001 0 E01
5 6 p B0114 0 E01
6 7 z J0002 0 E01
7 57 d T0096 48 E02
8 58 e T0015 48 E02
9 59 f R0167 48 E02
10 60 i G0004 48 E02
11 61 j T0021 48 E02
12 62 k A0003 48 E02
13 63 p S0035 48 E02
14 64 z F0006 48 E02
15 65 z C0002 48 E02
16 113 j R0009 94 E03
17 114 z A0013 94 E03
18 169 e O0001 147 E04
19 170 z A0013 147 E04
20 281 d C0003 254 E05
21 282 e O0001 254 E05
22 283 f N0183 254 E05
23 284 i O0001 254 E05
Inputs:
projects = pd.DataFrame({"Project name": ["E01","E02","E03","E04","E05"]})
>>> projects
Project name
0 E01
1 E02
2 E03
3 E04
4 E05