I have a pandas DataFrame in which each cell holds text like this:
GRI 101: Foundation:
• Clause 1.1 (Stakeholder Inclusiveness principle)
• Clause 1.3 (Materiality principle)
• Clause 2.1 (Applying the Reporting Principles)
GRI 102: General Disclosures: Disclosures 102-40, 102-42, 102-43, and 102-44
By using:
df['column'].str.findall(r'\d+-\d+').str.join('\n')
I am able to get the values:
102-40
102-42
102-43
102-44
That is good, but in addition to those values I also need to extract the 101 and prefix it to each clause number that appears in the text, so that I end up with something like this:
102-40
102-42
102-43
102-44
101-1.1
101-1.3
101-2.1
I am not sure whether this is possible. I have a large number of cells like this, and I need these values as reference keys so that I can map one ESG standard onto another.
CodePudding user response:
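See if this works:
# Disclosure numbers such as 102-40 from the first row
Disclosures = df['0'].str.findall(r'\d+-\d+').str.join('\n')[0]
# The first GRI number (101), taken from the printed form of the Series
top_number = str(df['0'].str.findall(r'GRI \d+').str.join('')).split('GRI')[1].strip()
# Clause numbers such as 1.1 from the first row
clauses = str(df['0'].str.findall(r'\d+[.]\d+').str.join(' ')[0]).split(' ')
for c in clauses:
    print(top_number, '-', c, sep='')
print(Disclosures)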
Here is my full code example:
import pandas as pd

d = {'0': ["""GRI 101: Foundation:
• Clause 1.1 (Stakeholder Inclusiveness principle)
• Clause 1.3 (Materiality principle)
• Clause 2.1 (Applying the Reporting Principles)
GRI 102: General Disclosures: Disclosures 102-40, 102-42, 102-43, and 102-44""", "b"], '1': [3, 4], '2': [4, 5], '3': [7, 8], '4': [9, 10], '5': [12, 13], '6': [15, 17]}
df = pd.DataFrame(data=d)
# Disclosure numbers such as 102-40 from the first row
Disclosures = df['0'].str.findall(r'\d+-\d+').str.join('\n')[0]
# The first GRI number (101), taken from the printed form of the Series
top_number = str(df['0'].str.findall(r'GRI \d+').str.join('')).split('GRI')[1].strip()
# Clause numbers such as 1.1 from the first row
clauses = str(df['0'].str.findall(r'\d+[.]\d+').str.join(' ')[0]).split(' ')
for c in clauses:
    print(top_number, '-', c, sep='')
print(Disclosures)
output:
101-1.1
101-1.3
101-2.1
102-40
102-42
102-43
102-44
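The snippet above reads only row 0 and depends on the printed form of the whole Series. As a rough sketch of a row-wise variant, assuming the same column name '0' and that the clauses in a cell belong to the first "GRI <number>" found in it (the `refs` column name is made up for illustration), something like this might work:

```python
import pandas as pd

# Sample cell copied from the question; the second row has no references.
text = (
    "GRI 101: Foundation:\n"
    "• Clause 1.1 (Stakeholder Inclusiveness principle)\n"
    "• Clause 1.3 (Materiality principle)\n"
    "• Clause 2.1 (Applying the Reporting Principles)\n"
    "GRI 102: General Disclosures: Disclosures 102-40, 102-42, 102-43, and 102-44"
)
df = pd.DataFrame({"0": [text, "b"]})

# First GRI number per row (NaN when the cell has none).
standard = df["0"].str.extract(r"GRI (\d+)", expand=False)
# Lists of disclosure numbers and clause numbers per row.
disclosures = df["0"].str.findall(r"\d+-\d+")
clauses = df["0"].str.findall(r"Clause (\d+\.\d+)")

# 'refs' is a hypothetical column name: disclosures first, then prefixed clauses.
df["refs"] = [
    d + [f"{s}-{c}" for c in cl]
    for s, d, cl in zip(standard, disclosures, clauses)
]
print(df["refs"].tolist())
```

Rows without any match simply get an empty list, so no special handling is needed for cells like "b".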
CodePudding user response:
Try:
import re

def process(x):
    # First number in the cell, e.g. 101 from "GRI 101"
    first = re.search(r"\d+", x).group(0)
    # Disclosure numbers such as 102-40
    disclosures = re.findall(r"\d+-\d+", x)
    # Clause numbers such as 1.1
    clauses = re.findall(r"Clause (\d+\.\d+)", x)
    return "\n".join(
        ["\n".join(disclosures), "\n".join(f"{first}-{c}" for c in clauses)]
    )

df["result"] = df["column"].apply(process)
print(df.to_markdown())
Prints:
|    | column                                                                       | result  |
|---:|:-----------------------------------------------------------------------------|:--------|
|  0 | GRI 101: Foundation:                                                         | 102-40  |
|    | • Clause 1.1 (Stakeholder Inclusiveness principle)                           | 102-42  |
|    | • Clause 1.3 (Materiality principle)                                         | 102-43  |
|    | • Clause 2.1 (Applying the Reporting Principles)                             | 102-44  |
|    | GRI 102: General Disclosures: Disclosures 102-40, 102-42, 102-43, and 102-44 | 101-1.1 |
|    |                                                                              | 101-1.3 |
|    |                                                                              | 101-2.1 |
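If a cell can list clauses under more than one standard, prefixing every clause with the first number would be wrong. A possible sketch, assuming each standard starts on its own line with "GRI <number>", is to walk the cell line by line and remember the most recent header. The function name `refs_by_section` and the "GRI 999" section in the sample are made up for illustration:

```python
import re

def refs_by_section(text):
    """Collect references in document order, prefixing each clause
    with the most recent 'GRI <number>' header seen above it."""
    current = None  # number of the standard we are currently inside
    refs = []
    for line in text.splitlines():
        header = re.match(r"\s*GRI (\d+)", line)
        if header:
            current = header.group(1)
        # Disclosure numbers are already complete, e.g. 102-40.
        refs.extend(re.findall(r"\d+-\d+", line))
        # Clause numbers need the current standard as a prefix.
        if current is not None:
            for clause in re.findall(r"Clause (\d+\.\d+)", line):
                refs.append(f"{current}-{clause}")
    return refs

# Sample from the question plus a made-up third section (not a real standard).
sample = (
    "GRI 101: Foundation:\n"
    "• Clause 1.1 (Stakeholder Inclusiveness principle)\n"
    "• Clause 1.3 (Materiality principle)\n"
    "• Clause 2.1 (Applying the Reporting Principles)\n"
    "GRI 102: General Disclosures: Disclosures 102-40, 102-42, 102-43, and 102-44\n"
    "GRI 999: Made-up section:\n"
    "• Clause 4.2 (hypothetical)"
)
result = refs_by_section(sample)
print("\n".join(result))
```

It could be applied per cell with `df["column"].apply(refs_by_section)`. Note that the references come out in the order they appear in the text rather than disclosures first.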