What is the Regex expression for substituting every second occurrence of a word in a string?

Time:05-26

The string is as below,

LOCATION AQI CATEGORY RANGE Dhaka a 251 VERY UNHEALTHY 195-306 Chittagong b 122 CAUTION 82-162 Gazipur c 237 VERY UNHEALTHY - Narayanganj c 335 EXTREMELY UNHEALTHY - Sylhetc c 159 UNHEALTHY - Khulna c DNA DNA - Rajshahi c 156 UNHEALTHY - Barisal c 192 UNHEALTHY - Savar DNA DNA Muradpur 126 CAUTION Cumilla 157 DNA Sandip DNA WARNING

Here, after each mixed-case word, NULL/None values are represented as DNA, both for the AQI value and for the climate condition. My goal is to create a dataframe like the one below:

LOCATION AQI
Dhaka 172
Chittagong 125
Gazipur 178
... ...
Khulna 0
... ...

I only need the columns shown in the table above. Also, note that only the AQI's DNA is to be substituted with a numerical value, for example 0, for my model.

My approach, combined with user Corralien's:

re.sub(r"(\b[A-Z]+\b)|(\d+-\d+)|(\s\w\s)|(-)", ' ', s)

CodePudding user response:

You can use regex to parse your string:

import pandas as pd
import re

s = """LOCATION AQI CATEGORY RANGE Dhaka 172 UNHEALTHY Chittagong 125 CAUTION Gazipur 178 UNHEALTHY Narayanganj 174 UNHEALTHY Sylhetc 129 CAUTION Khulna DNA DNA Rajshahi 118 CAUTION Barisal 118 CAUTION Savar DNA DNA Muradpur 126 CAUTION Cumilla 157 DNA Sandip DNA WARNING"""

data = []
for rec in re.sub(r'\b[A-Z]+\b', '', s).strip().split('  '):
    rec = rec.rsplit(maxsplit=1)
    loc = rec[0].strip()
    aqi = int(rec[1]) if len(rec) > 1 else 0
    data.append({'LOCATION': loc, 'AQI': aqi})
df = pd.DataFrame(data)

Output:

LOCATION AQI
Dhaka 172
Chittagong 125
Gazipur 178
Narayanganj 174
Sylhetc 129
Khulna 0
Rajshahi 118
Barisal 118
Savar 0
Muradpur 126
Cumilla 157
Sandip 0
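
To see why splitting on two spaces works, it can help to look at the intermediate string. The following is a minimal sketch on a shortened sample; the names `sample`, `stripped` and `records` are assumptions made for illustration and are not part of the answer above.

```python
import re

# Shortened sample in the same layout as the data above (illustrative only).
sample = "LOCATION AQI CATEGORY RANGE Dhaka 172 UNHEALTHY Khulna DNA DNA Sandip DNA WARNING"

# Removing every all-uppercase word keeps the spaces that surrounded it,
# so neighbouring records end up separated by two or more spaces.
stripped = re.sub(r'\b[A-Z]+\b', '', sample).strip()
print(repr(stripped))

records = []
for rec in stripped.split('  '):
    parts = rec.rsplit(maxsplit=1)               # [location] or [location, aqi]
    aqi = int(parts[1]) if len(parts) > 1 else 0
    records.append((parts[0].strip(), aqi))
print(records)
```

Note that a record followed by three consecutive uppercase words would leave four spaces behind and produce an empty chunk, which this loop (like the answer's) does not guard against.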

CodePudding user response:

The first step is to extract individual values. I assume that there are no items in the dataset that consist of multiple words:

items = s.split()

Strangely, your dataframe contains four columns but data exists only for three columns. I assume that either there is a mistake or the fourth column is empty. Since we do not need it, anyway, who cares?

In the rest of the code, I:

  1. Create a 1D numpy array from the list of data

  2. Reshape it into three columns and as many rows as needed

  3. Create a dataframe from that array, using the first three original items as column names

  4. Replace 'DNA's with NaNs and later with 0s

  5. Convert the LOCATION column into the index, extract the AQI column, and convert it to integer numbers.

    import numpy as np
    import pandas as pd

    pd.DataFrame(np.array(items[4:]).reshape(-1, 3), columns=items[:3])\
      .replace('DNA', np.nan).fillna(0)\
      .set_index('LOCATION')['AQI'].astype(int)

There is no need for regex.
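
For readers who want to see the reshaping idea without NumPy or pandas, here is a rough plain-Python sketch of the same grouping. It is a stand-in for `reshape(-1, 3)`, not the code from this answer, and it uses a shortened sample string; the names `header`, `values`, `rows` and `aqi` are illustrative.

```python
# Shortened sample in the same layout as the data (illustrative only).
s = "LOCATION AQI CATEGORY RANGE Dhaka 172 UNHEALTHY Khulna DNA DNA Cumilla 157 DNA"

items = s.split()
header, values = items[:3], items[4:]   # the fourth header (RANGE) has no data

# Group the flat list into rows of three: the plain-Python analogue of reshape(-1, 3).
rows = [values[i:i + 3] for i in range(0, len(values), 3)]

# Map each location to its AQI, treating 'DNA' as 0.
aqi = {loc: 0 if value == 'DNA' else int(value) for loc, value, _category in rows}
print(aqi)
```

Like the answer, this assumes every field is a single token. A two-word category such as VERY UNHEALTHY, which appears in the string from the question, would shift the rows.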

CodePudding user response:

Try this:

import re
import pandas as pd

test_str = "LOCATION AQI CATEGORY RANGE Dhaka 172 UNHEALTHY Chittagong 125 CAUTION Gazipur 178 UNHEALTHY Narayanganj 174 UNHEALTHY Sylhetc 129 CAUTION Khulna DNA DNA Rajshahi 118 CAUTION Barisal 118 CAUTION Savar DNA DNA Muradpur 126 CAUTION Cumilla 157 DNA Sandip DNA WARNING"

match = re.findall(r'([A-Za-z]+)\s((\d+)|(DNA))', test_str)
match = [val[:2] for val in match]

df = pd.DataFrame(match, columns =['LOCATION', 'AQI'])
df["AQI"] = df["AQI"].replace("DNA", "0").astype(int)
print(df)

Output:

       LOCATION  AQI
0         Dhaka  172
1    Chittagong  125
2       Gazipur  178
3   Narayanganj  174
4       Sylhetc  129
5        Khulna    0
6      Rajshahi  118
7       Barisal  118
8         Savar    0
9      Muradpur  126
10      Cumilla  157
11       Sandip    0
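
The slice `val[:2]` is needed because every pair of parentheses in the pattern is a capturing group, so `re.findall` returns four-element tuples. One possible variation (a sketch, not part of the original answer) puts the alternation inside a single group so that each match is already a (location, value) pair. The shortened `sample` string and the names `pairs` and `records` are assumptions for illustration.

```python
import re

# Shortened sample in the same layout as test_str (illustrative only).
sample = "LOCATION AQI CATEGORY RANGE Dhaka 172 UNHEALTHY Khulna DNA DNA Sandip DNA WARNING"

# Two capturing groups only: the word, then either digits or the literal DNA.
pairs = re.findall(r'([A-Za-z]+)\s(\d+|DNA)', sample)

# Convert the value to int, mapping DNA to 0.
records = [(loc, 0 if value == 'DNA' else int(value)) for loc, value in pairs]
print(records)
```

The resulting list can be passed to `pd.DataFrame(records, columns=['LOCATION', 'AQI'])` in the same way as `match` above.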