Python str.match regex works in main but not within another function-CodePudding

I have files with 6 character codes of the form AANNAA where A=letter and N=number. I want to flag the rows that fit (assign 0) or don't fit (assign 1) that pattern. The following code works in that it identifies codes that don't fit as NaN when it is in main, but gives ValueError: Cannot set a DataFrame with multiple columns to the single column code when I try put it into a function.

wave2['code'] = wave2[wave2['SGIC2'].str.match(r'^[A-Za-z]{2}[0-9]{2}[A-Za-z]{2}$') ==True]

I have included the code that checks variable name length in case that is part of the problem.

import pandas as pd

wave2 = pd.read_csv ('wave2.csv')

def main():
    wave2.rename(columns={ wave2.columns[0]: "SGIC2" }, inplace = True)

    #wave2['code'] = wave2[wave2['SGIC2'].str.match(r'^[A-Za-z]{2}[0-9]{2}[A-Za-z]{2}$') ==True]
    #print(wave2)
    check_valid(wave2)

def check_valid(dfName):          #check length and composition AANNAA
    length = (dfName.iloc[:,0].str.len())
    dfName['lengthinV'] = (length != 6).astype(int)
    nlengthinV = str(dfName['lengthinV'].values.sum())    #will not be needed once regex works

    dfName['code'] = dfName[dfName.iloc[:,0].str.match(r'^[A-Za-z]{2}[0-9]{2}[A-Za-z]{2}$')==True]
    print(dfName)
    return nlengthinV

if __name__ == "__main__":
    main()

Data = ('AB01ER','DA23RE','MN34ER','FG19SD','BB21BB', TR15HG','SE21AR','TI85BV','LK31YU','WI29VV','WI13AL', 'HL29WE','HL29WE','IH8THS','TH15P8', 'AS43GGG', 'J12RT',  'RT13CA', 'CH08VI', 'KK09DE')

I don't know if the problem is me not understanding data types or objects in dataframes or something within the other function. I have tried many combinations and permutations of things but can't crack it. I am expecting the same output as I get when the code is in main:

 SGIC2    code  lengthinV
0    AB01ER  AB01ER          0
1    DA23RE  DA23RE          0
2    MN34ER  MN34ER          0
3    FG19SD  FG19SD          0
4    BB21BB  BB21BB          0
5    TR15HG  TR15HG          0
6    SE21AR  SE21AR          0
7    TI85BV  TI85BV          0
8    LK31YU  LK31YU          0
9    WI29VV  WI29VV          0
10   WI13AL  WI13AL          0
11   HL29WE  HL29WE          0
12   HL29WE  HL29WE          0
13   IH8THS     NaN          0
14   TH15P8     NaN          0
15  AS43GGG     NaN          1
16    J12RT     NaN          1
17   RT13CA  RT13CA          0
18   CH08VI  CH08VI          0
19   KK09DE  KK09DE          0

Thank you for any and all assistance.

CodePudding user response：

The problem

You are applying a boolean index to all columns of your dataframe. Attempting to assign this resulting dataframe back to a single column is what causes Pandas to complain -- because you simply can't do that.

Let's take it step by step. First, look at just the result of doing the regex match:

In [8]: wave2.iloc[:,0].str.match(r'^[A-Za-z]{2}[0-9]{2}[A-Za-z]{2}$')
Out[8]:
0      True
1      True
2      True
3      True
4      True
5      True
6      True
7      True
8      True
9      True
10     True
11     True
12     True
13    False
14    False
15    False
16    False
17     True
18     True
19     True
Name: SGIC2, dtype: bool

Now, when you take this boolean array and you attempt to index the dataframe with it, you are telling Pandas that you want to filter out any rows of your dataframe whose index matches the index of the boolean array where the value is False:

In [9]: wave2[wave2.iloc[:,0].str.match(r'^[A-Za-z]{2}[0-9]{2}[A-Za-z]{2}$')]
Out[9]:
     SGIC2  lengthinV
0   AB01ER          0
1   DA23RE          0
2   MN34ER          0
3   FG19SD          0
4   BB21BB          0
5   TR15HG          0
6   SE21AR          0
7   TI85BV          0
8   LK31YU          0
9   WI29VV          0
10  WI13AL          0
11  HL29WE          0
12  HL29WE          0
17  RT13CA          0
18  CH08VI          0
19  KK09DE          0

Notice the missing rows in the dataframe. It seems to me what you're trying to do here is filter out rows where the SGIC2 matches the pattern.

Solutions

Judging by your comments, you seemed to be trying to filter out values that didn't match the pattern, and that lengthinV would be obsoleted by the regex -- which is correct -- so we won't bother with it.

By creating the boolean array code_mask, you can apply it to the original dataframe, thereby filtering out invalid values:

code_mask = df["SGIC2"].str.match(r"^\w{2}\d{2}\w{2}$")

df = df[code_mask]

To use a function, you need to decide whether you want the function to modify the input dataframe, modify a global dataframe, or copy an input dataframe and return a new one. I'd suggest the latter:

def check_code_and_length(df: pd.DataFrame) -> pd.DataFrame:
    df_copy = df.copy()

    # Notice the change I made to the regular expression
    mask = df["SGIC2"].str.match(r"^\w{2}\d{2}\w{2}$")

    # Note that I've also reset the index and dropped the old one
    return df_copy[mask].reset_index(drop=True)

Then, to get your new dataframe, you simply call the function with the input frame and either "overwrite" the original explicitly, or assign it to a new variable:

if __name__ == "__main__":
    wave2 = pd.read_csv ("wave2.csv")
    wave2.rename(columns={ wave2.columns[0]: "SGIC2" }, inplace = True)
    wave2_checked = check_code_and_length(wave2)

Demo:

In [21]: df
Out[21]:
      SGIC2
0    AB01ER
1    DA23RE
2    MN34ER
3    FG19SD
4    BB21BB
5    TR15HG
6    SE21AR
7    TI85BV
8    LK31YU
9    WI29VV
10   WI13AL
11   HL29WE
12   HL29WE
13   IH8THS
14   TH15P8
15  AS43GGG
16    J12RT
17   RT13CA
18   CH08VI
19   KK09DE

In [22]: check_code_and_length(df)
Out[22]:
     SGIC2
0   AB01ER
1   DA23RE
2   MN34ER
3   FG19SD
4   BB21BB
5   TR15HG
6   SE21AR
7   TI85BV
8   LK31YU
9   WI29VV
10  WI13AL
11  HL29WE
12  HL29WE
13  TH15P8
14  RT13CA
15  CH08VI
16  KK09DE

In [23]: df # notice the original wasn't modified
Out[23]:
      SGIC2
0    AB01ER
1    DA23RE
2    MN34ER
3    FG19SD
4    BB21BB
5    TR15HG
6    SE21AR
7    TI85BV
8    LK31YU
9    WI29VV
10   WI13AL
11   HL29WE
12   HL29WE
13   IH8THS
14   TH15P8
15  AS43GGG
16    J12RT
17   RT13CA
18   CH08VI
19   KK09DE