I have files with 6 character codes of the form AANNAA where A=letter and N=number. I want to flag the rows that fit (assign 0) or don't fit (assign 1) that pattern. The following code works in that it identifies codes that don't fit as NaN when it is in main, but gives ValueError: Cannot set a DataFrame with multiple columns to the single column code when I try put it into a function.
wave2['code'] = wave2[wave2['SGIC2'].str.match(r'^[A-Za-z]{2}[0-9]{2}[A-Za-z]{2}$') ==True]
I have included the code that checks variable name length in case that is part of the problem.
import pandas as pd
wave2 = pd.read_csv ('wave2.csv')
def main():
wave2.rename(columns={ wave2.columns[0]: "SGIC2" }, inplace = True)
#wave2['code'] = wave2[wave2['SGIC2'].str.match(r'^[A-Za-z]{2}[0-9]{2}[A-Za-z]{2}$') ==True]
#print(wave2)
check_valid(wave2)
def check_valid(dfName): #check length and composition AANNAA
length = (dfName.iloc[:,0].str.len())
dfName['lengthinV'] = (length != 6).astype(int)
nlengthinV = str(dfName['lengthinV'].values.sum()) #will not be needed once regex works
dfName['code'] = dfName[dfName.iloc[:,0].str.match(r'^[A-Za-z]{2}[0-9]{2}[A-Za-z]{2}$')==True]
print(dfName)
return nlengthinV
if __name__ == "__main__":
main()
Data = ('AB01ER','DA23RE','MN34ER','FG19SD','BB21BB', TR15HG','SE21AR','TI85BV','LK31YU','WI29VV','WI13AL', 'HL29WE','HL29WE','IH8THS','TH15P8', 'AS43GGG', 'J12RT', 'RT13CA', 'CH08VI', 'KK09DE')
I don't know if the problem is me not understanding data types or objects in dataframes or something within the other function. I have tried many combinations and permutations of things but can't crack it. I am expecting the same output as I get when the code is in main:
SGIC2 code lengthinV
0 AB01ER AB01ER 0
1 DA23RE DA23RE 0
2 MN34ER MN34ER 0
3 FG19SD FG19SD 0
4 BB21BB BB21BB 0
5 TR15HG TR15HG 0
6 SE21AR SE21AR 0
7 TI85BV TI85BV 0
8 LK31YU LK31YU 0
9 WI29VV WI29VV 0
10 WI13AL WI13AL 0
11 HL29WE HL29WE 0
12 HL29WE HL29WE 0
13 IH8THS NaN 0
14 TH15P8 NaN 0
15 AS43GGG NaN 1
16 J12RT NaN 1
17 RT13CA RT13CA 0
18 CH08VI CH08VI 0
19 KK09DE KK09DE 0
Thank you for any and all assistance.
CodePudding user response:
The problem
You are applying a boolean index to all columns of your dataframe. Attempting to assign this resulting dataframe back to a single column is what causes Pandas to complain -- because you simply can't do that.
Let's take it step by step. First, look at just the result of doing the regex match:
In [8]: wave2.iloc[:,0].str.match(r'^[A-Za-z]{2}[0-9]{2}[A-Za-z]{2}$')
Out[8]:
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 True
10 True
11 True
12 True
13 False
14 False
15 False
16 False
17 True
18 True
19 True
Name: SGIC2, dtype: bool
Now, when you take this boolean array and you attempt to index the dataframe with it, you are telling Pandas that you want to filter out any rows of your dataframe whose index matches the index of the boolean array where the value is False
:
In [9]: wave2[wave2.iloc[:,0].str.match(r'^[A-Za-z]{2}[0-9]{2}[A-Za-z]{2}$')]
Out[9]:
SGIC2 lengthinV
0 AB01ER 0
1 DA23RE 0
2 MN34ER 0
3 FG19SD 0
4 BB21BB 0
5 TR15HG 0
6 SE21AR 0
7 TI85BV 0
8 LK31YU 0
9 WI29VV 0
10 WI13AL 0
11 HL29WE 0
12 HL29WE 0
17 RT13CA 0
18 CH08VI 0
19 KK09DE 0
Notice the missing rows in the dataframe. It seems to me what you're trying to do here is filter out rows where the SGIC2
matches the pattern.
Solutions
Judging by your comments, you seemed to be trying to filter out values that didn't match the pattern, and that lengthinV
would be obsoleted by the regex -- which is correct -- so we won't bother with it.
By creating the boolean array code_mask
, you can apply it to the original dataframe, thereby filtering out invalid values:
code_mask = df["SGIC2"].str.match(r"^\w{2}\d{2}\w{2}$")
df = df[code_mask]
To use a function, you need to decide whether you want the function to modify the input dataframe, modify a global dataframe, or copy an input dataframe and return a new one. I'd suggest the latter:
def check_code_and_length(df: pd.DataFrame) -> pd.DataFrame:
df_copy = df.copy()
# Notice the change I made to the regular expression
mask = df["SGIC2"].str.match(r"^\w{2}\d{2}\w{2}$")
# Note that I've also reset the index and dropped the old one
return df_copy[mask].reset_index(drop=True)
Then, to get your new dataframe, you simply call the function with the input frame and either "overwrite" the original explicitly, or assign it to a new variable:
if __name__ == "__main__":
wave2 = pd.read_csv ("wave2.csv")
wave2.rename(columns={ wave2.columns[0]: "SGIC2" }, inplace = True)
wave2_checked = check_code_and_length(wave2)
Demo:
In [21]: df
Out[21]:
SGIC2
0 AB01ER
1 DA23RE
2 MN34ER
3 FG19SD
4 BB21BB
5 TR15HG
6 SE21AR
7 TI85BV
8 LK31YU
9 WI29VV
10 WI13AL
11 HL29WE
12 HL29WE
13 IH8THS
14 TH15P8
15 AS43GGG
16 J12RT
17 RT13CA
18 CH08VI
19 KK09DE
In [22]: check_code_and_length(df)
Out[22]:
SGIC2
0 AB01ER
1 DA23RE
2 MN34ER
3 FG19SD
4 BB21BB
5 TR15HG
6 SE21AR
7 TI85BV
8 LK31YU
9 WI29VV
10 WI13AL
11 HL29WE
12 HL29WE
13 TH15P8
14 RT13CA
15 CH08VI
16 KK09DE
In [23]: df # notice the original wasn't modified
Out[23]:
SGIC2
0 AB01ER
1 DA23RE
2 MN34ER
3 FG19SD
4 BB21BB
5 TR15HG
6 SE21AR
7 TI85BV
8 LK31YU
9 WI29VV
10 WI13AL
11 HL29WE
12 HL29WE
13 IH8THS
14 TH15P8
15 AS43GGG
16 J12RT
17 RT13CA
18 CH08VI
19 KK09DE