How do I remove unwanted parts from strings in a Python DataFrame column-CodePudding

Based on the script originally suggested by u/commandlineluser at reddit, I (as a Python novice) attempted to revise the original code to remove unwanted parts that vary across column values. The Python script involves creating a dictionary with keys and values and using a list comprehension with str.replace.

(part of the original script by u/commandlineluser at reddit)

extensions = "dat", "ssp", "dta", "v9", "xlsx"

(The next line is my revision to the above part, and below is the complete code block)

extensions = "dat", "ssp", "dta", "20dta", "u20dta", "f1dta", "f2dta", "v9", "xlsx"

Some of the results are different than what I desire. Please see below (what I tried).

import pandas as pd
import re
data = {"full_url": ['https://meps.ahrq.gov/data_files/pufs/h225/h225dat.zip',
                     'https://meps.ahrq.gov/data_files/pufs/h51bdat.zip',
                     'https://meps.ahrq.gov/data_files/pufs/h47f1dat.zip',
                     'https://meps.ahrq.gov/data_files/pufs/h225/h225ssp.zip',
                     'https://meps.ahrq.gov/data_files/pufs/h220i/h220if1dta.zip',
                     'https://meps.ahrq.gov/data_files/pufs/h220h/h220hv9.zip',
                     'https://meps.ahrq.gov/data_files/pufs/h220e/h220exlsx.zip',
                     'https://meps.ahrq.gov/data_files/pufs/h224/h224xlsx.zip',
                     'https://meps.ahrq.gov/data_files/pufs/h036brr/h36brr20dta.zip',
                     'https://meps.ahrq.gov/data_files/pufs/h036/h36u20dta.zip',
                     'https://meps.ahrq.gov/data_files/pufs/h197i/h197if1dta.zip',
                     'https://meps.ahrq.gov/data_files/pufs/h197i/h197if2dta.zip']}
df = pd.DataFrame(data)
extensions = ["dat", "ssp", "dta", "20dta", "u20dta", "f1dta", "f2dta", "v9", "xlsx"]
replacements = dict.fromkeys((f"{ext}[.]zip$" for ext in extensions), "")
df["file_id"] = df["full_url"].str.split("/").str[-1].replace(replacements, regex=True)
print(df["file_id"])

Annotated output

0         h225 (looks good)
1         h51b (looks good)
2        h47f1 (h47 -> desired)
3         h225 (looks good)
4      h220if1 (h220i -> desired)
5        h220h  (looks good)
6        h220e  (looks good)
7         h224 (looks good)
8     h36brr20 (h36brr -> desired)
9       h36u20 (h36 -> desired)
10     h197if1 (h197i -> desired)
11     h197if2 (h197i -> desired)

CodePudding user response：

You have two issues here, and they are all in this line:

extensions = ["dat", "ssp", "dta", "20dta", "u20dta", "f1dta", "f2dta", "v9", "xlsx"]

First issue

The first issue is in the order of the elements of this list. "dat" and "dta" are substrings of other elements in this string and they are at the front of this list. Let's take an example: h47f1dat.zip needs to become h47. But in these lines:

replacements = dict.fromkeys((f"{ext}[.]zip$" for ext in extensions), "")
df["file_id"] = df["full_url"].str.split("/").str[-1].replace(replacements, regex=True)

You keep the order, meaning that you'll first be filtering with the "dat" string, which becomes h47f1. This can be easily fixed by reordering your list.

Second issue

You missed an entry in your extensions list: if you want h47f1dat.zip to become h47 you need to have "f1dat" in your list but you only have "f1dta".

Conclusion

You were almost there! There was simply a small issue with the order of the elements and one extension was missing (or you have a typo in your URLs).

The following extensions list:

extensions = ["ssp", "20dta", "u20dta", "f1dat", "f1dta", "f2dta", "v9", "dat", "dta", "xlsx"]

Together with the rest of your code gives you the result you want:

0       h225                                                                                                                                                                                                                                                                    
1       h51b                                                                                                                                                                                                                                                                    
2        h47                                                                                                                                                                                                                                                                    
3       h225                                                                                                                                                                                                                                                                    
4      h220i                                                                                                                                                                                                                                                                    
5      h220h                                                                                                                                                                                                                                                                    
6      h220e                                                                                                                                                                                                                                                                    
7       h224                                                                                                                                                                                                                                                                    
8     h36brr                                                                                                                                                                                                                                                                    
9       h36u                                                                                                                                                                                                                                                                    
10     h197i                                                                                                                                                                                                                                                                    
11     h197i

CodePudding user response：

Good catch about the issue of the order of the elements and the missing extension! Thank you.

Question 1: Do you mean the list extensions is not sorted alphabetically? Can I not use the Python sort() method to sort the list? I have over one thousand rows in the actual dataframe, and I prefer to sort the list programmatically. I hope I do not misunderstand your comments.

Question 2: I don't understand why I am getting h36u instead of the desired value h36 in the output even after reordering the list as you suggested. Any thoughts?

I have tried another approach (code below) using Jupyter Lab, which provides the output in which the first two values are different from the desired output (also shown below), but the other values seem to be what I desire including h36.

df["file_id"] = df["full_url"].str.split("/").str[-1].str.replace(r'(\dat.zip \
|f1dat.zip|dta.zip|f1dta.zip|f2dta.zip|20dta.zip|u20dta.zip|xlsx.zip|v9.zip|ssp.zip)' \
,'', regex=True)
print(df["file_id"])

Output (annotated)

0     h225dat.zip (not desired; h225 desired)
1     h51bdat.zipn (not desired; h51b desired)
2             h47
3            h225
4           h220i
5           h220h
6           h220e
7            h224
8          h36brr
9             h36
10          h197i
11          h197i

Question 3: Any comments on the above alternative code snippets?