I have a list of 6000 files and a pandas data frame that contains a list of URLs. Some of those URLs match the names of those 6000 files. While I am iterating through the list of the files for some other purpose (extracting text), I am also looking for matching names in the URLs column. If there is a match, I write the matching file path in a new column.
It does not sound complicated, but my code does not work:
```
import glob
import os
import pandas as pd

files = glob.glob("materials/*.html")
data = pd.read_csv("file.csv")

def match_name(row):
    # Relies on the loop variables `filename` and `file` as globals
    if filename in row['URL']:
        return file

for file in files:
    filename = os.path.basename(f'{file[:-5]}')  # drop the ".html" suffix
    extractor = open(file, 'rb')
    ...
    full = [p_text, os.path.basename(file)]
    df_full = pd.DataFrame(full)
    data['Path'] = data.apply(lambda x: match_name(x), axis=1)
```
However, this does not work: every value in the new column comes back as None. I also tried:
data['Path'] = data.apply(lambda x: file if filename in x else None, axis=1)
The relevant columns of the data frame look like this:
|Name | Value | URL |
|-----|-------|-----------------------------|
|Name1|Value1 |http://example.com/LALAC.html|
|Name2|Value2 |http://example.com/ABASW.html|
|Name3|Value3 |http://example.com/4421C.html|
The files are LALAC.txt, SDDSA1.txt, 4421C.html, etc. The output that I want to get is:
|Name | Value | URL |Path |
|-----|-------|-----------------------------|-------------------|
|Name1|Value1 |http://example.com/LALAC.html|materials/LALAC.txt|
|Name2|Value2 |http://example.com/ABASW.html|None |
|Name3|Value3 |http://example.com/4421C.html|materials/4421C.txt|
The path does exist in the folder, but I am missing the reason why I keep getting None. Any ideas?
CodePudding user response:
If you have all of the file names in a set, and all of the URLs in a dataframe, you can do:
import pandas as pd
filenames = {"LALAC", "ABASW", "4421C"}
df = pd.DataFrame({'URL': [
"http://example.com/LALAC.html",
"http://example.com/ABASW.html",
"http://example.com/4421C.html",
"HTTP://example.com/12345.html"
]})
df["Path"] = "materials/" + df["URL"].str.findall('|'.join(filenames)).str[0] + ".txt"
Result:
                             URL                 Path
0  http://example.com/LALAC.html  materials/LALAC.txt
1  http://example.com/ABASW.html  materials/ABASW.txt
2  http://example.com/4421C.html  materials/4421C.txt
3  HTTP://example.com/12345.html                  NaN
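
As for why the original attempt returns None: assuming the `apply` call sits inside the `for` loop, every iteration overwrites `data['Path']`, so only matches for the last file survive. With 6000 files it may also be simpler to skip the regex and look each URL up in a dictionary keyed by file stem. The following is only a sketch under some assumptions: the `files` list is a hypothetical stand-in for the `glob.glob` result, and `stem_to_path` and `url_stem` are names made up for this example.

```python
import os
from urllib.parse import urlparse

import pandas as pd

# Hypothetical stand-in for glob.glob("materials/*.html")
files = ["materials/LALAC.html", "materials/4421C.html", "materials/SDDSA1.html"]

# Map each file's stem (base name without extension) to its full path
stem_to_path = {os.path.splitext(os.path.basename(f))[0]: f for f in files}

df = pd.DataFrame({"URL": [
    "http://example.com/LALAC.html",
    "http://example.com/ABASW.html",
    "http://example.com/4421C.html",
]})

def url_stem(url):
    """Return the last path component of a URL without its extension."""
    return os.path.splitext(os.path.basename(urlparse(url).path))[0]

# Series.map with a dict leaves NaN where the stem has no matching file
df["Path"] = df["URL"].map(url_stem).map(stem_to_path)
print(df)
```

Because the lookup compares whole stems rather than substrings, a file named `LALA` would not accidentally match a URL ending in `LALAC.html`. The column is built once, outside any loop, so nothing gets overwritten.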