In Python, I have a list of 68K files named files_in_folder. Additionally, I have a CSV file (loaded as a pandas DataFrame) with filenames and extensions. An example:
import os
import numpy as np  # needed for np.where further down
import pandas as pd

files_in_folder = ['2.fds', '4.fds', '5.jpg']
df = pd.DataFrame({'filename': ['1.fds', '2.fds', '3.fds', '4.fds', '5.jpg'],
                   'correct_extension?': [None, None, None, None, None],
                   'extension': ['.fds', None, '.json', '.fds', '.jpg']
                   })
For every item in the list, I check whether the file appears in column 'filename'. If its extension matches the value in column 'extension', True should be written to column 'correct_extension?' in that row.
On Stack Overflow I found NumPy's 'where', which could do something like this:
for file in files_in_folder:
    extension = os.path.splitext(file)
    df['correct_extension?'] = np.where(((df['filename'] == file) & (df['extension'] == extension)), True, False)
However, because of my loop, this method doesn't give the expected result (below). I am looking for someone who can give me a hint on how to solve this problem, preferably with a loop.
I'm very eager to learn from you.
Expected result (DataFrame):
'filename': ['1.fds', '2.fds', '3.fds', '4.fds', '5.jpg'],
'correct_extension?': [None, None, None, True, True],
'extension': ['.fds', None, '.json', '.fds', '.jpg']
A similar topic I found was: Pandas: How do I assign values based on multiple conditions for existing columns?
CodePudding user response:
You can use str.extract with a regex built from the values in the 'extension' column:
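import re

# Alternation of the known extensions; re.escape keeps '.' literal
rexp = '|'.join(re.escape(ext) for ext in df['extension'].dropna().unique())
df['correct_extension?'] = (
    df['filename'].isin(files_in_folder)
    # '$' anchors the match to the end of the filename
    & (df['filename'].str.extract('(' + rexp + ')$', expand=False) == df['extension'])
)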
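As a variation on the snippet above, here is a minimal sketch of the same check without a regex. It assumes the extension of each filename can be taken with os.path.splitext; the helper name flag_correct_extensions is made up for this example and is not part of pandas.

```python
import os
import pandas as pd


def flag_correct_extensions(df, files_in_folder):
    """Return a boolean Series: True where the file is in the folder
    and its actual suffix equals the value in the 'extension' column."""
    # A set makes the membership test cheap even for a long file list
    present = df['filename'].isin(set(files_in_folder))
    # Actual suffix of each filename, e.g. '4.fds' -> '.fds'
    actual_ext = df['filename'].map(lambda name: os.path.splitext(name)[1])
    # A None in 'extension' never equals a suffix, so those rows stay False
    return present & (actual_ext == df['extension'])


files_in_folder = ['2.fds', '4.fds', '5.jpg']
df = pd.DataFrame({'filename': ['1.fds', '2.fds', '3.fds', '4.fds', '5.jpg'],
                   'correct_extension?': [None, None, None, None, None],
                   'extension': ['.fds', None, '.json', '.fds', '.jpg']})

df['correct_extension?'] = flag_correct_extensions(df, files_in_folder)
print(df)
```

Like the regex version, this writes False rather than None for rows that do not match.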
I'm not sure why you want the for loop, but if you need a robust way that works even if you have duplicate file names in your 'filename' column, here is one:
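for file in files_in_folder:
    extension = os.path.splitext(file)  # (root, ext) tuple
    # Index labels of every row whose filename equals this file
    idxs = df.loc[df['filename'] == file, 'extension'].index
    for idx in idxs:
        # Assign through df.loc; df.iloc[idx][col] = ... would write to a temporary copy
        df.loc[idx, 'correct_extension?'] = ((df.loc[idx, 'filename'] == file) &
                                             (df.loc[idx, 'extension'] == extension[1]))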
print(df)
  filename correct_extension? extension
0    1.fds              False      .fds
1    2.fds              False      None
2    3.fds              False     .json
3    4.fds               True      .fds
4    5.fds              False      None
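The loop in the question fails because each pass reassigns the whole column, so only the result for the last file survives. Here is a minimal sketch of that effect and one way around it, using the question's sample data. The names overwritten, accumulated and hit are only for illustration.

```python
import os
import numpy as np
import pandas as pd

files_in_folder = ['2.fds', '4.fds', '5.jpg']
df = pd.DataFrame({'filename': ['1.fds', '2.fds', '3.fds', '4.fds', '5.jpg'],
                   'correct_extension?': [None, None, None, None, None],
                   'extension': ['.fds', None, '.json', '.fds', '.jpg']})

# Overwriting version: every iteration replaces the entire column,
# so the match found for '4.fds' is lost when '5.jpg' is processed.
overwritten = df.copy()
for file in files_in_folder:
    ext = os.path.splitext(file)[1]
    overwritten['correct_extension?'] = np.where(
        (overwritten['filename'] == file) & (overwritten['extension'] == ext),
        True, False)

# Accumulating version: only the rows matching the current file are touched,
# everything else keeps its previous value (None or an earlier True).
accumulated = df.copy()
for file in files_in_folder:
    ext = os.path.splitext(file)[1]
    hit = (accumulated['filename'] == file) & (accumulated['extension'] == ext)
    accumulated.loc[hit, 'correct_extension?'] = True

print(overwritten['correct_extension?'].tolist())   # [False, False, False, False, True]
print(accumulated['correct_extension?'].tolist())   # [None, None, None, True, True]
```

The accumulating loop reproduces the expected result from the question, at the cost of one full-column comparison per file.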
CodePudding user response:
Let's try
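# Rows that are present in the folder and have a recorded extension
mask = df['filename'].isin(files_in_folder) & df['extension'].notna()
# Assign through df.loc so the DataFrame itself is modified, not a temporary column object
df.loc[mask, 'correct_extension?'] = df[mask].apply(
    lambda row: row['filename'].endswith(row['extension']), axis=1)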
print(df)
  filename correct_extension? extension
0    1.fds               None      .fds
1    2.fds               None      None
2    3.fds               None     .json
3    4.fds               True      .fds
4    5.fds               None      None
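If you really want the loop:
# One boolean row per file, comparing against that file's own extension instead of a hard-coded '.fds'
matches = np.array([(df['filename'] == f) & (df['extension'] == os.path.splitext(f)[1])
                    for f in files_in_folder]).any(axis=0)
df['correct_extension?'] = np.where(matches, True, None)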
print(df)
  filename correct_extension? extension
0    1.fds               None      .fds
1    2.fds               None      None
2    3.fds               None     .json
3    4.fds               True      .fds
4    5.fds               None      None