Insert pandas dataframe value based on multiple column conditions

In Python, I have a list of 68K file names called files_in_folder. I also have a CSV file (loaded as a pandas DataFrame) with filenames and extensions. An example:

import numpy as np
import pandas as pd
import os

files_in_folder = ['2.fds', '4.fds', '5.jpg']

df = pd.DataFrame({'filename': ['1.fds', '2.fds', '3.fds', '4.fds', '5.jpg'],
                   'correct_extension?': [None, None, None, None, None],
                   'extension': ['.fds', None, '.json', '.fds', '.jpg']
                  })

For every item in the list, I check whether the file appears in column 'filename'. If that row's 'extension' value matches the file's actual extension, True should be written to column 'correct_extension?' in that row.

On Stack Overflow I found NumPy's 'where', which could do something like this:

for file in files_in_folder:
    extension = os.path.splitext(file)
    df['correct_extension?'] = np.where( ( (df['filename'] == file) & (df['extension'] == extension ) ) , True, False)

However, because of my loop, this method doesn't give the expected result (below). Can someone give me a hint on how to solve this problem, preferably with a loop?

I'm very eager to learn from you.

expected result: dataframe ->
'filename': ['1.fds', '2.fds', '3.fds', '4.fds', '5.jpg'],
'correct_extension?': [None, None, None, True, True],
'extension': ['.fds', None, '.json', '.fds', '.jpg']

CodePudding user response：

You can use str.extract with a regex built from the values in the extension column:

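# build an alternation pattern from the non-missing extensions
rexp = '|'.join([ext for ext in df['extension'] if ext is not None])
df['correct_extension?'] = (
    # the file must be present in the folder ...
    (df['filename'].isin(files_in_folder)) &
    # ... and the extension found in its name must equal the listed one
    (df['filename'].str.extract('(' + rexp + ')', expand=False) == df['extension'])
)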
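If you would rather avoid the regex, here is a minimal sketch of the same idea that derives each file's extension with os.path.splitext and only writes True where both conditions hold, so the other rows keep None as in the question's expected result. The names actual_ext and matches are just illustrative, and the sample data is copied from the question.

```python
import os
import pandas as pd

files_in_folder = ['2.fds', '4.fds', '5.jpg']
df = pd.DataFrame({
    'filename': ['1.fds', '2.fds', '3.fds', '4.fds', '5.jpg'],
    'correct_extension?': [None, None, None, None, None],
    'extension': ['.fds', None, '.json', '.fds', '.jpg'],
})

# extension actually carried by each filename (hypothetical helper column)
actual_ext = df['filename'].map(lambda name: os.path.splitext(name)[1])

# True only where the file is in the folder AND the listed extension matches;
# comparing against a missing (None) extension simply yields False
matches = df['filename'].isin(files_in_folder) & (actual_ext == df['extension'])

# write True into the matching rows and leave every other row untouched
df.loc[matches, 'correct_extension?'] = True
print(df)
```

Because only the matching rows are assigned, rows that are absent from the folder or have no listed extension stay None instead of becoming False.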

I'm not sure why you want the for loop, but if you need a robust approach that works even when the filename column contains duplicate file names, here is one:

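for file in files_in_folder:
    # os.path.splitext returns a (root, ext) tuple; keep only the extension
    extension = os.path.splitext(file)[1]
    # index labels of every row with this filename (covers duplicates)
    idxs = df.index[df['filename'] == file]
    for idx in idxs:
        # .loc writes into df itself; df.iloc[idx][col] = ... would assign to a temporary copy
        df.loc[idx, 'correct_extension?'] = df.loc[idx, 'extension'] == extension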

print(df):

  filename  correct_extension? extension
0    1.fds               False      .fds
1    2.fds               False      None
2    3.fds               False     .json
3    4.fds                True      .fds
4    5.fds               False      None
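As for why the np.where loop in the question misbehaves: every iteration reassigns the whole column, so only the result for the last file survives, and os.path.splitext returns a (root, ext) tuple rather than the extension itself. A minimal sketch that keeps the loop but accumulates the matches in a separate boolean array (matched is a name I made up, and the sample data is copied from the question) could look like this:

```python
import os
import numpy as np
import pandas as pd

files_in_folder = ['2.fds', '4.fds', '5.jpg']
df = pd.DataFrame({
    'filename': ['1.fds', '2.fds', '3.fds', '4.fds', '5.jpg'],
    'correct_extension?': [None, None, None, None, None],
    'extension': ['.fds', None, '.json', '.fds', '.jpg'],
})

# running record of rows that matched for at least one file
matched = np.zeros(len(df), dtype=bool)

for file in files_in_folder:
    ext = os.path.splitext(file)[1]  # second tuple element is the extension
    hit = (df['filename'] == file) & (df['extension'] == ext)
    matched |= hit.to_numpy()        # OR into the accumulator, never overwrite

# assign once, after the loop: True for matches, None everywhere else
df['correct_extension?'] = np.where(matched, True, None)
print(df)
```

The column is written a single time after the loop, so earlier matches are not lost when later files are processed.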

CodePudding user response：

Let's try

df['correct_extension?'].update(df[df['filename'].isin(files_in_folder) & df['extension'].notna()]
                                .apply(lambda row: row['filename'].endswith(row['extension']), axis=1))

print(df)

  filename correct_extension? extension
0    1.fds               None      .fds
1    2.fds               None      None
2    3.fds               None     .json
3    4.fds               True      .fds
4    5.fds               None      None
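One caveat, as far as I can tell: calling update on df['correct_extension?'] modifies a Series pulled out of the frame, and with pandas Copy-on-Write enabled that kind of chained in-place change is not expected to reach df. A sketch of the same check that assigns through .loc instead (sub and results are illustrative names, sample data copied from the question):

```python
import pandas as pd

files_in_folder = ['2.fds', '4.fds', '5.jpg']
df = pd.DataFrame({
    'filename': ['1.fds', '2.fds', '3.fds', '4.fds', '5.jpg'],
    'correct_extension?': [None, None, None, None, None],
    'extension': ['.fds', None, '.json', '.fds', '.jpg'],
})

# rows worth checking: file is in the folder and an extension is listed
candidates = df['filename'].isin(files_in_folder) & df['extension'].notna()
sub = df.loc[candidates, ['filename', 'extension']]

# True/False per candidate row, mirroring the endswith test above
results = [name.endswith(ext) for name, ext in zip(sub['filename'], sub['extension'])]

# write the results back into df by index label; other rows keep None
df.loc[sub.index, 'correct_extension?'] = results
print(df)
```

Like the update version, this writes False for a candidate row whose name does not end with the listed extension and leaves non-candidate rows as None.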

If you really want the loop

# compare against each file's own extension instead of a hard-coded '.fds'
df['correct_extension?'] = np.array([np.where((df['filename'] == f) & (df['extension'] == os.path.splitext(f)[1]), True, None)
                                     for f in files_in_folder]).any(axis=0)

print(df)

  filename correct_extension? extension
0    1.fds               None      .fds
1    2.fds               None      None
2    3.fds               None     .json
3    4.fds               True      .fds
4    5.fds               None      None