Create a new dataframe based off of strings lengths of values from existing dataframe

Time:06-30

Sorry if the title is unclear - I wasn't too sure how to word it. So I have a dataframe that has two columns for old IDs and new IDs.

df = pd.DataFrame({'old_id':['111', '2222','3333', '4444'], 'new_id':['5555','6666','777','8888']})

I'm trying to figure out a way to check the string length of every value in both columns and return any IDs that don't match the required string length of 4 in a new dataframe. This will eventually turn into a dictionary of incorrect IDs.

This is the approach I'm currently taking:

incorrect_id_df = df[df.applymap(lambda x: len(x) != 4)]

and the current output:

old_id new_id
 111    NaN
 NaN    NaN
 NaN    777
 NaN    NaN

I'm not sure where to go from here, and I'm sure there's a much better approach. This is the output I'm looking for: a single-column dataframe, with the column named id, containing just the IDs that don't match the required string length:

 id
 111
 777

CodePudding user response:

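Here's part of an answer:

import pandas as pd

df = pd.DataFrame({'old_id':['111', '2222','3333', '4444'], 'new_id':['5555','6666','777','8888']})
# Flatten both columns into a single 1-D array of IDs
all_ids = df.values.flatten()
# Keep only the IDs whose length is not 4
bad_ids = [bad_id for bad_id in all_ids if len(bad_id) != 4]
bad_ids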
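
To finish that off, the list can be wrapped in a single-column dataframe. This is only a minimal sketch: the name incorrect_id_df is borrowed from the question, and to_numpy().ravel() is swapped in for values.flatten(), which walks the cells in the same row-by-row order.

```python
import pandas as pd

df = pd.DataFrame({'old_id': ['111', '2222', '3333', '4444'],
                   'new_id': ['5555', '6666', '777', '8888']})

# Flatten every cell into one 1-D array (row by row), then keep the
# IDs whose length is not 4.
bad_ids = [value for value in df.to_numpy().ravel() if len(value) != 4]

# Wrap the list in a single-column dataframe named 'id'.
incorrect_id_df = pd.DataFrame({'id': bad_ids})
print(incorrect_id_df)
```

The list comprehension relies on every cell being a string; a missing value (NaN) would make len() raise a TypeError.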

CodePudding user response:

Or, if you're not completely sure what you're doing, you can always use the brute-force method :D

import pandas as pd

df = pd.DataFrame({'old_id':['111', '2222','3333', '4444'], 'new_id':['5555','6666','777','8888']})

rows, columns = df.shape

# Visit every cell by position and report the IDs whose length is not 4.
for row in range(rows):
    k = df.iloc[row]
    for column in range(columns):
        if len(k.iloc[column]) != 4:
            print("Bad size of ID on row:" + str(row) + " column:" + str(column))

CodePudding user response:

As commented by Jon Clements, stack could be useful here – it basically stacks (duh) all columns on top of each other:

>>> df[df.applymap(len) != 4].stack().reset_index(drop=True)
0    111
1    777
dtype: object

To turn that into a single-column df named id, you can extend it with a .rename('id').to_frame().
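
Put together, the whole chain might look like the sketch below. Two things are my own substitutions rather than part of the answer: the mask is built with per-column str.len() instead of applymap (which newer pandas releases deprecate in favour of DataFrame.map), and dropna() is called explicitly rather than relying on stack() to discard the NaN cells.

```python
import pandas as pd

df = pd.DataFrame({'old_id': ['111', '2222', '3333', '4444'],
                   'new_id': ['5555', '6666', '777', '8888']})

# True wherever an ID is not exactly 4 characters long.
wrong_length = df.apply(lambda col: col.str.len()) != 4

incorrect_id_df = (
    df.where(wrong_length)       # keep offending IDs, NaN everywhere else
      .stack()                   # one long Series, walked row by row
      .dropna()                  # discard the NaN placeholders explicitly
      .reset_index(drop=True)    # replace the (row, column) index with 0..n-1
      .rename('id')              # the Series name becomes the column name
      .to_frame()
)
print(incorrect_id_df)
```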

CodePudding user response:

I would stack both columns into a single one and keep the IDs whose length is not 4:

ids = df.stack()

bad_ids = ids[ids.str.len() != 4]

Output:

>>> bad_ids

0  old_id    111
2  new_id    777
dtype: object
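
Because the stacked Series keeps the row label and the source column in its index, it can go either way from here. The sketch below shows both the single-column dataframe and a dictionary; the dictionary layout, (row label, column name) mapped to the ID, and the name bad_id_lookup are assumptions on my part, since the question doesn't say what shape the dictionary should have.

```python
import pandas as pd

df = pd.DataFrame({'old_id': ['111', '2222', '3333', '4444'],
                   'new_id': ['5555', '6666', '777', '8888']})

ids = df.stack()
bad_ids = ids[ids.str.len() != 4]

# Single-column dataframe named 'id', with a fresh 0..n-1 index.
incorrect_id_df = bad_ids.rename('id').reset_index(drop=True).to_frame()

# Hypothetical dictionary layout: (row label, source column) -> offending ID.
bad_id_lookup = bad_ids.to_dict()

print(incorrect_id_df)
print(bad_id_lookup)
```

Keeping the index in the dictionary keys makes it easy to trace each bad ID back to the row and column it came from.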