Using nested loops to find a string and only keep parts of a cell value

Time:02-03

I am using the BL-Flickr data to practice data cleaning skills, and I have been working on loops within this data. I have noticed a pattern where some of the Place of Publication values start with "pp. (sometimes roman numerals). number. Publisher: PlaceOfPublication, DateOfPublication".

This happens in about 1% of the data, so I know that it's not the best use of my time to focus here, but I would like to learn how to use loops and clean data where I can move values to other columns. For now I have focused on Place of Pub and Date of Pub.

poplist = df['Place of Publication'].values.tolist()

pop=df['Place of Publication']

dop=df['Date of Publication']

alert1=['pp.'] # do not have to use alert1 or a for loop, however this could have other indicators added if needed

def clean_pop(poplist):

  for element in alert1:
    if element in poplist:
      character_index_comma = poplist.find(',')
      character_index_colon = poplist.find(':')
      poplist = poplist[character_index_colon + 2:character_index_comma]
      dop = poplist[character_index_comma:]
    return poplist

df['Place of Publication']=df.apply(clean_pop, axis=1)

I have tried the code above with poplist and pop, but both are returning "ValueError: Columns must be same length as key". I have also tried running this code without dop and just focusing on the place of publication (pop/poplist). I have also tried doing this without the for loop and just using if 'pp.' in pop: as well as if 'pp.' in poplist:.
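One likely cause of the error, sketched under assumptions: with `axis=1`, `df.apply` passes each whole row to the function as a Series, and `in` on a Series tests the index labels (here, the column names) rather than the values. If that is what happens, the `if` branch never runs, the function returns the entire row, and `apply` builds a multi-column DataFrame that cannot be assigned to a single column. The small frame and the function name `return_row_unchanged` below are hypothetical, and the exact error text may differ between pandas versions.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the BL-Flickr columns.
df = pd.DataFrame({
    "Place of Publication": ["pp. xii. 256. Publisher: London, 1879", "Oxford"],
    "Date of Publication": [None, "1850"],
})

def return_row_unchanged(row):
    # With axis=1, `row` is a Series holding one whole row, not a string.
    # `in` on a Series checks the index labels (the column names), not the
    # values, so this condition is False for every row.
    if "pp." in row:
        return "never reached"
    return row  # the entire row is handed back

result = df.apply(return_row_unchanged, axis=1)

print(type(result).__name__)  # DataFrame
print(result.shape)           # (2, 2)
```

Assigning a two-column result like this to `df['Place of Publication']` is the kind of shape mismatch that the ValueError reports. Pulling the cell out as a string first and returning a single value per row avoids it.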

CodePudding user response:

I apologize that my question was not limited to a specific problem. I am new to Python, and it has been easy to mix too many ideas together without realizing that the concepts are actually separate. I still do not quite understand why the earlier loop raised "ValueError: Columns must be same length as key", but I have at least found a different way to do the processing.

I had an epiphany and realized a way to write a workable version:

def clean_dates(item):
  pop=str(item.loc['Place of Publication'])
  dop=str(item.loc['Date of Publication'])

  if pop[0:3] != 'pp.':
    return dop

  elif pop[0:3] == 'pp.':
    character_index_comma = pop.find(',')
    dop = pop[character_index_comma + 2:]
  return dop

df['Date of Publication']=df.apply(clean_dates, axis=1)

def clean_pop(item):

  pop=str(item.loc['Place of Publication'])

  if pop[0:3] == 'pp.':
    character_index_comma = pop.find(',')
    character_index_colon = pop.find(':')
    # keep only the text between the colon and the comma (the place name)
    pop = pop[character_index_colon + 2:character_index_comma]
  return pop

df['Place of Publication']=df.apply(clean_pop, axis=1)
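The two functions above depend on order: `clean_dates` has to run first, because it reads the place column before `clean_pop` trims it. As a rough alternative, the slicing can be done once in a plain-Python helper that returns both pieces. This is only a sketch: the name `split_publication_field` and the sample strings are made up for illustration, and it assumes every affected value looks like "pp. ... Publisher: Place, Date".

```python
def split_publication_field(text):
    """Return (place, date) from a 'pp. ... Publisher: Place, Date' string.

    Returns (text, None) when the value does not follow that pattern.
    """
    if not text.startswith("pp."):
        return text, None
    colon = text.find(":")
    # Search for the comma after the colon, so a comma earlier in the
    # string (for example in a publisher name) is not picked up.
    comma = text.find(",", colon + 1)
    if colon == -1 or comma == -1:
        return text, None
    place = text[colon + 1:comma].strip()
    date = text[comma + 1:].strip()
    return place, date

print(split_publication_field("pp. xii. 256. J. Smith: London, 1879"))  # ('London', '1879')
print(split_publication_field("Oxford"))                                # ('Oxford', None)
```

Using `strip()` instead of a fixed `+ 2` offset means the result does not depend on there being exactly one space after the colon or comma. With pandas, a helper like this could be applied to the string values of the place column (for example with `Series.map`) and the two parts written back to their own columns.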