I am using the BL-Flickr data to practice data cleaning, and I have been working on loops over this data. I have noticed that some of the Place of Publication values follow the pattern "pp. romannumeralssometimes. number. Publisher: PlaceofPublication, DateofPublication".
This happens in about 1% of the data, so I know it is not the best use of my time, but I would like to learn how to use loops to clean data by moving values into other columns. For now I have focused on Place of Publication and Date of Publication.
poplist = df['Place of Publication'].values.tolist()
pop = df['Place of Publication']
dop = df['Date of Publication']
alert1 = ['pp.']  # do not have to use alert1 or a for loop, however this could have other indicators added if needed

def clean_pop(poplist):
    for element in alert1:
        if element in poplist:
            character_index_comma = poplist.find(',')
            character_index_colon = poplist.find(':')
            poplist = poplist[character_index_colon + 2:character_index_comma]
            dop = poplist[character_index_comma:]
    return poplist

df['Place of Publication'] = df.apply(clean_pop, axis=1)
I have tried the code above with both poplist and pop, and both return "ValueError: Columns must be same length as key". I have also tried dropping dop and handling only the place of publication (pop/poplist), and I have tried replacing the for loop with a plain if 'pp.' in pop: or if 'pp.' in poplist:.
CodePudding user response:
I apologize that my question was not limited to a specific problem. I am new to Python, and it has been easy to mix too many ideas together without realizing that the concepts are actually separate. I still do not quite understand why the earlier loop raised "ValueError: Columns must be same length as key", but I have at least found a different way to process this loop.
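One likely explanation for that ValueError, offered as a sketch rather than a confirmed diagnosis: with axis=1, df.apply passes each row to the function as a Series, and 'pp.' in row tests the row's index labels (the column names), not the text in the cells. The condition is therefore never true, the function returns the whole row, and apply builds a DataFrame with several columns, which cannot be assigned to one column. The frame and the function name below are hypothetical and made up for illustration.

```python
import pandas as pd

# Hypothetical two-row frame; the values are made up for illustration.
df = pd.DataFrame({
    'Place of Publication': ['London', 'pp. xii. 123. J. Smith: Oxford, 1850'],
    'Date of Publication': ['1879', None],
})

def returns_whole_row(row):
    # With axis=1, `row` is a Series whose index holds the column names,
    # so `in` checks those labels, not the cell text: this is False here.
    if 'pp.' in row:
        return 'never reached for this data'
    return row  # the entire row comes back, not a single value

result = df.apply(returns_whole_row, axis=1)
is_frame = isinstance(result, pd.DataFrame)  # one column per original column

try:
    # Assigning a multi-column DataFrame to a single column fails.
    df['Place of Publication'] = result
    raised = False
except ValueError:
    raised = True
```

Returning a single string per row, as the functions further down do, makes apply produce a Series, which fits into one column.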
I had an epiphany and realized a way to create a workable loop:
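# Run this first: it reads the date out of 'Place of Publication'
# before clean_pop overwrites that column.
def clean_dates(item):
    pop = str(item.loc['Place of Publication'])
    dop = str(item.loc['Date of Publication'])
    if pop[0:3] != 'pp.':
        return dop
    # The date is whatever follows the first ", " in the malformed value.
    character_index_comma = pop.find(',')
    return pop[character_index_comma + 2:]

df['Date of Publication'] = df.apply(clean_dates, axis=1)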
def clean_pop(item):
    pop = str(item.loc['Place of Publication'])
    if pop[0:3] == 'pp.':
        # Keep only the text between ": " and the first comma, i.e. the place.
        character_index_comma = pop.find(',')
        character_index_colon = pop.find(':')
        pop = pop[character_index_colon + 2:character_index_comma]
    return pop

df['Place of Publication'] = df.apply(clean_pop, axis=1)
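The two functions above depend on running in that order, because the date has to be read before the place column is overwritten. A possible alternative, shown only as a minimal sketch, is to pull both pieces out of the string in one step. The helper name split_pp_value and the sample values are hypothetical, and the sketch assumes the layout described in the question (a colon before the place, a comma before the date).

```python
def split_pp_value(text):
    """Return (place, date) for a 'pp. ...: Place, Date' string, else None."""
    if not isinstance(text, str) or not text.startswith('pp.'):
        return None
    colon = text.find(':')
    comma = text.find(',', colon + 1)  # first comma after the colon
    if colon == -1 or comma == -1:
        return None  # does not match the expected layout; leave untouched
    place = text[colon + 1:comma].strip()
    date = text[comma + 1:].strip()
    return place, date

# Hypothetical sample values, made up for illustration.
samples = ['London', 'pp. xii. 123. J. Smith: Oxford, 1850', 'pp. 40', float('nan')]
parsed = [split_pp_value(s) for s in samples]
```

Returning None for anything that does not match means ordinary values and missing values are left alone, and the caller can write the place and date back to their columns only for rows where a pair came back.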