I have a csv that I want to use to search an api for data, but the row which stores the data used for the api search can contain a second value separated by ;
like this:
2 Jan Rohls Kunst und Religion zwischen Mittelalter und Barock : von Dante bis Bach -
3 Karl-Markus Ritter Der Dom zu Speyer : weltliche Macht und christlicher Glaube 9783806241280; 3806241287
4 Hape Kerkeling Ich bin dann mal weg : meine Reise auf dem Jakobsweg 9783890296005; 3890296009
I want to remove the values after and including ;
so that only the 9783XXX remains in the row ISBN. It would be perfect, if the code could be included in the existing process, but I haven't found a way how to manage that.
Here's the full extracted csv: https://pastebin.com/frAK9NBG
These are some scripts I use to
a) CLean a pre-existing csv:
import pandas as pd
import glob
import os
path= "./**/extract/"
filelist=glob.glob('./**/Reihe A/Reihe*.csv',recursive=True)
print(filelist)
for file in filelist:
data=pd.read_csv(file, sep="\t", encoding='utf8')
data.columns
title=[] #this gets all the titles
for row in data['Titel']:
title.append(row)
author=[] #this gets all the authors
for row in data['Verfasser']:
author.append(row)
isbn=[] #this gets all the isbns
for row in data['ISBN']:
isbn.append(row)
df=pd.DataFrame({'Verfasser': author, 'Titel': title, 'ISBN': isbn}) #create new csvs based on extracted data
#save csv in set path
df.to_csv(path file "_" "extract", sep='\t', encoding='utf8')
#df.to_csv(path file '_' 'extract.csv', sep="\t") #add endpath , encoding='utf-8' back if needed
print(file '_' 'extract.csv' ' saved to ' path)
and to
b) search the api
import glob
import pandas as pd
from urllib.request import urlopen
#import generated csvs from other script
filelist=glob.glob('./**/Reihe*_extract.csv',recursive=True)
print(filelist)
for file in filelist:
#read csv, make a list of all isbns
data=pd.read_csv(file, sep="\t", encoding='utf8')
isbnlist=[]
for row in data['ISBN']:
isbnlist.append(row)
#for each isbn in list, get data from api
for isbn in isbnlist:
url = 'http://sru.k10plus.de/gvk!rec=1?version=1.1&operation=searchRetrieve&query=pica.isb=[isbn]&maximumRecords=10&recordSchema=marcxml'
url = url.replace('[isbn]', isbn)
print(url)
as of right now, I'm splitting my codes based on their function (clean csv, search api etc.) but I plan to make a launcher script that does all for the end user.
Any help is appreciated.
CodePudding user response:
Try this, split by seperator and keep wanted split:
data['ISBN'] = [x.split(' ')[0] for x in data['ISBN']] #keeps first portion of split.