I have a list of tuples of some Wikipedia data that I am scraping. I can get it into a DataFrame, but it all ends up in one column; I need it broken out into four columns, one for each tuple element.
import wikipedia
import pandas as pd

results = wikipedia.search('Kalim_Aajiz')
df = pd.DataFrame()
data = []
for i in results:
    wiki_page = wikipedia.page(i)
    data = wiki_page.title, wiki_page.url, wiki_page.summary, wiki_page.pageid
    dataList = list(data)
    print(dataList)
    df = df.append(dataList)
DATA RESULTS:
0 Kalim Aajiz
1 https://en.wikipedia.org/wiki/Kalim_Aajiz
2 Kalim Aajiz (1920 – 14 February 2015) was an I...
3 47137025
0 Robert Thurman
1 https://en.wikipedia.org/wiki/Robert_Thurman
2 Robert Alexander Farrar Thurman (born August 3...
3 475367
0 Ruskin Bond
1 https://en.wikipedia.org/wiki/Ruskin_Bond
2 Ruskin Bond (born 19 May 1934) is an Anglo Ind...
3 965456
0 Haldhar Nag
EXPECTED RESULTS:
NAME | URL | DESCRIPTION | ID
Kalim Aajiz https://en.wikipedia.org/wiki/Kalim_Aajiz was an I... 47137025
CodePudding user response:
Format it into a list of dictionaries, and then make a DataFrame at the end.
results = wikipedia.search('Kalim_Aajiz')
data_list = []
for i in results:
    wiki_page = wikipedia.page(i)
    data = {'title': wiki_page.title,
            'url': wiki_page.url,
            'summary': wiki_page.summary,
            'pageid': wiki_page.pageid}
    data_list.append(data)
df = pd.DataFrame(data_list)
df
Output:
title url summary pageid
0 Kalim Aajiz https://en.wikipedia.org/wiki/Kalim_Aajiz Kalim Aajiz (1920 – 14 February 2015) was an I... 47137025
1 Robert Thurman https://en.wikipedia.org/wiki/Robert_Thurman Robert Alexander Farrar Thurman (born August 3... 475367
2 Ruskin Bond https://en.wikipedia.org/wiki/Ruskin_Bond Ruskin Bond (born 19 May 1934) is an Anglo Ind... 965456
3 Haldhar Nag https://en.wikipedia.org/wiki/Haldhar_Nag Dr. Haldhar Nag (born 31 March 1950) is a Samb... 29466145
4 Sucheta Dalal https://en.wikipedia.org/wiki/Sucheta_Dalal Sucheta Dalal (born 1962) is an Indian busines... 4125323
5 Padma Shri https://en.wikipedia.org/wiki/Padma_Shri Padma Shri (IAST: padma śrī), also spelled Pad... 442893
6 Vairamuthu https://en.wikipedia.org/wiki/Vairamuthu Vairamuthu Ramasamy (born 13 July 1953) is an ... 3604328
7 Sal Khan https://en.wikipedia.org/wiki/Sal_Khan Salman Amin Khan (born October 11, 1976), comm... 26464673
8 Arvind Gupta https://en.wikipedia.org/wiki/Arvind_Gupta Arvind Gupta is an Indian toy inventor and exp... 29176509
9 Rajdeep Sardesai https://en.wikipedia.org/wiki/Rajdeep_Sardesai Rajdeep Sardesai (born 24 May 1965)is an India... 1673653
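If you also want the column headings from your expected output rather than the wiki_page attribute names, one optional extra step is to rename the columns of the frame built above:
df = df.rename(columns={'title': 'NAME', 'url': 'URL', 'summary': 'DESCRIPTION', 'pageid': 'ID'})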
CodePudding user response:
You could just build a dictionary with your for loop and then create the data frame at the end.
For example:
results = wikipedia.search('Kalim_Aajiz')
data1 = {"NAME": [], "URL": [], "DESCRIPTION": [], "ID": []}
for i in results:
    wiki_page = wikipedia.page(i)
    data2 = wiki_page.title, wiki_page.url, wiki_page.summary, wiki_page.pageid
    for key, value in zip(data1.keys(), data2):
        data1[key].append(value)
df = pd.DataFrame(data1)
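The zip pairs values with column names purely by position, so the key order in data1 has to match the attribute order in data2 (dict keys keep their insertion order in Python 3.7+). A minimal self-contained sketch of the same pattern, with a dummy tuple standing in for the wiki_page attributes:
import pandas as pd

data1 = {"NAME": [], "URL": [], "DESCRIPTION": [], "ID": []}
# dummy tuple in place of (wiki_page.title, wiki_page.url, wiki_page.summary, wiki_page.pageid)
data2 = ("Kalim Aajiz", "https://en.wikipedia.org/wiki/Kalim_Aajiz", "was an I...", 47137025)
for key, value in zip(data1.keys(), data2):
    data1[key].append(value)
df = pd.DataFrame(data1)
print(df)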
CodePudding user response:
You could set a grouped index value that would allow a pivot. Specifically, np.arange(len(df))//4 groups every four rows together, and the current index 0,1,2,3,0,1,2,3... identifies the columns for the pivot.
import numpy as np

dfp = (
    df.reset_index().assign(s=np.arange(len(df))//4).pivot(index=['s'], columns=['index'])
    .droplevel(0, axis=1).rename_axis(None, axis=1).rename_axis(None, axis=0)
)
dfp.columns = ['NAME','URL','DESCRIPTION','ID']
print(dfp)
Result
NAME URL DESCRIPTION ID
0 Kalim Aajiz https://en.wikipedia.org/wiki/Kalim_Aajiz Kalim Aajiz (1920 – 14 February 2015) was an I... 47137025
1 Robert Thurman https://en.wikipedia.org/wiki/Robert_Thurman Robert Alexander Farrar Thurman (born August 3... 475367
2 Ruskin Bond https://en.wikipedia.org/wiki/Ruskin_Bond Ruskin Bond (born 19 May 1934) is an Anglo Ind... 965456
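If you want to try the pivot without scraping, here is a small sketch that first rebuilds the same single-column shape from dummy tuples (the reshaping step is the same as above):
import numpy as np
import pandas as pd

# dummy rows standing in for (title, url, summary, pageid)
rows = [
    ("Kalim Aajiz", "https://en.wikipedia.org/wiki/Kalim_Aajiz", "was an I...", 47137025),
    ("Robert Thurman", "https://en.wikipedia.org/wiki/Robert_Thurman", "born August 3...", 475367),
]
# one value per row, index repeating 0,1,2,3 - the shape the question's append loop produces
df = pd.concat([pd.DataFrame(list(r)) for r in rows])

dfp = (
    df.reset_index().assign(s=np.arange(len(df))//4).pivot(index=['s'], columns=['index'])
    .droplevel(0, axis=1).rename_axis(None, axis=1).rename_axis(None, axis=0)
)
dfp.columns = ['NAME', 'URL', 'DESCRIPTION', 'ID']
print(dfp)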
CodePudding user response:
I don't think you need data at all; you could directly use the attributes of wiki_page:
df = pd.DataFrame(columns=["NAME", "URL", "DESCRIPTION", "ID"])
for i in results:
    wiki_page = wikipedia.page(i)
    df.loc[len(df.index)] = wiki_page.title, wiki_page.url, wiki_page.summary, wiki_page.pageid
or with pd.concat(), as DataFrame.append() is deprecated:
for i in results:
    wiki_page = wikipedia.page(i)
    df = pd.concat([
        df,
        pd.DataFrame([[wiki_page.title, wiki_page.url, wiki_page.summary, wiki_page.pageid]],
                     columns=["NAME", "URL", "DESCRIPTION", "ID"])
    ], ignore_index=True)
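Note that concatenating inside the loop copies the whole frame on every iteration, so for longer result lists it is usually faster to collect the rows first and build the frame once. A sketch of that variant, using the same wiki_page attributes:
rows = []
for i in results:
    wiki_page = wikipedia.page(i)
    rows.append([wiki_page.title, wiki_page.url, wiki_page.summary, wiki_page.pageid])
df = pd.DataFrame(rows, columns=["NAME", "URL", "DESCRIPTION", "ID"])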