I am trying to convert the output of this code into a dataframe, but do not know how. What is a good way to turn the output columns (string and frequency) into a dataframe?
# Extract data
url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)
first_df = df[0]
column = "Select Investors"
all_investor = []
for i in first_df[column]:
all_investor = str(i).lower().split(',')
# Calculate frequency
for string in all_investor:
string = string.strip()
frequency = first_df[column].apply(
lambda x: string in str(x).lower()).sum()
print(string, frequency)
Output:
andreessen horowitz 41
new enterprise associates 21
battery ventures 14
index ventures 30
dst global 19
ribbit capital 8
forerunner ventures 4
crosslink capital 4
homebrew 2
sequoia capital 115
thoma bravo 3
softbank 50
tencent holdings 28
lightspeed india partners 4
sequoia capital india 25
ggv capital 14
....
CodePudding user response:
Use str.split
and value_counts
:
>>> df['Select Investors'].str.split(', ').explode().str.lower().value_counts()
accel 54
tiger global management 48
sequoia capital china 46
andreessen horowitz 42
sequoia capital 41
..
.406 ventures 1
transamerica ventures 1
crane venture partners 1
geekdom fund 1
endiya partners 1
Name: Select Investors, Length: 1187, dtype: int64
CodePudding user response:
# Extract data
url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)
first_df = df[0]
column = "Select Investors"
all_investor = []
for i in first_df[column]:
all_investor = [j.strip() for j in str(i).lower().split(',') if j.strip()]
all_investor = set(all_investor)
all_data = []
# Calculate frequency
for string in all_investor:
frequency = first_df[column].apply(
lambda x: string in str(x).lower()).sum()
all_data.append([string, frequency])
new_df = pd.DataFrame(all_data, columns=["Investor", "Frequency"])
new_df = new_df.sort_values(by='Frequency',ascending=False).reset_index()
new_df
Output:
index Investor Frequency
0 854 sequoia capital 115
1 626 ing 90
2 765 accel 62
3 1025 tiger global 50
4 964 softbank 50
... ... ... ...
1180 486 quiet capital 1
1181 487 york capital management 1
1182 489 ewtp capital 1
1183 490 kleiner perkins caulfield & byers 1
1184 1184 google capital 1
CodePudding user response:
Use Series.str.split
, reshape by DataFrame.stack
, convert to lowercase by Series.str.lower
and last count by Series.value_counts
:
url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)[0]
s = df['Select Investors'].str.split(', ', expand=True).stack().str.lower().value_counts()
print (s)
accel 54
tiger global management 48
sequoia capital china 46
andreessen horowitz 42
sequoia capital 41
..
almaz capital partners 1
commerzventures 1
sunley house capital management 1
lockheed martin ventures 1
14w 1
Length: 1187, dtype: int64
For DataFrame use:
df = s.rename_axis('values').reset_index(name='count')
print (df)
values count
0 accel 54
1 tiger global management 48
2 sequoia capital china 46
3 andreessen horowitz 42
4 sequoia capital 41
... ...
1182 almaz capital partners 1
1183 commerzventures 1
1184 sunley house capital management 1
1185 lockheed martin ventures 1
1186 14w 1
[1187 rows x 2 columns]
If want modify your solution:
from collections import Counter
url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)[0]
column = "Select Investors"
all_investor = [j.strip() for i in df[column] for j in str(i).lower().split(',')]
df1 = (pd.DataFrame(Counter(all_investor).items(), columns=['vals','count'])
.sort_values(by='count',ascending=False, ignore_index=True))
print (df1)
vals count
0 accel 54
1 tiger global management 48
2 sequoia capital china 46
3 andreessen horowitz 42
4 sequoia capital 41
... ...
1180 futurex capital 1
1181 quiet capital 1
1182 white star capital 1
1183 almaz capital partners 1
1184 endiya partners 1
[1185 rows x 2 columns]