convert for-loop output into dataframe python-CodePudding

I am trying to convert the output of this code into a dataframe, but do not know how. What is a good way to turn the output columns (string and frequency) into a dataframe?

# Extract data
url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)
first_df = df[0]


column = "Select Investors"
all_investor = []
for i in first_df[column]:
    all_investor  = str(i).lower().split(',')

# Calculate frequency
for string in all_investor:
    string = string.strip()
    frequency = first_df[column].apply(
        lambda x: string in str(x).lower()).sum()
    print(string, frequency)

Output:
andreessen horowitz 41
new enterprise associates 21
battery ventures 14
index ventures 30
dst global 19
ribbit capital 8
forerunner ventures 4
crosslink capital 4
homebrew 2
sequoia capital 115
thoma bravo 3
softbank 50
tencent holdings 28
lightspeed india partners 4
sequoia capital india 25
ggv capital 14

....

CodePudding user response：

Use str.split and value_counts:

>>> df['Select Investors'].str.split(', ').explode().str.lower().value_counts()

accel                      54
tiger global management    48
sequoia capital china      46
andreessen horowitz        42
sequoia capital            41
                           ..
.406 ventures               1
transamerica ventures       1
crane venture partners      1
geekdom fund                1
endiya partners             1
Name: Select Investors, Length: 1187, dtype: int64

CodePudding user response：

# Extract data
url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)
first_df = df[0]


column = "Select Investors"
all_investor = []
for i in first_df[column]:
    all_investor  = [j.strip() for j in str(i).lower().split(',') if j.strip()]
all_investor = set(all_investor)

all_data = []
# Calculate frequency
for string in all_investor:
    frequency = first_df[column].apply(
        lambda x: string in str(x).lower()).sum()
    all_data.append([string, frequency])

new_df = pd.DataFrame(all_data, columns=["Investor", "Frequency"])
new_df = new_df.sort_values(by='Frequency',ascending=False).reset_index()
new_df

Output:
    index       Investor                            Frequency
0   854         sequoia capital                     115
1   626         ing                                 90
2   765         accel                               62
3   1025        tiger global                        50
4   964         softbank                            50
...     ...     ...     ...
1180    486     quiet capital                       1
1181    487     york capital management             1
1182    489     ewtp capital                        1
1183    490     kleiner perkins caulfield & byers   1
1184    1184    google capital                      1

CodePudding user response：

Use Series.str.split, reshape by DataFrame.stack, convert to lowercase by Series.str.lower and last count by Series.value_counts:

url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)[0]

s = df['Select Investors'].str.split(', ', expand=True).stack().str.lower().value_counts()
print (s)
accel                              54
tiger global management            48
sequoia capital china              46
andreessen horowitz                42
sequoia capital                    41
                                   ..
almaz capital partners              1
commerzventures                     1
sunley house capital management     1
lockheed martin ventures            1
14w                                 1
Length: 1187, dtype: int64

For DataFrame use:

df = s.rename_axis('values').reset_index(name='count')
print (df)
                               values  count
0                               accel     54
1             tiger global management     48
2               sequoia capital china     46
3                 andreessen horowitz     42
4                     sequoia capital     41
                              ...    ...
1182           almaz capital partners      1
1183                  commerzventures      1
1184  sunley house capital management      1
1185         lockheed martin ventures      1
1186                              14w      1

[1187 rows x 2 columns]

If want modify your solution:

from collections import Counter

url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)[0]


column = "Select Investors"
all_investor = [j.strip() for i in df[column]  for j in str(i).lower().split(',')]

df1 = (pd.DataFrame(Counter(all_investor).items(), columns=['vals','count'])
         .sort_values(by='count',ascending=False, ignore_index=True))
print (df1)
                         vals  count
0                       accel     54
1     tiger global management     48
2       sequoia capital china     46
3         andreessen horowitz     42
4             sequoia capital     41
                      ...    ...
1180          futurex capital      1
1181            quiet capital      1
1182       white star capital      1
1183   almaz capital partners      1
1184          endiya partners      1

[1185 rows x 2 columns]