I have a pyspark dataframe that looks like this:
df = spark.sql("""
SELECT list
FROM categories
""")
df.show()
list |
---|
1,1,1,2,2,apple |
apple,orange,1,2 |
How can I generate a field that counts the elements separated by commas? The ideal dataframe looks like this:
list | count |
---|---|
1,1,1,2,2,apple | 6 |
apple,orange,1,2 | 4 |
I would also like to create a frequency count of the individual elements, which looks like this:
list | frequency_count |
---|---|
1 | 4 |
2 | 3 |
apple | 2 |
orange | 1 |
CodePudding user response:
The element count is straightforward with the DataFrame API; for the frequency count I find the DataFrame API laborious, so I use SQL for that part. The implementation is as follows.
from pyspark.sql import functions as F

count_df = df.withColumn('count', F.size(F.split('list', ',')))  # number of comma-separated elements per row
count_df.show(truncate=False)
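With the two sample rows from the question, count_df.show(truncate=False) should print something like:
+----------------+-----+
|list            |count|
+----------------+-----+
|1,1,1,2,2,apple |6    |
|apple,orange,1,2|4    |
+----------------+-----+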
df.createOrReplaceTempView('tmp')
freq_sql = """
select list, count(*) count
from (
    -- split each row into an array, collect all arrays, flatten them
    -- into one array, then explode it into one row per element
    select explode(flatten(collect_list(split(list, ',')))) list
    from tmp
)
group by list
"""
freq_df = spark.sql(freq_sql)
freq_df.show(truncate=False)
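For reference, the same frequency count can also be written with the DataFrame API; this is only a sketch, assuming the same df with a single list column:
from pyspark.sql import functions as F

freq_df_api = (
    df.select(F.explode(F.split('list', ',')).alias('list'))  # one row per element
      .groupBy('list')
      .count()
      .withColumnRenamed('count', 'frequency_count')
)
freq_df_api.show(truncate=False)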
CodePudding user response:
You actually have two questions. To get the count of items in a comma-separated string you can do it like this (using pandas):
import pandas as pd
df = pd.DataFrame(data=["1,1,1,2,2,apple", "apple,orange,1,2"], columns=["list"])
df["count"] = df["list"].apply(lambda x: len(x.split(",")))
print(df)
which gives as output:
list count
0 1,1,1,2,2,apple 6
1 apple,orange,1,2 4
I use a lambda function here that splits each string on ',' (returning a list of the separate items) and then counts the items with len(). The apply method applies this lambda function to each item in the column. A vectorized alternative without the lambda is shown just below.
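The same count can also be computed with pandas' string accessor, which avoids the explicit lambda (assuming the same df as above):
df["count"] = df["list"].str.split(",").str.len()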
For the second question you want to count how often each unique element occurs across the whole column. You can first use the split method to split the strings into separate columns:
df_items = df["list"].str.split(",", expand=True)
print(df_items)
which gives you a dataframe like this:
0 1 2 3 4 5
0 1 1 1 2 2 apple
1 apple orange 1 2 None None
Then, since the position of each item does not matter, you can glue all the columns together into one single column using the melt method. melt puts all the items in a column called value, and value_counts gives the count of each unique item. The script looks like this:
frequency_count = pd.melt(df_items)["value"].value_counts()    # count of each unique element
frequency_count = frequency_count.to_frame("frequency_count")  # name the count column explicitly
frequency_count.index.name = "list"
which gives you the output:
frequency_count
list
1 4
2 3
apple 2
orange 1
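If your pandas version supports Series.explode (0.25 or newer), the frequency count can also be written more compactly; a sketch using the same df as above:
frequency_count = (
    df["list"].str.split(",")   # lists of elements per row
      .explode()                # one row per element
      .value_counts()           # count per unique element
      .to_frame("frequency_count")
)
frequency_count.index.name = "list"
print(frequency_count)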