I am not sure how best to describe this situation. Suppose I have the following table in a pandas DataFrame:
0 1 2 3 4 5 ... 2949 2950 2951 2952 2953 2954
0.txt html head meta meta meta meta ...
107.txt html head title meta meta meta ...
125.txt html head title style body div ...
190.txt html head meta title style body ...
202.txt html head meta title link style
I want to reshape this table so that the columns are the unique HTML tags and each value is the number of times that tag occurs in the given row:
html head meta style link body ...
0.txt 1 1 4 2 1 2 ...
107.txt 1 2 3 0 0 1 ...
Something like the above. I have counted 88 distinct HTML tags in the table, so the result should have about 88 columns. If this works, I will then apply pandas' describe() and value_counts() functions to find out more about these tags' statistics. However, I am stuck at this step. Please give me some ideas for tackling this. Thank you.
CodePudding user response:
IIUC, you can first stack, then use groupby.value_counts to get the counts per original row, then unstack to get the expected result. With the data provided, for the first 3 rows and 6 columns, you get:
res = (
    df.stack()
    .groupby(level=0).value_counts()
    .unstack(fill_value=0)
)
print(res)
#          body  div  head  html  meta  style  title
# 0.txt       0    0     1     1     4      0      0
# 107.txt     0    0     1     1     3      0      1
# 125.txt     1    1     1     1     0      1      1
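For anyone who wants to run this end to end, here is a minimal self-contained sketch. It assumes the file names are the DataFrame index and rebuilds only the first three rows and six columns shown in the question; the helper name count_tags is made up for illustration.

```python
import pandas as pd

# Sample data copied from the question: first three rows, first six columns.
# Assumption: the file names are the index of the DataFrame.
df = pd.DataFrame(
    [
        ["html", "head", "meta", "meta", "meta", "meta"],
        ["html", "head", "title", "meta", "meta", "meta"],
        ["html", "head", "title", "style", "body", "div"],
    ],
    index=["0.txt", "107.txt", "125.txt"],
)


def count_tags(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per file and one column per tag, holding occurrence counts."""
    return (
        frame.stack()              # long Series indexed by (file, column position)
        .groupby(level=0)          # group the tags belonging to each file
        .value_counts()            # count every tag within its file
        .unstack(fill_value=0)     # tags become columns, absent tags get 0
    )


res = count_tags(df)
print(res)
```

Once the counts are in this shape, res.describe() summarises each tag column across files, which is the follow-up step mentioned in the question.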
CodePudding user response:
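To spread out the table in the way you described, you can also use the pandas method pivot_table(). It works on long-format data, so the wide table first has to be reshaped into one row per (file, tag) pair. pivot_table() then takes the column to use as the new index, the column whose values become the new columns, and an aggregation function. Its default aggregation is the mean, so you need to ask for counts explicitly.
Here is an example of how you might use pivot_table() to spread out your dataframe (the file name and the column names "file", "position" and "tag" are placeholders):
# Import the pandas library
import pandas as pd
# Load your dataframe, using the first column (the file names) as the index
df = pd.read_csv("your_data.csv", index_col=0)
# Reshape to long format: one row per (file, tag) pair
long_df = df.stack().rename_axis(["file", "position"]).reset_index(name="tag")
# Use the pivot_table() method to count how often each tag occurs in each file
pivoted_df = long_df.pivot_table(index="file", columns="tag", aggfunc="size", fill_value=0)
This will create a new dataframe with the rows representing the original row names, the columns representing the unique HTML tags, and the values representing the count of each tag in each row. After that, you can use the describe() and value_counts() methods mentioned in the question on this new dataframe to get statistics about the HTML tags.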
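A minimal runnable sketch of the pivot_table() route, under the assumption that the wide table is first melted into long format. The sample rows are the first three from the question, and the names wide, long_df, counts, "file" and "tag" are illustrative, not part of the original data.

```python
import pandas as pd

# Assumption: file names are the index; only the first three rows and
# six columns from the question are reproduced here.
wide = pd.DataFrame(
    [
        ["html", "head", "meta", "meta", "meta", "meta"],
        ["html", "head", "title", "meta", "meta", "meta"],
        ["html", "head", "title", "style", "body", "div"],
    ],
    index=["0.txt", "107.txt", "125.txt"],
)

# Long format: one row per (file, tag) occurrence.
long_df = (
    wide.rename_axis("file")
    .reset_index()
    .melt(id_vars="file", value_name="tag")
    .dropna(subset=["tag"])  # shorter documents leave empty cells behind
)

# aggfunc="size" counts rows per (file, tag) pair instead of averaging values.
counts = long_df.pivot_table(index="file", columns="tag", aggfunc="size", fill_value=0)
print(counts)

# Per-tag summary statistics across all files.
print(counts.describe())
```

The explicit aggfunc matters here: without it pivot_table() would try to average a values column, which is not what a tag count needs.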