how to convert tuples in a column rows of a pandas dataframe into repeating rows and columns?-CodePudding

I have a dataframe which contains the following data (only 3 samples are provided here):

data = {'Department' : ['D1', 'D2', 'D3'],
'TopWords' : [[('cat', 6), ('project', 6), ('dog', 6), ('develop', 4), ('smooth', 4), ('efficient', 4), ('administrative', 4), ('procedure', 4), ('establishment', 3), ('matter', 3)],
[('management', 21), ('satisfaction', 12), ('within', 9), ('budget', 9), ('township', 9), ('site', 9), ('periodic', 9), ('admin', 9), ('maintain', 9), ('guest', 6)],
[('manage', 2), ('ir', 2), ('mines', 2), ('implimentation', 2), ('clrg', 2), ('act', 2), ('implementations', 2), ('office', 2), ('maintenance', 2), ('administration', 2)]]}


# Create DataFrame
df = pd.DataFrame(data)

Basically, each row contains a tuple of top 10 words along with their frequencies in each of Department.

I wanted to create a dataframe where (let department name get repeated and) each row contains the word from tuple in one column and frequency count in the other columns so that it should look like something this:

Department  Word            Counts
D1          cat             6
D1          project         6
D1          dog             6
D1          develop         4
D1          smooth          4
D1          efficient       4
D1          administrative  4
D1          procedure       4
D1          establishment   3
D1          matter          3
D2          management      21
D2          satisfaction    12
D2          within          9
D2          budget          9
D2          township        9

Is there any work around this type of conversion?

CodePudding user response：

I'd suggest you do the wrangling before loading into a dataframe, with the data dictionary:

length = [len(entry) for entry in data['TopWords']]
department = {'Department' : np.repeat(data['Department'], length)}
(pd
.DataFrame([ent for entry in data['TopWords'] for ent in entry],     
            columns = ['Word', 'Counts'])
.assign(**department)
)

               Word  Counts Department
0               cat       6         D1
1           project       6         D1
2               dog       6         D1
3           develop       4         D1
4            smooth       4         D1
5         efficient       4         D1
6    administrative       4         D1
7         procedure       4         D1
8     establishment       3         D1
9            matter       3         D1
10       management      21         D2
11     satisfaction      12         D2
12           within       9         D2
13           budget       9         D2
14         township       9         D2
15             site       9         D2
16         periodic       9         D2
17            admin       9         D2
18         maintain       9         D2
19            guest       6         D2
20           manage       2         D3
21               ir       2         D3
22            mines       2         D3
23   implimentation       2         D3
24             clrg       2         D3
25              act       2         D3
26  implementations       2         D3
27           office       2         D3
28      maintenance       2         D3
29   administration       2         D3

CodePudding user response：

First, use DataFrame.explode to separate the list elements into different rows. Then split the tuples into different columns, e.g. using DataFrame.assign Series.str

res = (
    df.explode('TopWords', ignore_index=True)
      .assign(Word=lambda df: df['TopWords'].str[0], 
              Counts=lambda df: df['TopWords'].str[1])
      .drop(columns='TopWords')
)

Output:

>>> res 

   Department             Word  Counts
0          D1              cat       6
1          D1          project       6
2          D1              dog       6
3          D1          develop       4
4          D1           smooth       4
5          D1        efficient       4
6          D1   administrative       4
7          D1        procedure       4
8          D1    establishment       3
9          D1           matter       3
10         D2       management      21
11         D2     satisfaction      12
12         D2           within       9
13         D2           budget       9
14         D2         township       9
15         D2             site       9
16         D2         periodic       9
17         D2            admin       9
18         D2         maintain       9
19         D2            guest       6
20         D3           manage       2
21         D3               ir       2
22         D3            mines       2
23         D3   implimentation       2
24         D3             clrg       2
25         D3              act       2
26         D3  implementations       2
27         D3           office       2
28         D3      maintenance       2
29         D3   administration       2

As @sammywemmy suggested, if you are dealing with a considerable amount of data, it will be faster if you wrangle it before loading it into a DataFrame.

Another way of doing it using a nested loop

data = {'Department' : ['D1', 'D2', 'D3'],
'TopWords' : [[('cat', 6), ('project', 6), ('dog', 6), ('develop', 4), ('smooth', 4), ('efficient', 4), ('administrative', 4), ('procedure', 4), ('establishment', 3), ('matter', 3)],
[('management', 21), ('satisfaction', 12), ('within', 9), ('budget', 9), ('township', 9), ('site', 9), ('periodic', 9), ('admin', 9), ('maintain', 9), ('guest', 6)],
[('manage', 2), ('ir', 2), ('mines', 2), ('implimentation', 2), ('clrg', 2), ('act', 2), ('implementations', 2), ('office', 2), ('maintenance', 2), ('administration', 2)]]}

records = []
for idx, top_words_list in enumerate(data['TopWords']):
    for word, count in top_words_list:
        rec = {
            'Department': data['Department'][idx],
            'Word': word,
            'Count': count
        }
        records.append(rec)
        
res = pd.DataFrame(records)

CodePudding user response：

In addition to @sammywemmy answer, following approach would not need numpy package, however due to double loops, it might not be as performant in large number of datasets.

df = {"Department": [], "Words": [], "Count": []}
for i in range(len(data["Department"])):
    for j in range(len(data["TopWords"][i])):
        df["Department"].append(data["Department"][i])
        df["Words"].append(data["TopWords"][i][j][0])
        df["Count"].append(data["TopWords"][i][j][1])

print(pd.DataFrame(df))

# (SUGGESTION) A more pythonic version of the code above 

d = {"Department": [], "Words": [], "Count": []}
for idx, department in enumerate(data["Department"]):
    for word, count in data["TopWords"][idx]:
        d["Department"].append(department)
        d["Words"].append(word)
        d["Count"].append(count)

print(pd.DataFrame(d))

CodePudding user response：

One option using a dictionary comprehension:

(df
 .drop(columns='TopWords')
 .join(pd.concat({k: pd.DataFrame(x, columns=['Word', 'Counts'])
                  for k,x in enumerate(df['TopWords'])}).droplevel(1))

)

output:

  Department             Word  Counts
0         D1              cat       6
0         D1          project       6
0         D1              dog       6
0         D1          develop       4
0         D1           smooth       4
0         D1        efficient       4
0         D1   administrative       4
0         D1        procedure       4
0         D1    establishment       3
0         D1           matter       3
1         D2       management      21
1         D2     satisfaction      12
1         D2           within       9
1         D2           budget       9
1         D2         township       9
1         D2             site       9
1         D2         periodic       9
1         D2            admin       9
1         D2         maintain       9
1         D2            guest       6
2         D3           manage       2
2         D3               ir       2
2         D3            mines       2
2         D3   implimentation       2
2         D3             clrg       2
2         D3              act       2
2         D3  implementations       2
2         D3           office       2
2         D3      maintenance       2
2         D3   administration       2