Home > Mobile >  Create data.frame with count function
Create data.frame with count function

Time:02-13

I have a data table about movies with approximately 1600000 rows and 9 columns. I want to create a table that shows the number of movies released per year. This table should include two columns of my current data frame: year and n_movies. The first column should contain the year, the second the number of movies in that year, and the third which would be a new column, counting up n_movies by the first year until the last year.

I manage to get the first two columns from my current data table, but I do not know how to get the third column, which should count up n_movies by first year.

I have tried the following code:

df.movies <- data.frame(dt.movies.actor$year, 
                        dt.movies.actor$n_movies, 
                        count(dt.movies.actor$n_movies))

Below an example of my current data frame.

                                           movie                 actor year duration adult_movie rating votes n_movies
 1:                     The Gay Divorcee (1934) Fred Astaire (b.1899) 1934      107           0    7.5  6446     7524
 2:                              Roberta (1935) Fred Astaire (b.1899) 1935      106           0    7.1  2761     7708
 3:                              Top Hat (1935) Fred Astaire (b.1899) 1935      101           0    7.8 16254     7708
 4:                     Follow the Fleet (1936) Fred Astaire (b.1899) 1936      110           0    7.2  3745     8071
 5:                           Swing Time (1936) Fred Astaire (b.1899) 1936      103           0    7.6 11487     8071
 6:                 A Damsel in Distress (1937) Fred Astaire (b.1899) 1937      101           0    6.8  1657     8165
 7:                       Shall We Dance (1937) Fred Astaire (b.1899) 1937      109           0    7.5  6143     8165
 8:                             Carefree (1938) Fred Astaire (b.1899) 1938       83           0    7.1  3070     8170
 9: The Story of Vernon and Irene Castle (1939) Fred Astaire (b.1899) 1939       93           0    6.9  2145     7846
10:              Broadway Melody of 1940 (1940) Fred Astaire (b.1899) 1940      102           0    7.4  2001     7455
11:                        Second Chorus (1940) Fred Astaire (b.1899) 1940       84           0    5.8  1211     7455
12:                You'll Never Get Rich (1941) Fred Astaire (b.1899) 1941       88           0    6.8  1995     6425
13:                          Holiday Inn (1942) Fred Astaire (b.1899) 1942      100           0    7.4 12406     5659
14:              You Were Never Lovelier (1942) Fred Astaire (b.1899) 1942       97           0    7.3  3012     5659
15:                  The Sky's the Limit (1943) Fred Astaire (b.1899) 1943       89           0    6.4   770     5203
16:                Yolanda and the Thief (1945) Fred Astaire (b.1899) 1945      108           0    6.0   793     4229
17:                           Blue Skies (1946) Fred Astaire (b.1899) 1946       99           0    6.5  1172     5206
18:                     Ziegfeld Follies (1945) Fred Astaire (b.1899) 1945      110           0    6.5  2727     4229
19:                        Easter Parade (1948) Fred Astaire (b.1899) 1948      103           0    7.4  8007     6378
20:             The Barkleys of Broadway (1949) Fred Astaire (b.1899) 1949      109           0    7.0  2592     6921 ```

CodePudding user response:

Are you looking for such a solution?

library(dplyr)

df %>% 
  count(year, n_movies)
    year n_movies     n
   <dbl>    <dbl> <int>
 1  1934     7524     1
 2  1935     7708     2
 3  1936     8071     2
 4  1937     8165     2
 5  1938     8170     1
 6  1939     7846     1
 7  1940     7455     2
 8  1941     6425     1
 9  1942     5659     2
10  1943     5203     1
11  1945     4229     2
12  1946     5206     1
13  1948     6378     1

data:

df <- structure(list(movie = c("The Gay Divorcee (1934)", "Roberta (1935)", 
"Top Hat (1935)", "Follow the Fleet (1936)", "Swing Time (1936)", 
"A Damsel in Distress (1937)", "Shall We Dance (1937)", "Carefree (1938)", 
"The Story of Vernon and Irene Castle (1939)", "Broadway Melody of 1940 (1940)", 
"Second Chorus (1940)", "You'll Never Get Rich (1941)", "Holiday Inn (1942)", 
"You Were Never Lovelier (1942)", "The Sky's the Limit (1943)", 
"Yolanda and the Thief (1945)", "Blue Skies (1946)", "Ziegfeld Follies (1945)", 
"Easter Parade (1948)"), actor = c("Fred Astaire (b.1899)", "Fred Astaire (b.1899)", 
"Fred Astaire (b.1899)", "Fred Astaire (b.1899)", "Fred Astaire (b.1899)", 
"Fred Astaire (b.1899)", "Fred Astaire (b.1899)", "Fred Astaire (b.1899)", 
"Fred Astaire (b.1899)", "Fred Astaire (b.1899)", "Fred Astaire (b.1899)", 
"Fred Astaire (b.1899)", "Fred Astaire (b.1899)", "Fred Astaire (b.1899)", 
"Fred Astaire (b.1899)", "Fred Astaire (b.1899)", "Fred Astaire (b.1899)", 
"Fred Astaire (b.1899)", "Fred Astaire (b.1899)"), year = c(1934, 
1935, 1935, 1936, 1936, 1937, 1937, 1938, 1939, 1940, 1940, 1941, 
1942, 1942, 1943, 1945, 1946, 1945, 1948), duration = c(107, 
106, 101, 110, 103, 101, 109, 83, 93, 102, 84, 88, 100, 97, 89, 
108, 99, 110, 103), adult_movie = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0), rating = c(7.5, 7.1, 7.8, 7.2, 
7.6, 6.8, 7.5, 7.1, 6.9, 7.4, 5.8, 6.8, 7.4, 7.3, 6.4, 6, 6.5, 
6.5, 7.4), votes = c(644, 276, 1625, 374, 1148, 165, 614, 307, 
214, 200, 121, 199, 1240, 301, 77, 79, 117, 272, 800), n_movies = c(7524, 
7708, 7708, 8071, 8071, 8165, 8165, 8170, 7846, 7455, 7455, 6425, 
5659, 5659, 5203, 4229, 5206, 4229, 6378)), row.names = c(NA, 
-19L), spec = structure(list(cols = list(movie = structure(list(), class = c("collector_character", 
"collector")), actor = structure(list(), class = c("collector_character", 
"collector")), year = structure(list(), class = c("collector_double", 
"collector")), duration = structure(list(), class = c("collector_double", 
"collector")), adult_movie = structure(list(), class = c("collector_double", 
"collector")), rating = structure(list(), class = c("collector_double", 
"collector")), votes = structure(list(), class = c("collector_double", 
"collector")), n_movies = structure(list(), class = c("collector_double", 
"collector")), ...9 = structure(list(), class = c("collector_skip", 
"collector"))), default = structure(list(), class = c("collector_guess", 
"collector")), delim = ","), class = "col_spec"), problems = <pointer: 0x00000190f1f5db30>, class = c("spec_tbl_df", 
"tbl_df", "tbl", "data.frame"))

CodePudding user response:

Deleted since I was using the wrong column and it really seems too much of a hassle to go back and fix.

First construct a data.frame

temp <- data.frame( counts = c(table(dt.movies.actor$year)), 
                    years=as.numeric(names(table(dt.movies.actor$year))) )

Then add a column that is cumsum() of the counts:

final <- cbind(temp, running_tot <- cumsum(temp$counts) )

None of this is tested, since you ogffer output of a data.table object in a format that is messy. Learn to post easily pasted data objects with dput.

I think this might give your the temp results more elegantly:

temp <- as.data.frame( table(dt.movies.actor$year) )

... but the names of the columns are Var1 and Freq, so you would need to proceed appropriately.

This is a simplified version of the TarJae data:

df <- structure(list(movie = c("The Gay Divorcee (1934)", "Roberta (1935)", 
                               "Top Hat (1935)", "Follow the Fleet (1936)", "Swing Time (1936)", 
                               "A Damsel in Distress (1937)", "Shall We Dance (1937)", "Carefree (1938)", 
                               "The Story of Vernon and Irene Castle (1939)", "Broadway Melody of 1940 (1940)", 
                               "Second Chorus (1940)", "You'll Never Get Rich (1941)", "Holiday Inn (1942)", 
                               "You Were Never Lovelier (1942)", "The Sky's the Limit (1943)", 
                               "Yolanda and the Thief (1945)", "Blue Skies (1946)", "Ziegfeld Follies (1945)", 
                               "Easter Parade (1948)"), actor = c("Fred Astaire (b.1899)", "Fred Astaire (b.1899)", 
                                                                  "Fred Astaire (b.1899)", "Fred Astaire (b.1899)", "Fred Astaire (b.1899)", 
                                                                  "Fred Astaire (b.1899)", "Fred Astaire (b.1899)", "Fred Astaire (b.1899)", 
                                                                  "Fred Astaire (b.1899)", "Fred Astaire (b.1899)", "Fred Astaire (b.1899)", 
                                                                  "Fred Astaire (b.1899)", "Fred Astaire (b.1899)", "Fred Astaire (b.1899)", 
                                                                  "Fred Astaire (b.1899)", "Fred Astaire (b.1899)", "Fred Astaire (b.1899)", 
                                                                  "Fred Astaire (b.1899)", "Fred Astaire (b.1899)"), year = c(1934, 
                                                                                                                              1935, 1935, 1936, 1936, 1937, 1937, 1938, 1939, 1940, 1940, 1941, 
                                                                                                                              1942, 1942, 1943, 1945, 1946, 1945, 1948), duration = c(107, 
                                                                                                                                                                                      106, 101, 110, 103, 101, 109, 83, 93, 102, 84, 88, 100, 97, 89, 
                                                                                                                                                                                      108, 99, 110, 103), adult_movie = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 
                                                                                                                                                                                                                          0, 0, 0, 0, 0, 0, 0, 0, 0, 0), rating = c(7.5, 7.1, 7.8, 7.2, 
                                                                                                                                                                                                                                                                    7.6, 6.8, 7.5, 7.1, 6.9, 7.4, 5.8, 6.8, 7.4, 7.3, 6.4, 6, 6.5, 
                                                                                                                                                                                                                                                                    6.5, 7.4), votes = c(644, 276, 1625, 374, 1148, 165, 614, 307, 
                                                                                                                                                                                                                                                                                         214, 200, 121, 199, 1240, 301, 77, 79, 117, 272, 800), n_movies = c(7524, 
                                                                                                                                                                                                                                                                                                                                                             7708, 7708, 8071, 8071, 8165, 8165, 8170, 7846, 7455, 7455, 6425, 
                                                                                                                                                                                                                                                                                                                                                             5659, 5659, 5203, 4229, 5206, 4229, 6378)), row.names=c(NA, -19L), class= "data.frame")
  • Related