I have a data table about movies with approximately 1600000 rows and 9 columns. I want to create a table that shows the number of movies released per year. This table should include two columns of my current data frame: year
and n_movies
. The first column should contain the year, the second the number of movies in that year, and the third which would be a new column, counting up n_movies by the first year until the last year.
I manage to get the first two columns from my current data table, but I do not know how to get the third column, which should count up n_movies by first year.
I have tried the following code:
df.movies <- data.frame(dt.movies.actor$year,
dt.movies.actor$n_movies,
count(dt.movies.actor$n_movies))
Below an example of my current data frame.
movie actor year duration adult_movie rating votes n_movies
1: The Gay Divorcee (1934) Fred Astaire (b.1899) 1934 107 0 7.5 6446 7524
2: Roberta (1935) Fred Astaire (b.1899) 1935 106 0 7.1 2761 7708
3: Top Hat (1935) Fred Astaire (b.1899) 1935 101 0 7.8 16254 7708
4: Follow the Fleet (1936) Fred Astaire (b.1899) 1936 110 0 7.2 3745 8071
5: Swing Time (1936) Fred Astaire (b.1899) 1936 103 0 7.6 11487 8071
6: A Damsel in Distress (1937) Fred Astaire (b.1899) 1937 101 0 6.8 1657 8165
7: Shall We Dance (1937) Fred Astaire (b.1899) 1937 109 0 7.5 6143 8165
8: Carefree (1938) Fred Astaire (b.1899) 1938 83 0 7.1 3070 8170
9: The Story of Vernon and Irene Castle (1939) Fred Astaire (b.1899) 1939 93 0 6.9 2145 7846
10: Broadway Melody of 1940 (1940) Fred Astaire (b.1899) 1940 102 0 7.4 2001 7455
11: Second Chorus (1940) Fred Astaire (b.1899) 1940 84 0 5.8 1211 7455
12: You'll Never Get Rich (1941) Fred Astaire (b.1899) 1941 88 0 6.8 1995 6425
13: Holiday Inn (1942) Fred Astaire (b.1899) 1942 100 0 7.4 12406 5659
14: You Were Never Lovelier (1942) Fred Astaire (b.1899) 1942 97 0 7.3 3012 5659
15: The Sky's the Limit (1943) Fred Astaire (b.1899) 1943 89 0 6.4 770 5203
16: Yolanda and the Thief (1945) Fred Astaire (b.1899) 1945 108 0 6.0 793 4229
17: Blue Skies (1946) Fred Astaire (b.1899) 1946 99 0 6.5 1172 5206
18: Ziegfeld Follies (1945) Fred Astaire (b.1899) 1945 110 0 6.5 2727 4229
19: Easter Parade (1948) Fred Astaire (b.1899) 1948 103 0 7.4 8007 6378
20: The Barkleys of Broadway (1949) Fred Astaire (b.1899) 1949 109 0 7.0 2592 6921 ```
CodePudding user response:
Are you looking for such a solution?
library(dplyr)
df %>%
count(year, n_movies)
year n_movies n
<dbl> <dbl> <int>
1 1934 7524 1
2 1935 7708 2
3 1936 8071 2
4 1937 8165 2
5 1938 8170 1
6 1939 7846 1
7 1940 7455 2
8 1941 6425 1
9 1942 5659 2
10 1943 5203 1
11 1945 4229 2
12 1946 5206 1
13 1948 6378 1
data:
df <- structure(list(movie = c("The Gay Divorcee (1934)", "Roberta (1935)",
"Top Hat (1935)", "Follow the Fleet (1936)", "Swing Time (1936)",
"A Damsel in Distress (1937)", "Shall We Dance (1937)", "Carefree (1938)",
"The Story of Vernon and Irene Castle (1939)", "Broadway Melody of 1940 (1940)",
"Second Chorus (1940)", "You'll Never Get Rich (1941)", "Holiday Inn (1942)",
"You Were Never Lovelier (1942)", "The Sky's the Limit (1943)",
"Yolanda and the Thief (1945)", "Blue Skies (1946)", "Ziegfeld Follies (1945)",
"Easter Parade (1948)"), actor = c("Fred Astaire (b.1899)", "Fred Astaire (b.1899)",
"Fred Astaire (b.1899)", "Fred Astaire (b.1899)", "Fred Astaire (b.1899)",
"Fred Astaire (b.1899)", "Fred Astaire (b.1899)", "Fred Astaire (b.1899)",
"Fred Astaire (b.1899)", "Fred Astaire (b.1899)", "Fred Astaire (b.1899)",
"Fred Astaire (b.1899)", "Fred Astaire (b.1899)", "Fred Astaire (b.1899)",
"Fred Astaire (b.1899)", "Fred Astaire (b.1899)", "Fred Astaire (b.1899)",
"Fred Astaire (b.1899)", "Fred Astaire (b.1899)"), year = c(1934,
1935, 1935, 1936, 1936, 1937, 1937, 1938, 1939, 1940, 1940, 1941,
1942, 1942, 1943, 1945, 1946, 1945, 1948), duration = c(107,
106, 101, 110, 103, 101, 109, 83, 93, 102, 84, 88, 100, 97, 89,
108, 99, 110, 103), adult_movie = c(0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0), rating = c(7.5, 7.1, 7.8, 7.2,
7.6, 6.8, 7.5, 7.1, 6.9, 7.4, 5.8, 6.8, 7.4, 7.3, 6.4, 6, 6.5,
6.5, 7.4), votes = c(644, 276, 1625, 374, 1148, 165, 614, 307,
214, 200, 121, 199, 1240, 301, 77, 79, 117, 272, 800), n_movies = c(7524,
7708, 7708, 8071, 8071, 8165, 8165, 8170, 7846, 7455, 7455, 6425,
5659, 5659, 5203, 4229, 5206, 4229, 6378)), row.names = c(NA,
-19L), spec = structure(list(cols = list(movie = structure(list(), class = c("collector_character",
"collector")), actor = structure(list(), class = c("collector_character",
"collector")), year = structure(list(), class = c("collector_double",
"collector")), duration = structure(list(), class = c("collector_double",
"collector")), adult_movie = structure(list(), class = c("collector_double",
"collector")), rating = structure(list(), class = c("collector_double",
"collector")), votes = structure(list(), class = c("collector_double",
"collector")), n_movies = structure(list(), class = c("collector_double",
"collector")), ...9 = structure(list(), class = c("collector_skip",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), delim = ","), class = "col_spec"), problems = <pointer: 0x00000190f1f5db30>, class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"))
CodePudding user response:
Deleted since I was using the wrong column and it really seems too much of a hassle to go back and fix.
First construct a data.frame
temp <- data.frame( counts = c(table(dt.movies.actor$year)),
years=as.numeric(names(table(dt.movies.actor$year))) )
Then add a column that is cumsum()
of the counts:
final <- cbind(temp, running_tot <- cumsum(temp$counts) )
None of this is tested, since you ogffer output of a data.table object in a format that is messy. Learn to post easily pasted data objects with dput
.
I think this might give your the temp
results more elegantly:
temp <- as.data.frame( table(dt.movies.actor$year) )
... but the names of the columns are Var1
and Freq
, so you would need to proceed appropriately.
This is a simplified version of the TarJae data:
df <- structure(list(movie = c("The Gay Divorcee (1934)", "Roberta (1935)",
"Top Hat (1935)", "Follow the Fleet (1936)", "Swing Time (1936)",
"A Damsel in Distress (1937)", "Shall We Dance (1937)", "Carefree (1938)",
"The Story of Vernon and Irene Castle (1939)", "Broadway Melody of 1940 (1940)",
"Second Chorus (1940)", "You'll Never Get Rich (1941)", "Holiday Inn (1942)",
"You Were Never Lovelier (1942)", "The Sky's the Limit (1943)",
"Yolanda and the Thief (1945)", "Blue Skies (1946)", "Ziegfeld Follies (1945)",
"Easter Parade (1948)"), actor = c("Fred Astaire (b.1899)", "Fred Astaire (b.1899)",
"Fred Astaire (b.1899)", "Fred Astaire (b.1899)", "Fred Astaire (b.1899)",
"Fred Astaire (b.1899)", "Fred Astaire (b.1899)", "Fred Astaire (b.1899)",
"Fred Astaire (b.1899)", "Fred Astaire (b.1899)", "Fred Astaire (b.1899)",
"Fred Astaire (b.1899)", "Fred Astaire (b.1899)", "Fred Astaire (b.1899)",
"Fred Astaire (b.1899)", "Fred Astaire (b.1899)", "Fred Astaire (b.1899)",
"Fred Astaire (b.1899)", "Fred Astaire (b.1899)"), year = c(1934,
1935, 1935, 1936, 1936, 1937, 1937, 1938, 1939, 1940, 1940, 1941,
1942, 1942, 1943, 1945, 1946, 1945, 1948), duration = c(107,
106, 101, 110, 103, 101, 109, 83, 93, 102, 84, 88, 100, 97, 89,
108, 99, 110, 103), adult_movie = c(0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0), rating = c(7.5, 7.1, 7.8, 7.2,
7.6, 6.8, 7.5, 7.1, 6.9, 7.4, 5.8, 6.8, 7.4, 7.3, 6.4, 6, 6.5,
6.5, 7.4), votes = c(644, 276, 1625, 374, 1148, 165, 614, 307,
214, 200, 121, 199, 1240, 301, 77, 79, 117, 272, 800), n_movies = c(7524,
7708, 7708, 8071, 8071, 8165, 8165, 8170, 7846, 7455, 7455, 6425,
5659, 5659, 5203, 4229, 5206, 4229, 6378)), row.names=c(NA, -19L), class= "data.frame")