LAB 4F: This Model Is Big Enough for All of Us
Directions: Follow along with the slides, completing the questions in blue on your computer, and answering the questions in red in your journal.
Building better models
- So far in the labs, we've learned how to make predictions using the line of best fit, also known as a linear model or regression model.
- We've also learned how to measure our model's prediction accuracy by cross-validation.
- In this lab, we'll investigate the following question:
  Will including more variables in our model improve its predictions?
Divide & Conquer
- Start by loading the movie data and split it into two sets (see Lab 4C for help).
  – A set named training that includes 75% of the data.
  – A set named test that includes the remaining 25%.
  – Remember to use set.seed.
- Create a linear model, using the training data, that predicts gross using runtime.
  – Compute the MSE of the model by making predictions for the test data.
- Do you think that a movie's runtime is the only factor that goes into how much a movie will make? What else might affect a movie's gross?
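
The split, fit, and score steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the lab's answer: it uses R's built-in mtcars data in place of movie (with mpg and wt standing in for gross and runtime), the seed value 123 is arbitrary, and the split uses base R's sample(), which may differ from the approach shown in Lab 4C.

```r
# Illustration only: mtcars stands in for the movie data.
set.seed(123)  # arbitrary seed, so the random split is reproducible

n <- nrow(mtcars)

# Randomly pick 75% of the row numbers for the training set
train_rows <- sample(seq_len(n), size = floor(0.75 * n))
training <- mtcars[train_rows, ]
test <- mtcars[-train_rows, ]   # the remaining 25%

# Fit the line of best fit using the training data only
model_one <- lm(mpg ~ wt, data = training)

# Predict for the unseen test rows, then average the squared errors (MSE)
test_pred <- predict(model_one, newdata = test)
mse_one <- mean((test$mpg - test_pred)^2)
mse_one
```

The model never sees the test rows while it is being fit, so the MSE measures how well it predicts new data rather than how well it memorized the training data.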
Including more info
- Data scientists often find that including more relevant information in their models leads to better predictions.
  – Fill in the blanks below to predict gross using runtime and reviews_num.
    lm(____ ~ ____ + ____, data = training)
- Does this new model make more or less accurate predictions? Describe the process you used to arrive at your conclusion.
- Write down the code you would use to include a 3rd variable, of your choosing, in your lm().
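
One possible process for deciding "more or less accurate" is to score both models on the same test set and compare their MSEs. The sketch below again uses the built-in mtcars data as a stand-in for movie; the seed is arbitrary, and test_mse() is a hypothetical helper written for this illustration, not a function from the lab.

```r
# Illustration only: mtcars stands in for the movie data.
set.seed(123)  # arbitrary seed
train_rows <- sample(seq_len(nrow(mtcars)), size = floor(0.75 * nrow(mtcars)))
training <- mtcars[train_rows, ]
test <- mtcars[-train_rows, ]

# Hypothetical helper: fit a formula on train_data, return the MSE on test_data
test_mse <- function(model_formula, train_data, test_data) {
  fit <- lm(model_formula, data = train_data)
  outcome <- all.vars(model_formula)[1]  # name of the variable left of the ~
  predictions <- predict(fit, newdata = test_data)
  mean((test_data[[outcome]] - predictions)^2)
}

mse_one <- test_mse(mpg ~ wt, training, test)
mse_two <- test_mse(mpg ~ wt + hp, training, test)

# The model with the smaller test MSE made the more accurate predictions
c(one_predictor = mse_one, two_predictors = mse_two)
```

Adding a variable can never make the fit to the training data worse, but the test MSE can move in either direction, which is why the comparison is made on the test set.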
On your own
- Write down which other variables in the movie data you think would help you make better predictions.
  – Are there any variables that you think would not improve our predictions?
- Create a model for all of the variables you think are relevant.
  – Assess whether your model makes more accurate predictions for the test data than the model that included only runtime and reviews_num.
- With your neighbors, determine which combination of variables leads to the best predictions for the test data.
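
When several combinations of variables are in play, it can help to score them all in one pass. A minimal sketch, assuming the built-in mtcars data as a stand-in for movie; the seed and the particular variable combinations are arbitrary examples, not recommendations.

```r
# Illustration only: mtcars stands in for the movie data.
set.seed(123)  # arbitrary seed
train_rows <- sample(seq_len(nrow(mtcars)), size = floor(0.75 * nrow(mtcars)))
training <- mtcars[train_rows, ]
test <- mtcars[-train_rows, ]

# Candidate models to compare (arbitrary example combinations)
candidates <- list(
  mpg ~ wt,
  mpg ~ wt + hp,
  mpg ~ wt + hp + qsec
)

# Fit each candidate on the training data and compute its test MSE
mse_values <- sapply(candidates, function(f) {
  fit <- lm(f, data = training)
  mean((test$mpg - predict(fit, newdata = test))^2)
})

# Label each MSE with the formula that produced it
names(mse_values) <- sapply(candidates, function(f) paste(deparse(f), collapse = ""))

# Smallest test MSE first
sort(mse_values)
```

If you and your neighbors use different seeds, your test sets will contain different rows, so agree on one seed before comparing MSEs.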