Skip to content

Lab 1A - Data, Code & RStudio

Lab 1A - Data, Code & RStudio

Directions: Follow along with the slides, completing the questions in blue on your computer, and answering the questions in red in your journal.

Welcome to the labs!

  • Throughout the year, you'll be putting your data science skills to work by completing the labs.

  • You'll learn how to program in the R programming language.

    – The programming language used by actual data scientists.

  • Your code will be written in RStudio which is an easy to use interface for coding using R.

So let's get started!

  • The data for our first few labs comes from the Centers for Disease Control (CDC).

    – The CDC is a federal institution that studies public health.

  • Type these two commands into your console:

    data(cdc)
    View(cdc)
    
  • Describe the data that appeared after running View(cdc):

    Who is the information about?

    What sorts of information about them was collected?

  • To find out more information about the cdc data, type the command below into your console.

    – To get back to the slides find and click on the Viewer tab.

    ?cdc
    

Data: Variables & Observations

  • Data can be broken up into two parts.

    `1. Observations

    `2. Variables

    Observations are the who or what we are collecting data from/about.

    Variables are the measurements or characteristics about our observations.

  • If need be, re-type the command you used to View your data. Then answer the following:

    Based on the data, describe a few characteristics about the first observation.

    What does the first column tell us about our observations?

  • In order to describe the first observation, notice that you had to look at the first row of the spreadsheet. Each row, in this case, describes a person.

  • The columns of the spreadsheet represent variables.

Uncovering our Data's Structure

  • Now that we've looked at our data, let's look at how RStudio is organized.

  • RStudio's main window is composed of four panes

  • Find the pane that has a tab titled Environment and click on the tab.

    – This pane contains a list of everything that's currently available for R to use.

    – Notice that R knows we have our cdc data loaded.

  • How many students are in our cdc data set?

  • How many variables were measured for each student?

Some New Functions

  • Type the following commands into the console:

    dim(cdc)
    nrow(cdc)
    ncol(cdc)
    names(cdc)
    
  • Which of these functions tell us the number of observations in our data?

  • Which of these functions tell us the number of variables?

First Steps

  • Typing commands into the console is your first step into the larger world of programming or coding (terms which are often used interchangeably).

  • Coding is all about learning how to send instructions to your computer.

    – The way we speak to the computer, using a coding language, syntax.

  • R is one of many coding languages. Each coding language is slightly different, and these differences are reflected in the syntax.

  • Capitalization, spelling and punctuation are REALLY important.

Syntax matters

  • Run the following commands. What happens after each command?

    Names(cdc)
    NAMES(cdc)
    names(cdc)
    names(CDC)
    
  • Which does R understand?

R's most important syntax

  • Most of the commands you will be using follow the syntax below:

    function (y ~ x, data = ____ )
    
  • To create graphs or plots you need to provide R with the following:

    – The name of the R function, often the plot’s name, that tells the computer how to create your graph.

    • The variable(s) containing the information we want the function to use.

    • The dataset containing the variables.

  • Notice that when we analyze a single variable the value for y is left blank.

    bargraph(~grade, data = cdc)
    

  • Later on, we’ll see we can use this syntax to do more than create graphs.

Syntax in action

  • Search through the different panes. Find and then click on the Plots tab.

    • To get back to the slides, find and then click on the Viewer tab.
  • Which one of these plots would be useful for answering the question: Is it unusual for students in the CDC dataset to be taller than 1.8 meters?

  • Run the three commands below then answer the question that follows.

    histogram(~height, data = cdc)
    
    bargraph(~drive_text, data = cdc)
    
    xyplot(weight~height, data = cdc)
    
  • Do you think it’s unusual for students in the data to be taller than 1.8 meters? Why or why not?

  • Hint: Use the arrow keys on the Plots tab to toggle between the plots.

On your own:

  • After completing the lab, answer the following questions:

    What is public health and do we collect data about it?

    How do you think our data was collected? Does it include every high school aged student in the US?

    How might the CDC use this data? Who else could benefit from using this data?

    Write the code to visualize the distribution of weights of the students in the CDC data with a histogram. What is the typical weight?

    Write the code to create a bargraph to visualize the distribution of how often students ate fruit. About how many students did not eat fruit over the previous 7 days?