Unit 4 Vocabulary

census

an official count or survey of a population, typically recording various details of individuals

Classification and Regression Trees (CART)

a predictive algorithm used in machine learning; it explains how a target variable's values can be predicted based on other values
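
As a rough illustration of predicting a target from other values, a very small tree can be written by hand as nested yes/no questions. This is only a sketch: the function name, the variables, and the 50-degree threshold are all made up for the example, and a real CART algorithm would learn its splits from data.

```python
def predict_play_outside(temperature_f, is_raining):
    """Hand-built tree (hypothetical rules, not learned from data)."""
    # First split: is it raining?
    if is_raining:
        return "no"
    # Second split: is it too cold? (50 is an invented threshold)
    if temperature_f < 50:
        return "no"
    # Leaf reached when it is dry and warm enough
    return "yes"


prediction = predict_play_outside(temperature_f=72, is_raining=False)
print(prediction)
```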

classify

to identify which of a set of categories (sub-populations) an observation (or set of observations) belongs to

cluster

a group of similar things or people positioned or occurring closely together

clustering

the process of grouping a set of objects (or people) so that those in the same group are more similar to each other than to those in other groups

correlation coefficient

a statistical measure of the strength and direction of the linear relationship between two variables; its value ranges from -1 to 1
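
One common version is the Pearson correlation coefficient. The sketch below computes it from scratch; the function name and the sample numbers are invented for illustration only.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How much the two variables vary together
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable varies on its own
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(var_x * var_y)


hours_studied = [1, 2, 3, 4, 5]      # made-up data
test_scores = [52, 58, 61, 70, 74]   # made-up data
r = pearson_r(hours_studied, test_scores)
print(round(r, 2))  # close to 1, so a strong positive association
```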

decision tree

a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance outcomes

k-means

an algorithm that aims to partition data into k clusters so that data points in the same cluster are similar and data points in different clusters are farther apart
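
A minimal sketch of the idea for one-dimensional data, assuming the starting centers are supplied by hand (real implementations usually pick them at random). The function name and the numbers are illustrative only.

```python
def k_means_1d(points, centers, iterations=10):
    """Repeat two steps: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its old center
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


points = [1, 2, 3, 10, 11, 12]   # made-up data with two obvious groups
centers, clusters = k_means_1d(points, centers=[1, 12])
print(centers)   # [2.0, 11.0]
print(clusters)  # [[1, 2, 3], [10, 11, 12]]
```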

linear

used to describe a straight-line relationship between two variables

line of best fit

a line through a scatterplot of data points that best expresses the relationship between those points
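
"Best" needs a rule for scoring a line. The sketch below assumes the usual least-squares rule (make the squared vertical distances as small as possible); the function name and data are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    # The least-squares line always passes through (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1, 2, 3, 4]   # made-up data lying exactly on y = 2x + 1
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```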

market

refers to the live streaming of trade-related data; it encompasses a range of information such as price, bid/ask quotes and market volume

mean absolute error

the average amount of error in a set of predictions; it is found by averaging the absolute differences between each predicted value and the "true" value

mean squared error

a measure of how close a regression line is to a set of points; it is determined by finding the average of the squared differences between your guesses and the actual values
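
Mean absolute error and mean squared error can be computed side by side on the same predictions. This is a small sketch; the function names and the numbers are invented for the example.

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average of the squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


actual = [3, 5, 7]      # made-up observed values
predicted = [2, 5, 9]   # made-up predictions (off by 1, 0, and 2)

mae = mean_absolute_error(actual, predicted)   # (1 + 0 + 2) / 3
mse = mean_squared_error(actual, predicted)    # (1 + 0 + 4) / 3
print(mae, mse)
```

Squaring makes the one large miss (off by 2) count for more, which is why the mean squared error here is larger than the mean absolute error.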

misclassification rate

the proportion of observations that were predicted to be in one category but were actually in another
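
A short sketch of the calculation: count the predictions that do not match the actual category and divide by the total. The labels and the function name are made up for illustration.

```python
def misclassification_rate(actual, predicted):
    """Fraction of predictions that land in the wrong category."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)


actual = ["spam", "ham", "spam", "ham", "ham"]       # made-up labels
predicted = ["spam", "spam", "spam", "ham", "spam"]  # 2 of 5 are wrong

rate = misclassification_rate(actual, predicted)
print(rate)  # 0.4
```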

model

provides a simplified version or representation of real-life situations or data. It is used to make sense of data or make predictions based on it.

negative association

when the values of one variable tend to decrease as the values of the other variable increase

network

a system designed to transfer data from one network access point to one or more other network access points via data switching, transmission lines, and system controls

no association

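when the values of one variable show no tendency to increase or decrease as the values of the other variable change; the dots on a scatterplot are scattered and no line fits them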

nodes

points of intersection or connection within a data communication network

non-linear

a form of regression analysis in which observational data are modeled by a function which is a nonlinear combination of the model parameters and depends on one or more independent variables; the data are fitted by a method of successive approximations

observed value

the value that is actually observed (what actually happened)

polynomial trends

describes a pattern in data that is curved or breaks from a straight linear trend; it often occurs in a large set of data that contains many fluctuations

positive association

when the values of one variable tend to increase as the values of the other variable increase

predicted value

the value that a model expects for a given input; for a line of best fit, it is the value found by plugging the input into the line's equation

regression line

a line that best describes the behavior of a set of data

residual

the difference between the actual outcome and our prediction (observed value minus predicted value); also called an "error"
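
A quick sketch, assuming the common convention of subtracting the predicted value from the observed value. The numbers are made up for illustration.

```python
observed = [10, 12, 15]    # what actually happened (made-up values)
predicted = [9, 13, 15]    # what the model expected (made-up values)

# One residual per observation: observed minus predicted
residuals = [o - p for o, p in zip(observed, predicted)]
print(residuals)  # [1, -1, 0]
```

A positive residual means the model guessed too low, a negative one means it guessed too high, and zero means the guess was exact.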

rule

a set way to calculate or solve a problem

shape

describes the distribution (or pattern) of the data within a dataset

strength of association

how much two variables covary and the extent to which the INDEPENDENT VARIABLE affects the DEPENDENT VARIABLE

test data

a random subset consisting of about 15-25% of the original dataset on which a model is tested

training data

a random subset consisting of about 75-85% of the original dataset on which a model is trained
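
The split into training data and test data can be sketched as a shuffle followed by a cut. This example assumes an 80/20 split and a fixed random seed so the result is repeatable; the function name and both default values are choices made for illustration.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows, then cut it into (train, test)."""
    shuffled = rows[:]                      # leave the original list untouched
    random.Random(seed).shuffle(shuffled)   # random but repeatable order
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(100))   # stand-in for 100 observations
train, test = train_test_split(rows)
print(len(train), len(test))  # 80 20
```

Every observation ends up in exactly one of the two subsets, so the model is never tested on data it was trained on.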

trend

a line, often referred to as a line of best fit, that is used to represent the behavior of a set of data and to determine whether there is a certain pattern