# Linear relationship between the dependent and independent variables

### Identify dependent & independent variables

Practice figuring out whether a variable is dependent or independent. The relationship between the two variables can be expressed by the linear regression equation

$$Y = aX + b,$$

where $Y$ is the dependent variable and $X$ is the independent variable. The aim is to find the relationship between $Y$ and $X$ that yields values of $Y$ with the least error. These regression estimates are used to explain the relationship between one dependent variable and one or more independent variables.
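As a minimal sketch of what "least error" means for Y = aX + b, the snippet below fits the line by ordinary least squares in plain Python. The function name `fit_line` and the sample data are illustrative assumptions, not part of the original material.

```python
def fit_line(xs, ys):
    """Return (a, b) minimising the sum of squared errors of y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squares of x and sum of cross-products around the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Illustrative data lying exactly on y = 2x + 1 (made up for this sketch).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
a, b = fit_line(xs, ys)
print(a, b)
```

With noisy data the same function returns the slope and intercept that make the squared vertical distances to the line as small as possible.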

### Dependent and independent variables review

If we now take the log of both sides of the equation, the residual error term becomes an additive factor in a linear equation, and we can go ahead and estimate b1 via standard multiple regression.

The general growth model is similar to the example that we previously considered. The parameter b0 in this model represents the maximum growth value.

### Models for Binary Responses

It is not uncommon for a dependent (response) variable to be binary in nature, that is, to have only two possible values.

For example, patients either do or do not recover from an injury; job applicants either succeed or fail at an employment test; subscribers to a journal either do or do not renew a subscription; coupons may or may not be returned; and so on.

However, there is a problem: standard multiple regression does not take into account that the response variable is binary. It will therefore inevitably fit a model that leads to predicted values that are greater than 1 or less than 0.

However, predicted values that are greater than 1 or less than 0 are not valid; the restriction in the range of the binary variable (between 0 and 1) is ignored by standard multiple regression. We could rephrase the regression problem so that, rather than predicting a binary variable, we are predicting a continuous variable that naturally stays within the bounds of 0 and 1. The two most common regression models that accomplish exactly this are the logit and the probit regression models.
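To make the problem concrete, the following sketch fits an ordinary straight line to a binary (0/1) response and shows that some fitted values fall outside the valid range. The data and the helper `fit_line` are made up for illustration only.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up binary outcomes (0 = failure, 1 = success) for six values of x.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
slope, intercept = fit_line(xs, ys)

# Fitted values from the straight line; the first is below 0 and the last above 1.
predictions = [slope * x + intercept for x in xs]
print(predictions)
```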

In the logit regression model, the predicted values for the dependent variable will never be less than or equal to 0, or greater than or equal to 1, regardless of the values of the independent variables. Suppose we think of the binary dependent variable y in terms of an underlying continuous probability p, ranging from 0 to 1.
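The bounded behaviour described above can be illustrated with the standard logistic function and its inverse, the logit transform. This is only a sketch: the function names and the coefficient values `b0` and `b1` are hypothetical and chosen for illustration.

```python
import math


def inverse_logit(z):
    """Map a real-valued linear predictor z to a probability between 0 and 1.

    Mathematically the result is strictly inside (0, 1); floating point
    rounding only matters for very large |z|.
    """
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Map a probability 0 < p < 1 back to the real line: ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))


# Hypothetical coefficients for a single independent variable x.
b0, b1 = -2.0, 0.5
probabilities = [inverse_logit(b0 + b1 * x) for x in range(-10, 31)]
print(min(probabilities), max(probabilities))
```

However large or small the linear predictor `b0 + b1 * x` becomes over this range, the predicted probability stays between 0 and 1, and `logit` undoes the transformation.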

## Dependent and independent variables review

We can then transform that probability p as:

$$p' = \ln\left(\frac{p}{1-p}\right)$$

In fact, if we perform the logit transform on both sides of the logit regression equation, we obtain the standard linear regression model:

$$p' = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$$

You may also consider the binary response variable to be the result of a normally distributed underlying variable that actually ranges from minus infinity to positive infinity. For example, a subscriber's feelings about renewing a journal subscription may lie anywhere along such a continuum. In any event, all that we (the publisher of the journal) will see is the binary response of renewal or failure to renew the subscription.

It is reasonable to assume that these feelings are normally distributed, and that the probability p of renewing the subscription is about equal to the relative space under the normal curve.
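The idea of reading a probability off the area under the normal curve can be sketched with Python's standard library. The function names are assumptions made for this illustration; `NormalDist` is the standard normal distribution from the `statistics` module.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def probit_probability(z):
    """Area under the standard normal curve to the left of z."""
    return standard_normal.cdf(z)


def probit(p):
    """Inverse mapping: the z-value whose left-tail area equals p."""
    return standard_normal.inv_cdf(p)


# An underlying value of 0 corresponds to a probability of one half.
print(probit_probability(0.0))
```

A larger value of the underlying (unobserved) variable corresponds to a larger area under the curve, and therefore to a higher probability of the observed binary outcome.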

Therefore, if we transform each side of the equation so as to reflect normal probabilities, we obtain what is referred to as the probit regression model. The term probit was first used by Bliss.

### General Logistic Regression Model

The general logistic model can be seen as an extension of the logit model. However, while the logit model restricts the dependent (response) variable to only two values, this model allows the response to vary within a particular lower and upper limit.

For example, suppose we are interested in the population growth of a species that is introduced to a new habitat, as a function of time. The dependent variable would be the number of individuals of that species in the respective habitat. Obviously, there is a lower limit on the dependent variable, since fewer than 0 individuals cannot exist in the habitat; however, there also is most likely an upper limit that will be reached at some point in time.
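A bounded growth curve of this kind can be sketched as follows. The functional form y = b0 / (1 + b1 * exp(b2 * x)) is an assumption here (one commonly used logistic growth parameterisation, with b0 as the upper limit), and the parameter values are made up.

```python
import math


def general_logistic(x, b0, b1, b2):
    """Assumed logistic growth form: b0 / (1 + b1 * exp(b2 * x)).

    With b1 > 0 and b2 < 0 the curve rises from b0 / (1 + b1) at x = 0
    towards the upper limit b0 as x grows.
    """
    return b0 / (1.0 + b1 * math.exp(b2 * x))


# Hypothetical parameters: upper limit of 1000 individuals, starting near 10.
b0, b1, b2 = 1000.0, 99.0, -0.5
population = [general_logistic(t, b0, b1, b2) for t in range(0, 60)]
print(population[0], population[-1])
```

The predicted population never drops below 0 and never exceeds the upper limit `b0`, which is the property the species example relies on.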

### Drug Responsiveness and Half-Maximal Response

In pharmacology, a three-parameter model is often used to describe the effects of different dose levels of a drug. In this model, the parameter b0 denotes the expected response at the level of dose saturation, b2 is the concentration that produces a half-maximal response, and b1 determines the slope of the function.
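The roles of the three parameters can be checked numerically. The formula below, y = b0 - b0 / (1 + (x / b2)^b1), is an assumed form that is consistent with the description above (saturation at b0, half-maximal response at x = b2); it is not confirmed to be the exact equation the original text displayed, and the parameter values are hypothetical.

```python
def dose_response(x, b0, b1, b2):
    """Assumed dose-response form: b0 - b0 / (1 + (x / b2) ** b1)."""
    return b0 - b0 / (1.0 + (x / b2) ** b1)


# Hypothetical parameters: saturation at 100, half-maximal dose of 5, slope 2.
b0, b1, b2 = 100.0, 2.0, 5.0

half_max = dose_response(b2, b0, b1, b2)          # response at the dose x = b2
near_saturation = dose_response(5000.0, b0, b1, b2)  # response at a very high dose
print(half_max, near_saturation)
```

At the dose x = b2 the response is exactly half of b0, and for very large doses it approaches b0; a larger b1 makes the transition between the two steeper.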

### Discontinuous Regression Models

Piecewise linear regression. It is not uncommon that the nature of the relationship between one or more independent variables and a dependent variable changes over the range of the independent variables. For example, suppose we monitor the per-unit manufacturing cost of a particular product as a function of the number of units manufactured (output) per month.

In general, the more units per month we produce, the lower is our per-unit cost, and this linear relationship may hold over a wide range of different levels of production output. However, it is conceivable that above a certain point, there is a discontinuity in the relationship between these two variables.

For example, the per-unit cost may decrease relatively less quickly when older (less efficient) machines have to be put on-line in order to cope with the larger volume. Suppose that the older machines go on-line when the production output rises above a given number of units per month; we may then specify a regression model for cost-per-unit with one slope for output up to that breakpoint and another slope for output above it. If the breakpoint is not known in advance, simply replace the fixed breakpoint in the equation with an additional parameter to be estimated.

For example, imagine that, after the older machines are put on-line, the per-unit cost jumps to a higher level and then slowly goes down as volume continues to increase. In that case, simply specify an additional intercept (b3) for output above the breakpoint.

The method described here to estimate different regression equations in different domains of the independent variable can also be used to distinguish between groups.
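The case in which the per-unit cost jumps at the breakpoint and then declines with its own slope can be sketched by fitting a separate straight line on each side of a known breakpoint. Everything specific here is hypothetical: the breakpoint of 450 units, the cost figures, and the helper names are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_piecewise(xs, ys, breakpoint):
    """Fit one line for x <= breakpoint and another for x > breakpoint."""
    low = [(x, y) for x, y in zip(xs, ys) if x <= breakpoint]
    high = [(x, y) for x, y in zip(xs, ys) if x > breakpoint]
    low_fit = fit_line([x for x, _ in low], [y for _, y in low])
    high_fit = fit_line([x for x, _ in high], [y for _, y in high])
    return low_fit, high_fit


# Made-up monthly output and per-unit cost; cost jumps once output exceeds 450.
output = [100, 200, 300, 400, 500, 600, 700, 800]
cost = [9.0, 8.0, 7.0, 6.0, 7.0, 6.8, 6.6, 6.4]
(b1, b0), (b2, b3) = fit_piecewise(output, cost, breakpoint=450)
print(b0, b1, b3, b2)
```

Here `b0` and `b1` are the intercept and slope below the breakpoint, while `b3` and `b2` are the additional intercept and the slope above it, mirroring the parameter names used in the text.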

For example, suppose that in the example above there are three different plants; to simplify the example, let us ignore the breakpoint for now.

When moving on to assumptions 3, 4, 5, 6 and 7, we suggest testing them in this order because it represents an order where, if a violation of the assumption is not correctable, you will no longer be able to use linear regression. In fact, do not be surprised if your data fails one or more of these assumptions, since this is fairly typical when working with real-world data rather than textbook examples, which often only show you how to carry out linear regression when everything goes well.

Just remember that if you do not check that your data meets these assumptions, or you test for them incorrectly, the results you get when running linear regression might not be valid.

Assumption #3: There needs to be a linear relationship between the dependent and independent variables. Whilst there are a number of ways to check whether a linear relationship exists between your two variables, we suggest creating a scatterplot using Stata, where you can plot the dependent variable against your independent variable. You can then visually inspect the scatterplot to check for linearity. If the relationship displayed in your scatterplot is not linear, you will have to either run a non-linear regression analysis or "transform" your data, which you can do using Stata.

Assumption #4: There should be no significant outliers. Outliers are simply single data points within your data that do not follow the usual pattern. The problem with outliers is that they can have a negative effect on the regression equation that is used to predict the value of the dependent variable based on the independent variable. This will change the output that Stata produces and reduce the predictive accuracy of your results. Fortunately, you can use Stata to carry out casewise diagnostics to help you detect possible outliers.
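One simple way to flag possible outliers is to look at standardized residuals. The sketch below is written in Python rather than Stata and is not Stata's own casewise diagnostics; the cut-off of 2 and the data are assumptions chosen for illustration.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def standardized_residuals(xs, ys):
    """Residuals divided by the residual standard error (n - 2 degrees of freedom)."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    s = math.sqrt(sum(e ** 2 for e in residuals) / (len(xs) - 2))
    return [e / s for e in residuals]


def find_outliers(xs, ys, threshold=2.0):
    """Indices of cases whose standardized residual exceeds the threshold in size."""
    return [i for i, z in enumerate(standardized_residuals(xs, ys)) if abs(z) > threshold]


# Made-up data on the line y = 2x + 1, with one case pushed 10 units upwards.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
ys[5] += 10
print(find_outliers(xs, ys))
```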

Assumption #5: You should have independence of observations, which you can easily check using the Durbin-Watson statistic, a simple test to run using Stata.
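For reference, the Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. The sketch below computes it in plain Python; the function name and the example residuals are illustrative assumptions.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals listed in observation order.

    Values near 2 suggest no first-order autocorrelation; values towards 0
    suggest positive and values towards 4 negative autocorrelation.
    """
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Made-up residuals: the first series flips sign every step, the second in runs.
alternating = durbin_watson([1, -1, 1, -1])
in_runs = durbin_watson([1, 1, -1, -1])
print(alternating, in_runs)
```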

Assumption #6: Your data needs to show homoscedasticity, which is where the variances along the line of best fit remain similar as you move along the line. Textbook scatterplots make the difference between data that meets this assumption and data that violates it look clear-cut, but when you analyse your own data you will be lucky if your scatterplot looks like either extreme; real-world data is often a lot messier.

You can check whether your data shows homoscedasticity by plotting the regression standardized residuals against the regression standardized predicted values.
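The points for such a plot can be computed as sketched below, together with a crude numeric stand-in for the visual check: the ratio of the residual spread in the upper half of predicted values to that in the lower half. This is not a formal test; the helper names and the data are assumptions made for illustration.

```python
from statistics import mean, pstdev


def fit_line(xs, ys):
    """Ordinary least squares; returns (slope, intercept)."""
    mean_x, mean_y = mean(xs), mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def standardized_plot_points(xs, ys):
    """Return (standardized predicted value, standardized residual) pairs."""
    slope, intercept = fit_line(xs, ys)
    predicted = [slope * x + intercept for x in xs]
    residuals = [y - p for y, p in zip(ys, predicted)]
    mean_pred, sd_pred = mean(predicted), pstdev(predicted)
    sd_resid = pstdev(residuals)
    return [((p - mean_pred) / sd_pred, e / sd_resid) for p, e in zip(predicted, residuals)]


def spread_ratio(points):
    """Mean squared residual in the upper half of predictions over the lower half."""
    ordered = sorted(points)
    half = len(ordered) // 2
    lower = [r ** 2 for _, r in ordered[:half]]
    upper = [r ** 2 for _, r in ordered[half:]]
    return mean(upper) / mean(lower)


# Made-up data whose scatter grows with x (a violation of homoscedasticity).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.1, 1.9, 3.1, 3.9, 6.0, 5.0, 8.0, 7.0]
points = standardized_plot_points(xs, ys)
print(spread_ratio(points))
```

A ratio close to 1 is what roughly constant variance would give; a ratio far above or below 1, as with these invented data, points towards heteroscedasticity and is a prompt to inspect the plot itself.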

Assumption #7: Finally, you need to check that the residuals (errors) of the regression line are approximately normally distributed. Two common methods to check this assumption are a histogram with a superimposed normal curve and a Normal P-P Plot.

In practice, checking for assumptions 3, 4, 5, 6 and 7 will probably take up most of your time when carrying out linear regression. However, it is not a difficult task, and Stata provides all the tools you need to do this. In the section, Procedure, we illustrate the Stata procedure required to perform linear regression assuming that no assumptions have been violated.
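As a rough illustration of the normality check on residuals, the coordinates of a Normal P-P plot can be computed as follows: each sorted, standardized residual is paired with its empirical cumulative probability and the cumulative probability a normal distribution would assign to it. The function names, the plotting position (i + 0.5) / n, and the example residuals are assumptions made for this sketch.

```python
from statistics import NormalDist, mean, stdev


def pp_plot_points(residuals):
    """Return (empirical cumulative probability, normal cumulative probability) pairs."""
    n = len(residuals)
    centre, spread = mean(residuals), stdev(residuals)
    standardized = sorted((e - centre) / spread for e in residuals)
    normal = NormalDist()
    return [((i + 0.5) / n, normal.cdf(z)) for i, z in enumerate(standardized)]


def max_pp_deviation(residuals):
    """Largest vertical distance of any P-P point from the diagonal."""
    return max(abs(observed - expected) for observed, expected in pp_plot_points(residuals))


# Made-up residuals placed at normal quantiles, so the points hug the diagonal.
normal_like = [NormalDist().inv_cdf((i + 0.5) / 20) for i in range(20)]
# Made-up residuals taking only two values, which is clearly not normal.
two_valued = [-1.0] * 10 + [1.0] * 10

print(max_pp_deviation(normal_like), max_pp_deviation(two_valued))
```

Points that stay close to the diagonal (a small maximum deviation) are consistent with approximately normal residuals; large departures are not.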

First, we set out the example we use to explain the linear regression procedure in Stata.

### Stata Example

Studies show that exercising can help prevent heart disease. Within reasonable limits, the more you exercise, the less risk you have of suffering from heart disease.

One way in which exercise reduces your risk of suffering from heart disease is by reducing a fat in your blood, called cholesterol. The more you exercise, the lower your cholesterol concentration. Furthermore, it has recently been shown that the amount of time you spend watching TV, an indicator of a sedentary lifestyle, might be a good predictor of heart disease.

Therefore, a researcher decided to determine if cholesterol concentration was related to time spent watching TV in otherwise healthy 45 to 65 year old men (an at-risk category of people). For example, as people spent more time watching TV, did their cholesterol concentration also increase (a positive relationship), or did the opposite happen? The researcher also wanted to know the proportion of the variation in cholesterol concentration that time spent watching TV could explain, as well as being able to predict cholesterol concentration.

The researcher could then determine whether, for example, people who spent eight hours watching TV per day had dangerously high levels of cholesterol concentration compared to people watching just two hours of TV.
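The three quantities the researcher is after (direction of the relationship, proportion of variation explained, and predictions at given viewing times) can be sketched in Python as follows. The numbers are invented for this sketch, are in arbitrary units, and are not the guide's dataset; the helper names are likewise assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys, slope, intercept):
    """Proportion of the variation in y explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_resid = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return 1.0 - ss_resid / ss_total


# Invented values: daily hours of TV and cholesterol concentration (arbitrary units).
hours_tv = [1, 2, 3, 4, 5, 6, 7, 8]
cholesterol = [4.6, 5.0, 4.9, 5.4, 5.6, 5.5, 6.0, 6.2]

slope, intercept = fit_line(hours_tv, cholesterol)
r2 = r_squared(hours_tv, cholesterol, slope, intercept)
predicted_2h = slope * 2 + intercept
predicted_8h = slope * 8 + intercept
print(slope, r2, predicted_2h, predicted_8h)
```

A positive slope corresponds to the positive relationship mentioned above, `r2` is the proportion of variation explained, and the two predictions allow the two-hour and eight-hour viewers to be compared.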

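To carry out the analysis, the researcher recruited healthy male participants between the ages of 45 and 65 years old. The amount of time spent watching TV (the independent variable) and cholesterol concentration (the dependent variable) were recorded for each participant. The example and data used for this guide are fictitious; we have just created them for the purposes of this guide.

### Setup in Stata

In Stata, we created two variables: one for the dependent variable, cholesterol concentration, and one for the independent variable, time spent watching TV. It does not matter whether you create the dependent or independent variable first.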

Published with written permission from StataCorp LP.

### Test Procedure in Stata

In this section, we show you how to analyse your data using linear regression in Stata when the assumptions in the previous section, Assumptions, have not been violated. You can carry out linear regression using code or Stata's graphical user interface (GUI).

### Inside of the Linear Relation between Dependent and Independent Variables

After you have carried out your analysis, we show you how to interpret your results. First, choose whether you want to use code or Stata's graphical user interface (GUI).

Code. The code to carry out linear regression on your data uses Stata's `regress` command, followed by the name of your dependent variable and then the name of your independent variable. You need to be precise when entering the code into the box. The code is "case sensitive".