Denunciar

VannaJoy20Seguir

19 de Sep de 2022•0 gostou•2 visualizações

19 de Sep de 2022•0 gostou•2 visualizações

Denunciar

Educação

1 A PRIMER ON CAUSALITY Marc F. Bellemare∗ Introduction This is the second of two handouts written to help students understand quantitative methods in the social sciences. This handout is dedicated to discussing (some) of the ways in which one can identify causal relationships in the social sciences. In keeping with the notation introduced in the handout on linear regression, let 𝐷𝐷 be our variable of interest; 𝑦𝑦 be an outcome of interest; and the vector 𝑥𝑥 = (𝑥𝑥1, … , 𝑥𝑥𝐾𝐾) represent other factors – or control variables – for which we have data. For the purposes of this discussion, let 𝐷𝐷 measure a given policy, 𝑦𝑦 measure welfare, and the vector 𝑥𝑥 measure the various control variables the researcher has seen fit to include. See my “A Primer on Linear Regression” for a more basic handout. Mechanics Recall that the regression of 𝑦𝑦 on (𝐷𝐷, 𝑥𝑥1, … , 𝑥𝑥𝐾𝐾) is written as 𝑦𝑦𝑖𝑖 = 𝛼𝛼 + 𝛽𝛽1𝑥𝑥1𝑖𝑖 + ⋯ + 𝛽𝛽𝐾𝐾𝑥𝑥𝐾𝐾𝑖𝑖 + 𝛾𝛾𝐷𝐷𝑖𝑖 + 𝜖𝜖𝑖𝑖, (1) where i denotes a unit of observation. In the example of wages and education, the unit of observation would be an individual, but units of observations can be individuals, households, plots, firms, villages, communities, countries, etc. Just as the research question should drive the choice of what to measure for 𝑦𝑦, 𝐷𝐷, and 𝑥𝑥, the research question also drives the choice of the relevant unit of observation. The problem is that unless the researcher runs an experiment in which she randomly assigns the level of 𝐷𝐷 to each unit of observation i, the relationship from 𝐷𝐷 to 𝑦𝑦 will not be causal. That is, 𝛾𝛾 will not truly capture the impact of 𝐷𝐷 on 𝑦𝑦, as it will be “contaminated” by the presence of unobservable factors. Some of those factors can be included in 𝑥𝑥 = (𝑥𝑥1, … , 𝑥𝑥𝐾𝐾), of course, but it is in general impossible to fully control for every relevant factor. This is especially true when unobservable or costly to observe factors (e.g., risk aversion, technical ability, soil quality, etc.) play an important role in determining 𝐷𝐷 and 𝑦𝑦. So even if we get an estimate of 𝛾𝛾 that is statistically significant, we cannot necessarily assume that the relationship between the variable of interest and the outcome variable is causal. In other words, correlation does not imply causation. For example, suppose 𝐷𝐷 is an individual’s consumption of orange juice and 𝑦𝑦 is (some) indicator of health. We have often discussed in lecture how a simple regression of 𝑦𝑦 to 𝐷𝐷 would provide us with a biased estimate of 𝛾𝛾 because orange juice consumption is nonrandom and not exogenous to health. That is, there are factors other than orange juice consumption which determine health. Some are observable (e.g., how much someone exercises; whether they smoke; their diet; etc.), but several are unobservable (e.g., their willingness to pay for orange juice; their subjective valuation of health; th

- 1. 1 A PRIMER ON CAUSALITY Marc F. Bellemare∗ Introduction This is the second of two handouts written to help students understand quantitative methods in the social sciences. This handout is dedicated to discussing (some) of the ways in which one can identify causal relationships in the social sciences. In keeping with the notation introduced in the handout on linear regression, let �� be our variable of interest; �� be an outcome of interest; and the vector �� = (��1, … , ����) represent other factors – or control variables – for which we have data. For the purposes of this discussion, let �� measure a given policy, �� measure welfare, and the vector �� measure the various control variables the researcher has seen fit to include. See my “A Primer on Linear Regression” for a more basic handout. Mechanics Recall that the regression of �� on (��, ��1, … , ����) is written as ���� = �� + ��1��1�� + ⋯ + ���������� + ������ + ����, (1) where i denotes a unit of observation. In the example of wages and education, the unit of observation would
- 2. be an individual, but units of observations can be individuals, households, plots, firms, villages, communities, countries, etc. Just as the research question should drive the choice of what to measure for ��, ��, and ��, the research question also drives the choice of the relevant unit of observation. The problem is that unless the researcher runs an experiment in which she randomly assigns the level of �� to each unit of observation i, the relationship from �� to �� will not be causal. That is, �� will not truly capture the impact of �� on ��, as it will be “contaminated” by the presence of unobservable factors. Some of those factors can be included in �� = (��1, … , ����), of course, but it is in general impossible to fully control for every relevant factor. This is especially true when unobservable or costly to observe factors (e.g., risk aversion, technical ability, soil quality, etc.) play an important role in determining �� and ��. So even if we get an estimate of �� that is statistically significant, we cannot necessarily assume that the relationship between the variable of interest and the outcome variable is causal. In other words, correlation does not imply causation. For example, suppose �� is an individual’s consumption of orange juice and �� is (some) indicator of health. We have often discussed in lecture how a simple regression of �� to �� would provide us with a biased estimate of �� because orange juice consumption is nonrandom and not exogenous to health. That is, there are factors other than orange juice consumption which determine health. Some are observable (e.g., how much someone exercises; whether they smoke; their diet; etc.), but several are unobservable (e.g., their willingness to pay for orange juice; their subjective valuation of health; their level of risk aversion; their
- 3. genes; etc.) Thus, it really isn’t sufficient to run a kitchen-sink regression (i.e., a regression in which everything observable is thrown in as a control) to properly identify the causal impact of �� on ��. ∗ Associate Professor, Department of Applied Economics, and Director, Center for International Food and Agricultural Policy, University of Minnesota, 1994 Buford Ave, Saint Paul, MN 55113, [email protected] This is the August 2017 version of this handout. mailto:[email protected] 2 Identification So how do we identify causality? The best way to do so is to run a randomized controlled trial (RCT), which we have discussed in lecture. In this case, the idea would be to get a random sample of individuals of size �� and to assign half of the sample (i.e., �� 2⁄ ) to a control group and half to a treatment group. The latter group would be told to consume, say, one glass of orange juice every morning, and the other half would be told not to do so. Then, after a suitable period of time, we would compare the mean of �� between groups. The null hypothesis would of course be that the mean health of the treatment group is equal to the mean health of the control group. A rejection of the null in favor of finding that the mean health of the treatment group is higher than the mean health of the control group would then be evidence in favor of the hypothesis that orange juice is good for one’s health. More than that – it would
- 4. be evidence in favor that orange juice consumption causes good health. The problem is that it is not always possible to run an RCT, and even the simple example described above would be subject to important problems. For example, the individuals in the treatment group may not comply with the experimenters instructions, especially if they don’t like orange juice. More generally, they may simply forget to consume orange juice every morning. Likewise, the individuals in the control group may end up inadvertently consuming orange juice when they are not supposed to. These reasons – and others – would contaminate one’s estimate of �� in equation 1 and would invalidate the test of equality of means described above. So what is one to do? Instrumental Variables Estimation When one only has observational (i.e., nonexperimental) data at one’s disposal, the best way to identify causality is to find an instrumental variable (IV) for the endogenous variable. In the example above, the endogenous variable is ��, which is said to be endogenous to ��. What is an IV? It is a variable �� that is (i) correlated with ��; but (ii) uncorrelated with �� and which is used to make �� exogenous to ��. How does an IV exogenize an endogenous variable? By virtue of being correlated with the endogenous variable, yet uncorrelated with the error term, which is the definition of an instrument. I realize that this sounds tautological, so for example, Angrist (1990) studies the impact of education (��)
- 5. on wages (��). The problem is that education is endogenous to wage, if anything because people acquire education in expectation of the wage they think this will get them. In other words, even if we find a positive coefficient for education in a regression of wage on explanatory variables, this is merely a correlation, and it does not necessarily indicate that education causally affects wages. To instrument for this, Angrist had to find a variable that would be correlated with how much education someone would get, but uncorrelated with anything unobserved and would affect wage only through how much education they acquire. The instrument he settled upon was an individual’s Vietnam draft lottery number, since this correlates with whether one goes to war and is then subject to the GI Bill, but since those numbers are randomly generated, they are uncorrelated with unobservables. How does IV estimation work, mechanically speaking? Recall that our equation of interest is 3 ���� = �� + ��1��1�� + ⋯ + ���������� + ������ + ����. (1) The way IV estimation proceeds is to first regress the endogenous variable �� on the instrument �� as well as on the control variables in �� = (��1, … , ����), such that ���� = �� + ��1��1�� + ⋯ + ���������� +
- 6. ������ + ����. (2) Once equation 2 is estimated, it is possible to predict the variable ��, whose prediction we label ��� (the circumflex accent – or “hat” – denotes a predicted variable in econometrics) and to then estimate equation 1 as follows ���� = �� + ��1��1�� + ⋯ + ���������� + ������� + ����. (1’) Note what has been done here: we have replaced the endogenous variable with an exogenized version of the same variable. The way it has been exogenized has been by regressing it on the IV, which is exogenous to the outcome of interest, and to obtain its predicted value, which we then use in lieu of the original endogenous variable. The first requirement of an instrument – i.e., that it be correlated with �� – is easily testable: we only need to check that the coefficient �� in equation 2 is significantly different enough from zero. The second requirement of an instrument – i.e., that it only affect the outcome of interest �� through the treatment variable – cannot be tested for. Rather, one must make the case that it is truly exogenous to the outcome of interest. This is easier said than done in most cases, as some people have devoted entire careers to finding good IVs. References Angrist, Joshua D. (1990), “Lifetime Earnings and the Vietnam Era Draft Lottery: Evidence from the Social Security Administrative Records,” American Economic Review
- 7. 80(3): 313-336. IntroductionMechanicsIdentificationInstrumental Variables EstimationReferences Therapeutic Communication 1. In the movie, Shutter Island, Dr. Crawley asked Teddy (Andrew) to explain what happened when he discovered his deceased children? What therapeutic communication is this an example of? a. Closed ended questions b. Using silence c. Giving broad opening d. Open ended questions Rationale: Opened ended questions is correct because this i s giving Andrew a chance to answer with more than a simple yes or no. Using silence is incorrect because he is not being quiet. Closed ended questions is incorrect because he cannot answer the question with just a yes or no answer. Giving broad
- 8. opening is incorrect because Dr. Crawley did ask Andrew to pick the topic and express his thoughts. 2. Dr. Crawley told Andrew if he had another episode, he would be lobotomized. In the ending of the movie, Teddy called Dr. Sheehan, Chuck. Dr. Sheehan nods his head at Dr. Crawley giving a signal. What therapeutic communication is this? a. Making observations b. Presenting reality c. Offering self d. Accepting Rationale: Making observations is correct because Dr. Sheehan observed that Teddy called him Chuck, verifying that he was having another episode. Presenting reality is incorrect because he was not trying to bring him back into reality. Offering self is incorrect because the doctors are not “interested” into Teddy. Accepting is incorrect because they are not accepting Teddy behavior as normal. 3. In the movie, Shutter Island, the staff of Ashecliffe allowed Andrew to play out the role of Teddy hoping to cure his conspiracy insanity. This is an example of what kind of therapeutic communication?
- 9. a. Accepting b. Exploring c. Engaging into fantasy d. Denial Rationale: Engaging into fantasy is correct because the staff played into his fantasy of finding Andrew Laeddis. Exploring is incorrect because exploring is delving further into a subject, idea, experience, or relationship. Denial is incorrect because no one was in denial about his state of mind. Accepting is incorrect because they know this act is only to cure his disease. 4. Which question shows an example of the therapeutic communication placing events in sequence? a. “You feel angry when he doesn’t help.” b. “Are you feeling…” c. “What could you do to let your anger out harmlessly?” d. “Will you please tell me more about the situation with all the details?”
- 10. Rationale: “Will you please tell me more about the situation with all the details” is an example of placing events in sequence because it will tell you more about the situation and you can piece together when they happened. The others are incorrect because “You feel angry when he doesn’t help” is an example of focusing, “Are you feeling…” is an example of verbalizing the implied and, “What could you do to let your anger out harmlessly?” is an example of formulating a plan of action. 5. Which of the following statements by Andrew Laeddis shows the therapeutic communication of understanding? a. I feel ok. b. This is a very difficult situation. c. Yes, I understand that Teddy Daniels does not exist. d. You are not listening to me. Rationale: “Yes, I understand that Teddy Daniels does not exist” is an example of understanding because it coveys an attitude of receptivity and regard. The other statements “I feel ok”, “This is a very difficult situation”, and “You are not listening to me” are incorrect
- 11. because they are responses of implied questions. Defense Mechanism 1. What defense mechanism does Teddy exhibit in the movie Shutter Island? a. Sublimination b. Rationalization c. Projection d. Fantasy Rationale: Teddy is exhibiting fantasy because he is gratifying frustrated desires by imaginary achievements. In the movie Teddy is creating his own fantasy world where he is still a US Marshall, and he creates his own story of what happened to his wife because he does not want to face the reality of what really happened. Sublimination is incorrect because he is not channeling an unacceptable impulse in a socially acceptable direction. Rationalization is incorrect because he is not trying to justify attitudes, beliefs, or behaviors. Projection is incorrect because he is not attributing his own unacceptable behavior unto someone else.
- 12. NUR 114 Nursing Concept II TYPES OF QUESTIONS • Open questions These are useful in getting another person to speak. They often begin with the words: What, Why, When, Who Sometimes they are statements: “tell me about”, “give me examples of”. They can provide you with a good deal of information. • Closed questions These are questions that require a yes or no answer and are useful for checking facts. They should be used with care - too many closed questions can cause frustration and shut down conversation. • Specific questions These are used to determine facts. For example “How much did you spend on that” • Probing questions
- 13. These check for more detail or clarification. Probing questions allow you to explore specific areas. However be careful because they can easily make people feel they are being interrogated . • Hypothetical questions These pose a theoretical situation in the future. For example, “What would you do if…?’ These can be used to get others to think of new situations. They can also be used in interviews to find out how people might cope with new situations. • Reflective questions You can use these to reflect back what you think a speaker has said, to check understanding. You can also reflect the speaker’s feelings, which is useful in dealing with angry or difficult people and for defusing emotional situations. • Leading questions. These are used to gain acceptance of your view – they are not useful in providing honest views and opinions. If you say to someone ‘you will be able to cope, won’t you?’ they may not like to disagree. You can use a series of different type of questions to “funnel” information. This is a way of structuring information in sequence to explore a topic and to get to the heart of the issues. You may use an open question, followed by a probing question, then a specific question and a reflective question.
- 14. 1 A PRIMER ON LINEAR REGRESSION Marc F. Bellemare∗ Introduction This set of lecture notes was written to allow you to understand the classical linear regression model, which is one of the most common tools of statistical analysis in the social sciences. Among other things, a regression allows the researcher to estimate the impact of a variable of interest �� on an outcome of interest �� holding other included factors �� = (��1, … , ����) constant. For the purposes of this discussion, let �� measure a given policy, �� measure welfare, and the vector �� measure the various control variables the researcher has seen fit to include. Example For example, one might be interested in the impact of individuals’ years of education �� on their wage �� while controlling for age, gender, race, state, sector of employment, etc. in ��. Generally, social science research is interested in the impact of a specific variable of interest on an outcome of interest, i.e., in the impact of �� on ��. Mechanics The regression of �� on (��, ��1, … , ����) is typically written as
- 15. ���� = �� + ��1��1�� + ⋯ + ���������� + ������ + ����, (1) where i denotes a unit of observation. In the example of wages and education, the unit of observation would be an individual, but units of observations can be individuals, households, plots, firms, villages, communities, countries, etc. Just as the research question should drive the choice of what to measure for ��, ��, and ��, the research question also drives the choice of the relevant unit of observation. When estimating equation 1, the researcher will have data on �� units of observations, so �� = 1, … , ��. Alternatively, we say that �� is the sample size. For each of those �� units, the researcher will have data on ��, ��, and ��. In other words, we will ignore the problem of missing data, as observations with missing data are usually dropped by most statistical packages. The role of regression analysis is to estimate the coefficients (��, ��1, … , ����, ��). To differentiate the “true” coefficients from coefficient estimates, we will use a circumflex accent (i.e., a “hat”) to denote estimated coefficients. Therefore, the estimated (��, ��1, … , ����, ��) will be denoted (���, �̂��1, … , �̂����, ���). Going back to our interest in estimating the impact of �� on �� at the margin, this impact is represented in the context of equation 1 by the parameter ��. Indeed, if you remember your partial derivatives, the marginal ∗ Associate Professor, Department of Applied Economics, and Director, Center for International Food and
- 16. Agricultural Policy, University of Minnesota, 1994 Buford Ave, Saint Paul, MN 55113, [email protected] This is the August 2017 version of this handout. mailto:[email protected] 2 impact of �� on �� at is equal to ���� ���� = ��. Moreover, the partial derivative is such that only �� varies. In other words, �� measures the impact of a change in �� on �� holding everything else constant, or ceteris paribus. In this case, what we mean by “everythi ng else” is limited only to the factors that are included in the vector �� of control variables. Whatever is not included among the variables �� is not held constant by regression analysis. Indeed, the relationship in equation 1 is not deterministic in the sense that even if we have data for ��, ��, and �� and credible parameter estimates (���, �̂��1, … , �̂����, ���), we will still not be able to perfectly forecast for ��. That is because there are several things about any given problem that we, as social science researchers, do not observe and are not privy to. Individuals have intrinsic motivations that even they may have difficulty expressing. Individuals make errors. Individuals experience unforeseen events. There are factors which are very important in determining �� but which we simply do not observe.
- 17. For all these reasons, we add an error term �� at the end of equation 1. The error term simply represents our ignorance about the problem. As such, it includes all of the things that we did not think of including on the right-hand side of equation 1, as well as all of the things that we could not include on the right-hand side of equation 1. The error term �� thus embodies our ignorance about the relationship between two variables. So how does a linear regression actually work? To take an example I know well, suppose we are looking at only two variables: rice yield (i.e., kg/are), which will be our outcome of interest �� since it represents agricultural productivity, and cultivated area (number of ares, or hundredths of a hectare, or 100 square meters), which will be our variable of interest ��. Indeed, the inverse relationship between farm or plot size and productivity has been a longstanding empirical puzzle in development microeconomics (Barrett et al., 2010). So, plotting some data on this question, we get the following figure. The scatter plot in figure 1 directly shows that the relationship between yield and cultivated area is not deterministic. That is, the relationship between the two variables is not a straight line, and the fact that the relationship is scattered indicates that there are other factors besides cultivated area that contributed to determining rice productivity. The role of the regression – and, as we will soon understand, of the error term – is to linearly approximate as best as possible the relationship between two variables. In other words, to do something that looks like the red line in figure 2.
- 18. 3 Figure 1. Rice Productivity Scatter Figure 2. Rice Productivity Scatter and Regression Line 1 2 3 4 5 Y ie ld -2 0 2 4 6 Cultivated Area 1 2 3 4 5
- 19. -2 0 2 4 6 Cultivated Area Yield Fitted values 4 Note that we indeed find an inverse relationship between plot size and productivity, since the regression line slopes downward, which means that in this context, ��� < 0. Indeed, running a simple regression of rice yield on cultivated area yields ��� = 4.187 (and we can see from the graph that 4.187 would indeed be the value of rice productivity at a cultivated area of zero ares) and ��� = −0.356, with both coefficients statistically different from zero at less than the 1 percent level (i.e., there is a less than one percent chance �� and �� are no different from zero). In other words, the finding here is that on average, �� = 4.187 − 0.356��, (2) or for every 1 percent increase in cultivated area, rice productivity decreases by 0.356 percent. This may seem counterintuitive, but remember that productivity is not total output – it is only a measure of average productivity on the plot. So how does the apparatus of the linear regression determine the value of the intercept and the value of the slope of the regression line in figure 2? This is where our error term comes into play. Indeed, a linear
- 20. regression will choose, among all possible lines, the one that minimizes the sum of the distances between each point in the scatter and the line itself, under the assumption that the error is on average equal to zero (i.e., that our predictions are right on average). Assuming that the error term is equal to zero on average and minimizing the sum of all point-line distances (technically, the sum of squared errors) allows us to obtain estimates ��� and ��� of the true parameters �� and ��. A few remarks are in order. First off, note that the constant term (or the intercept) �� does not have an economic interpretation in this case, since a cultivated area of zero really entails a yield of zero. Second, since the only factor we included on the right-hand side of equation 1 was cultivated area, the error term includes a lot of things which may be potentially crucial in determining yield. For example, the plot’s position on the toposequence, the quality of the soil, the source of irrigation of the plot, various characteristics of the household operating the plot, etc. So because we are typically interested in the impact of �� on �� controlling for a number of factors ��, regression results will not be presented in the form of figure 2. Indeed, regression results are typically presented in the form of table 1 at the end of this document. How do we interpret table 1? First off, note that N = 466. That is, we have data on 466 plots. The first column tells us what variables are included on the right-hand side of equation 1, viz. cultivated area; land value; total land owned by the household; household size (number of individuals); household dependency ratio (proportion of dependents within the household); whether the household head is a single female or a single male; whether the plot is irrigated by a dam, a spring, or
- 21. rainfed; soil quality measurements (carbon, nitrogen, potassium percentages; soil pH; clay, silt, and sand percentages); and an intercept. The second column shows the estimated coefficients for the first specification of equation 1 (in this case, a pooled cross - section of all the plots and all the households, i.e., a specification which ignores the fact that some households own more than just one plot in the sample); and the third column shows the standard errors around each estimated coefficient. These standard errors are used to determine whether each coefficient is statistically significantly different from zero or not. To make life simpler, table 1 shows whether coefficients are significant at the 10, 5, or 1 percent levels by using the symbols *, **, and *** respectively. Note that in all cases, there is a (significant) inverse relationship between cultivated area and rice productivity. 5 Taking column 1 as an example of how to interpret regression results, what can we say? First and foremost, note that for a 1 percent increase in cultivated area, there is an associated productivity decrease of 0.27 percent (alternatively, a doubling of the size of the plot would be associated with a 27-percent decrease in productivity). Moreover, we can note three things. First off, the more valuable a plot, the more productive it is; second, plots irrigated by a dam are more productive than plots without any irrigation; and third, plots irrigated by a spring are more productive than plots without any
- 22. irrigation. In fact, comparing the magnitude of the coefficient estimates for irrigation by a dam and irrigation by a spring, we see that the impacts of these two types of irrigation are essentially the same. Another thing of note in table 1 is how the coefficient on the variable of interest changes depending on what is included on the right-hand side of equation 1. Comparing the first two specifications (i.e., pooled cross- section vs. household fixed effects), note how the magnitude of the inverse relationship between productivity and cultivated area is reduced from -0.271 to - 0.176 when household fixed effects (i.e., controls for household-specific unobservables characteristics, which is made possible here because there are 286 households for 466 plots; in other words, there are some households who own more than one plot in the sample) are included. This indicates that a great deal of the inverse relationship can be attributed to household-specific, otherwise unobservable factors. Likewise, comparing specifications 1 and 3 (i.e., pooled cross-section vs. soil quality), note again how the magnitude of the inverse relationship between productivity and cultivated area is reduced from -0.271 to -0.265 when soil quality measurements are included. Overall, this indicates that household-specific, unobserved factors are more important in driving the inverse relationship than the omission of soil quality measurements. In any event, a comparison of specifications 1 and 2 and of specifications 1 and 3 point to an important endogeneity problem (in this case, an omitted variables problem) caused by the omission, respectively, of household fixed effects and of soil quality measurements.
- 23. References Barrett, Christopher B., Marc F. Bellemare, and Janet Y. Hou (2010), “Reconsidering Conventional Explanations of the Inverse Productivity—Size Relationship,” World Development 38(1): 88-97. 6 Table 1 – Yield Approach Estimation Results (n=466) (1) Pooled Cross-Section (2) Household Fixed Effects (3) Soil Quality (4) Household Fixed Effects and Soil Quality Variable Coefficient (Std. Err.) Coefficient (Std. Err.) Coefficient (Std. Err.) Coefficient (Std. Err.)
- 24. Dependent Variable: Rice Yield (Kilograms/Are) Cultivated Area -0.271*** (0.038) -0.176*** (0.046) -0.265*** (0.048) -0.187*** (0.052) Total Land Area -0.055 (0.038) -0.054 (0.047) Land Value 0.183*** (0.031) 0.303*** (0.069) 0.176*** (0.032) 0.287*** (0.063) Household Characteristics Household Size -0.007 (0.009) -0.008 (0.008) Dependency Ratio -0.073 (0.130) -0.083 (0.144) Single Female -0.056 (0.111) -0.070 (0.119) Single Male 0.133 (0.134) 0.122 (0.155) Plot Characteristics Irrigated by Dam 0.389** (0.171) 0.228 (0.202) 0.450** (0.211) 0.545 (0.402) Irrigated by Spring 0.365** (0.175) 0.250 (0.214) 0.425** (0.214) 0.541 (0.389) Irrigated by Rain 0.184 (0.180) 0.024 (0.217) 0.249 (0.220) 0.313 (0.431) Soil Quality Measurements Carbon -1.361 (1.510) -0.001 (1.844) Nitrogen 1.668 (1.781) -0.007 (2.750) pH -1.064 (7.163) -17.969 (14.459) Potassium 1.183 (1.412) -5.528* (3.035) Clay 0.293 (3.115) -5.183 (4.174) Silt 0.521 (5.751) 4.485 (11.681) Sand -0.135 (5.607) 5.261 (7.106) Intercept -1.847*** (0.372) -3.162*** (0.694) -2.276 (1.552) - 3.328*** (0.819) Number of Households – 286 – 286 Bootstrap Replications – – 500 500 Village Fixed Effects Yes Dropped Yes Dropped
- 25. R2 0.45 0.97 0.46 0.97 p-value (All Coefficients) 0.00 0.00 0.00 0.00 p-value (Fixed Effects) – 0.00 – 0.00 p-value (Soil Quality) – – 0.79 0.52 ***, ** and * indicate statistical significance at the one, five and ten percent levels, respectively. IntroductionExampleMechanicsReferences Ordinary Least-Squares Ordinary Least-Squares Ordinary Least-Squares Ordinary Least-Squares Ordinary Least-Squares One-dimensional regression x y Ordinary Least-Squares One-dimensional regression y = ax
- 26. Find a line that represent the ”best” linear relationship: x y Ordinary Least-Squares One-dimensional regression iiie = y - x a • Problem: the data does not go through a line iiy - x a x y Ordinary Least-Squares One-dimensional regression iiie = y - x a • Problem: the data does not go through a line
- 27. • Find the line that minimizes the sum: i iiå(y - x a)2 iiy - x a x y Ordinary Least-Squares One-dimensional regression x ̂ iiie = y - x a i i = å(y -x a )2e(a) • Problem: the data does not go through a line • Find the line that minimizes the sum: • We are looking for that minimizes i iiå(y - x a)2
- 28. iiy - x a x y Ordinary Least-Squares Multidimentional linear regression Using a model with m parameters å= j jjmm xay = x a + ...+ x a11 Ordinary Least-Squares Multidimentional linear regression Using a model with m parameters 2x å= j jjmm xay = a x + ...+ a x11 1
- 29. x y Ordinary Least-Squares Multidimentional linear regression Using a model with m parameters 2a å=++= j jjmm xaxaxab ...11 1a b Ordinary Least-Squares Multidimentional linear regression Using a model with m parameters and n measurements å=+ j
- 30. jjmm xaxay = a x + ...11 2 2 1 , 2 1 1 , )( y (a) y - Ax= ú û ù ê ë é -= -= å å å
- 31. = = = m j jji n i m j jjii xa xaye Ordinary Least-Squares Multidimentional linear regression Using a model with m parameters and n measurements å=+ j jjmm xaxay = a x + ...11
- 32. 2 2 1 , 2 1 1 , )( (a) y (a) y - Ax= ú û ù ê ë é -= -= å å å
- 33. = = = e x a x aye m j jji n i m j jjii Ordinary Least-Squares Minimizing e(a) amin minimizes e(a) if Ordinary Least-Squares
- 34. x y ei no errors in ai Ordinary Least-Squares x y ei x y ei no errors in xi errors in xi Ordinary Least-Squares x y homogeneous errors
- 35. Ordinary Least-Squares x y x y homogeneous errors non-homogeneous errors Ordinary Least-Squares x y no outliers Ordinary Least-Squares x y x y
- 36. no outliers outliers outliers