## Group Exercise: Building a Causal Model

Use the CILS data to test a sociological causal model. Calculate the least squares or logistic regression, as appropriate. Interpret the results.

```
CILS2010<-read.csv("http://www.courseserve.info/files/CILS2010.csv")
attach(CILS2010)
```
```
# Index: sum of V24 through V27
Eng=0; Eng<-V24+V25+V26+V27
# Binary version of the index: 1 if Eng > 15, else 0
Eng2=0; Eng2[Eng>15]<-1; Eng2[Eng<=15]<-0

# Least squares linear regression
summary(lm(V60~Eng+V18+V19+V55+V72+V44+V11+V17))

Call:
lm(formula = V60 ~ Eng + V18 + V19 + V55 + V72 + V44 + V11 + V17)

Residuals:
    Min      1Q  Median      3Q     Max
-3.4792 -0.3826  0.2338  0.4215  1.4193

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.9532905  0.2894521  13.658  < 2e-16 ***
Eng          0.0790067  0.0092637   8.529  < 2e-16 ***
V18          0.2221707  0.0272793   8.144 5.86e-16 ***
V19         -0.0635706  0.0158466  -4.012 6.20e-05 ***
V55          0.0017498  0.0594642   0.029   0.9765
V72         -0.0013663  0.0007504  -1.821   0.0688 .
V44         -0.0399862  0.0165164  -2.421   0.0155 *
V11          0.0960830  0.0400204   2.401   0.0164 *
V17          0.0193007  0.0382207   0.505   0.6136
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6816 on 2598 degrees of freedom
  (2655 observations deleted due to missingness)
Multiple R-squared:  0.07556,  Adjusted R-squared:  0.07271
F-statistic: 26.54 on 8 and 2598 DF,  p-value: < 2.2e-16

# Logistic regression
# Binary DV: 1 if V60 >= 4, else 0
Degree=0; Degree[V60>=4]<-1; Degree[V60<4]<-0
summary(glm(Degree~Eng+V18+V19+V55+V72+V44+V11+V17,family="binomial"))

Call:
glm(formula = Degree ~ Eng + V18 + V19 + V55 + V72 + V44 + V11 + V17,
    family = "binomial")

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-2.8453  0.2487  0.3098  0.3961  1.5044

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.159578   1.586683   1.361 0.173493
Eng         0.269895   0.041050   6.575 4.87e-11 ***
```
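The logistic coefficients above are in logits, so they are easier to interpret once exponentiated into odds ratios. The following is a minimal sketch of that conversion, not part of the original session: the value of `b_eng` is copied from the Eng row of the output above, while the second half uses simulated data (the names `x`, `y`, and `fit` are hypothetical) so that it runs without the CILS file.

```r
# Coefficient for Eng, copied from the logistic regression output above.
b_eng <- 0.269895

# exp() converts a log odds ratio (logit coefficient) into an odds ratio.
or_eng <- exp(b_eng)
print(round(or_eng, 2))

# Percent change in the odds of Degree = 1 for a one-point increase in Eng,
# holding the other predictors constant.
pct_change <- (or_eng - 1) * 100
print(round(pct_change, 1))

# The same conversion applied to a fitted model.
# ASSUMPTION: simulated data stand in for the real variables.
set.seed(42)
n <- 500
x <- rnorm(n)                        # hypothetical numeric predictor
y <- rbinom(n, 1, plogis(0.5 * x))   # hypothetical binary outcome
fit <- glm(y ~ x, family = "binomial")
odds_ratios <- exp(coef(fit))        # one odds ratio per coefficient
print(odds_ratios)
```

Read this way, the Eng coefficient corresponds to an odds ratio of about 1.31: each additional point on the index multiplies the odds of Degree = 1 by roughly that factor, assuming the rest of the model is held fixed.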
Professor Shortell's blog
## In Anticipation of the Final Examination

Wed, 12/04/2013 - 6:42pm, Professor Shortell

Use the comments to this page to suggest topics to include in our review session for the final examination. The review session will be M 12/16. The exam is due M 12/23.

## Group Exercise: Logistic Regression 2

Mon, 12/02/2013 - 4:39pm, Professor Shortell

Your task is to improve upon the model in the example we discussed (predicting the odds of identifying as a Republican) by adding some attitudinal variables and other demographics. You may need to do some recoding to create binary variables. Explain your model and discuss the results.

Perform the significance test using the anova() function to see if your model fits significantly better than the example. Post your interpretation as a comment to this page. BONUS point to the group with the best fitting model.

As you may recall, the form for the significance test is:

```
anova(glm(repub~spending+married+dem_unionhh+black+hispanic+collegegrad+northeast+northcentral+south,family="binomial"),
      glm(repub~spending+married+dem_unionhh+black+hispanic+collegegrad+northeast+northcentral+south+YOUR_ADDITIONAL_VARIABLES_HERE,family="binomial"),
      test="Chisq")
```

## More on Logistic Regression

Mon, 12/02/2013 - 4:37pm, Professor Shortell

```
# We'll look at another example of logistic regression and
# model specification. This time, we'll use the ANES 2012.

ANES2012<-read.csv("http://www.courseserve.info/files/ANES2012r.csv")
attach(ANES2012)

# First, we'll create a binary vector for party affiliation
# that compares just Democrats and Republicans. With the
# new variable, "repub", we'll be building a model that
# tries to predict the characteristics associated with the
# odds of identifying as a Republican rather than a Democrat.

repub=0; repub[pid_self==1]<-0; repub[pid_self==2]<-1

# Now, we'll specify our base model.
# We'll use the index that measures attitude toward federal
# spending that we created earlier. The source code is here:

source("http://www.courseserve.info/files/SOCY7112spending.r")

# Let's create a binary vector for marital status (married, not married):

married=0; married<-ifelse(dem_marital==1,1,0)

# We'll use a pair of binary vectors to measure race. The
# code is here:

source("http://www.courseserve.info/files/SOCY7112race.r")

# We'll use binary vectors for education and region:

source("http://www.courseserve.info/files/SOCY7112degree.r")
source("http://www.courseserve.info/files/SOCY7112anesregion.r")

# Now we can run the base model.

summary(glm(repub~spending+married+dem_unionhh+black+hispanic+collegegrad+northeast+northcentral+south,family="binomial"))
```

## Group Exercise: Logistic Regression

Mon, 11/25/2013 - 7:26pm, Professor Shortell

Select a binary dependent variable that you want to explain. Select five or more independent variables. You may need to recode variables before you can include them in your model.

If you use USCounties, you can use the R scripts I've prepared to create binary vectors for EconType04 and Region. Use the source("...") function:

- http://www.courseserve.info/files/SOCY7112factors.r (for EconType04)
- http://www.courseserve.info/files/SOCY7112regions.r (for Region)

Perform the hypothesis test and interpret the results.

## Introduction to Logistic Regression

Mon, 11/25/2013 - 6:36pm, Professor Shortell

We use logistic regression when we want to model a dependent variable that is binary. (Ordinary least squares regression requires a numeric dependent variable, as you will recall.) Let's consider an example where we are trying to account for why some US counties experience persistent poverty.

Note that the regression function is glm() rather than lm(). With glm(), the generalized linear model, we need to specify which distribution to use for the test; since we are testing a binary dependent variable, we specify family="binomial".

The regression coefficients are expressed in logits, that is, the log of the odds ratios. It is useful to convert the coefficients back to odds ratios, which we do with exp(coef(glm(...))).

The goodness of fit is expressed by deviance (or -2LL). We use ANOVA to test for the significance of the model: we take the empty model, with only the intercept and no predictors, as the baseline, and then compare the full model against it to see if it accounts for more of the variation in the dependent variable. When it makes sense to do so, we can compare a partial model (instead of the empty model) with the full model. In this case, we are testing for the increase in explanatory power when we add the region factor.

http://www.courseserve.info/files/SOCY7112logistic.session

## Group Exercise: Comparing Regression Models

Mon, 11/18/2013 - 6:14pm, Professor Shortell

Identify a numeric dependent variable. Select at least five potential independent variables. You may need to create an index or create binary variables. Identify the independent variable that you think is the most important causal factor.

First, calculate the regression model without the variable you selected as most important. Note the goodness of fit of the incomplete model.

Next, calculate the model with the most important variable included (along with the others). Note the change in goodness of fit. This is the causal contribution of the most important variable (as you identified it). Interpret the results, if appropriate, of the second model.

Post your results as a comment to this page. Include everyone's name in your group.

## Recoding and Indexing

Mon, 11/18/2013 - 6:08pm, Professor Shortell

```
# We'll create a simple index variable as a sum of a series of related
# variables. This makes sense if all the variables are on the same
# answer set and if we can identify an order to the answer set -- that
# is, the level of measurement is ordinal.
#
# It will be necessary to recode the variables in this example prior
# to creating the sum score.

ANES2012<-read.csv("http://www.courseserve.info/files/ANES2012r.csv")
attach(ANES2012)

# We'll use a series of questions measuring attitude toward federal
# spending. The original variables are measured on a three point answer
# set, where 1=increase, 2=decrease, and 3=keep same. We can consider
# this an ordinal scale, from liberal to conservative (with regard to
# the role of government), if we recode to put the 'keep same' answer
# in the middle.

spend1=0; spend1[fedspend_ss==1]<-2; spend1[fedspend_ss==2]<-0; spend1[fedspend_ss==3]<-1
spend2=0; spend2[fedspend_schools==1]<-2; spend2[fedspend_schools==2]<-0; spend2[fedspend_schools==3]<-1
spend3=0; spend3[fedspend_scitech==1]<-2; spend3[fedspend_scitech==2]<-0; spend3[fedspend_scitech==3]<-1
spend4=0; spend4[fedspend_crime==1]<-2; spend4[fedspend_crime==2]<-0; spend4[fedspend_crime==3]<-1
spend5=0; spend5[fedspend_welfare==1]<-2; spend5[fedspend_welfare==2]<-0; spend5[fedspend_welfare==3]<-1
spend6=0; spend6[fedspend_child==1]<-2; spend6[fedspend_child==2]<-0; spend6[fedspend_child==3]<-1
spend7=0; spend7[fedspend_poor==1]<-2; spend7[fedspend_poor==2]<-0; spend7[fedspend_poor==3]<-1
spend8=0; spend8[fedspend_enviro==1]<-2; spend8[fedspend_enviro==2]<-0; spend8[fedspend_enviro==3]<-1

# Now we can create an index variable by summing the new spendX variables.

spending<-spend1+spend2+spend3+spend4+spend5+spend6+spend7+spend8

# Let's look at another example of recoding.

married=0; married<-ifelse(dem_marital==1,1,0)

# Now let's build a least squares regression model to predict attitude toward spending.
```

## Group Exercise: Multiple Least Squares Regression

Mon, 11/11/2013 - 6:42pm, Professor Shortell

Select a DV related to party affiliation or political view. (Remember, your DV must be numeric.) Identify at least five IVs that you believe will predict your DV. (Remember, your IVs must be numeric or binary.) Perform the hypothesis test and interpret the results. Discuss your model specification.

Post your work as a comment to this page. Include the names of all group members in your post.

## An Example of Correlation and Regression

Mon, 11/11/2013 - 6:34pm, Professor Shortell

I've set up the most recent edition of the ANES. Let's look at an example.

```
ANES2012<-read.csv("http://www.courseserve.info/files/ANES2012r.csv")
attach(ANES2012)
summary(ftgr_tea)
plot(ftgr_rich,ftgr_tea)
abline(lm(ftgr_tea~ftgr_rich))
cor.test(ftgr_tea,ftgr_rich,method="pearson")
summary(lm(ftgr_tea~ftgr_rich))
summary(lm(ftgr_tea~ftgr_rich+ft_rep+ft_dem+ftgr_xfund+ftgr_bigbus+ftgr_fedgov+gender_respondent))
```

It is clear that linear regression is more informative than correlations, and that multiple least squares regression is a sophisticated modeling tool. We'll practice with some additional examples.

Some things to consider:

- The DV must be numeric
- The IVs (predictors) can be numeric or binary
- Predictors should not be highly correlated (above 0.9, for example)
- Model specification is very important

I've created an R script to create a series of binary variables for race. To run the script, at the prompt, type:

```
source("http://www.courseserve.info/files/SOCY7112racevectors.r")
```

and hit ENTER.

## Update to Examination 2

Mon, 11/11/2013 - 4:45pm, Professor Shortell

I have deleted question 16 (about the table in Langford and Mackinnon) from the assignment. I should have moved the reading to the next section, where it would make more sense. I won't grade that question, so you can leave it blank.

## Group Exercise: Factorial Analysis of Variance

Mon, 11/04/2013 - 6:47pm, Professor Shortell

We'll use the ANES for this exercise. Find a new dependent variable (numeric) to test for the effects of gender (V081101) and region (V081204). Perform the hypothesis test. Interpret the results.

Post your response as a comment to this page. Remember to include the names of everyone in your group.

## Group Exercise: Analysis of Variance

Mon, 10/28/2013 - 7:55pm, Professor Shortell

Pick a numeric DV (one of the Feeling Thermometers) and compare by Party Affiliation (V083097). Perform the hypothesis test.

## Group Exercise: Hypothesis Testing with Means

Mon, 10/21/2013 - 6:08pm, Professor Shortell

Perform a one group or two group hypothesis test with means. State the research and null hypotheses. Produce the table with R and discuss the results. What would be a good follow-up hypothesis?

Post your work as a comment to this page. Remember to include everyone's name.

## Group Exercise: Confidence Interval

Tue, 10/15/2013 - 3:00pm, Professor Shortell

We can estimate the confidence interval for a mean using the t.test function. You need, of course, a numeric variable to calculate a mean. Identify the variable and a population value, mu, to test for (which should be derived from theory or make sense at face value).

In the World95 data, we can use the percent population change variable to test for a population increase.

```
> t.test(pop_incr, mu=0)
```

This results in:

```
	One Sample t-test

data:  pop_incr
t = 14.667, df = 108, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 1.455019 1.909752
sample estimates:
mean of x
 1.682385
```

We can also test for a population proportion, using the prop.test function. In this example, we'll use the variable Q2 from the CBS2011 dataframe. (This variable asks respondents if they think the country is going in the right direction, 1, or the wrong direction, 2.) We can see the table with the CrossTable function.

```
> CrossTable(Q2)
```

Which gives us:

```
   Cell Contents
|-------------------------|
|                       N |
```

## Group Exercise: Odds Ratio

Mon, 10/07/2013 - 6:14pm, Professor Shortell

Identify two binary variables in one of our dataframes. (You want to be able to test a causal relationship, so identify one as the potential cause and the other as the effect.) Calculate the odds ratio and Fisher test and interpret the results.

Codebooks:

- http://www.courseserve.info/files/ABC2010.pdf
- http://www.courseserve.info/files/CBS2011.pdf
- http://www.courseserve.info/files/ANES2008_Codebook.pdf

Post your work as a comment to this page. Remember to include the names of everyone in your group.

## Group Exercise: Central Tendency and Variability

Mon, 10/07/2013 - 5:32pm, Professor Shortell

Identify one numeric variable in one of our dataframes. Calculate the measures of central tendency and dispersion and interpret the results.

Codebooks:

- http://www.courseserve.info/files/ABC2010.pdf
- http://www.courseserve.info/files/CBS2011.pdf
- http://www.courseserve.info/files/ANES2008_Codebook.pdf
- http://www.courseserve.info/files/world95.pdf
- http://www.courseserve.info/files/PUMS2000.pdf
- http://www.courseserve.info/files/USCounties.pdf

Post your work as a comment to this page. Remember to include the names of everyone in your group.

## Group Exercise: Graphing

Mon, 09/30/2013 - 3:17pm, Professor Shortell

Produce a scatterplot of two numeric variables and describe the relationship. (Bonus for adding the regression line to the graph. Attach your graph as a PNG or JPG file to your comment.)

Codebooks:

- http://www.courseserve.info/files/ABC2010.pdf
- http://www.courseserve.info/files/CBS2011.pdf
- http://www.courseserve.info/files/ANES2008_Codebook.pdf
- http://www.courseserve.info/files/world95.pdf

Post your work as a comment to this page. Remember to include the names of everyone in your group.

## Group Exercise: Percentage Tables

Mon, 09/30/2013 - 1:52pm, Professor Shortell

To reinforce our discussion of levels of measurement and percentage tables, and to give you some more practice with R, we'll look one more time at crosstabulations.

Use the Measurement script to guide your R commands: http://www.courseserve.info/files/SOCY7112meas.r

The dataframe is: http://www.courseserve.info/files/CBS2011r.csv

Codebook: http://www.courseserve.info/files/CBS2011.pdf

In this exercise, I want you to produce two percentage tables. First, identify a dependent variable that you want to explain. You must select a categorical variable, since we are using crosstabs. Produce a table in R for that variable and interpret the results in a sentence or two.

Next, find an independent variable that you think might influence your dependent variable and run the two-way percentage table. Interpret the results. What can you say sociologically about the effect of your independent variable on your dependent variable?

Post your response as a comment to this page. Include the names of all group members.

All content on this site is copyright © 2005-2013 by Professor Timothy Shortell, unless retained by the original owners. No infringement of rights is intended or implied.
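The Introduction to Logistic Regression post describes testing a model by comparing deviances: empty model against full model, or partial model against full model. The sketch below illustrates that comparison with anova(..., test="Chisq"). It is an illustration under stated assumptions, not the course's session file: the data are simulated, and the names `income`, `south`, and `poor` are hypothetical stand-ins for county-level variables.

```r
# ASSUMPTION: simulated data; variable names are hypothetical.
set.seed(7112)
n <- 1000
income <- rnorm(n)               # hypothetical standardized numeric predictor
south  <- rbinom(n, 1, 0.3)      # hypothetical binary region vector
poor   <- rbinom(n, 1, plogis(-1 - 0.8 * income + 0.9 * south))  # binary DV

# Three nested models: intercept only, one predictor, both predictors.
empty   <- glm(poor ~ 1, family = "binomial")
partial <- glm(poor ~ income, family = "binomial")
full    <- glm(poor ~ income + south, family = "binomial")

# Overall test: does the full model fit better than the empty model?
overall <- anova(empty, full, test = "Chisq")
print(overall)

# Incremental test: does adding the region vector improve on the partial model?
added <- anova(partial, full, test = "Chisq")
print(added)

# The test statistic is the drop in deviance (-2LL) between the two models.
drop_dev <- deviance(partial) - deviance(full)
print(drop_dev)
```

With real survey data, the models being compared need to be fit on the same cases; if the added variable has missing values, the two deviances are not directly comparable until the data are restricted to complete cases.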