Group Exercise: Building a Causal Model

Use the CILS data to test a sociological causal model. Calculate the least squares or logistic regression, as appropriate. Interpret the results.

Post your interpretation as a comment to this page.

Comparing Least Squares and Logistic Regression



CILS2010<-read.csv("http://www.courseserve.info/files/CILS2010.csv")
attach(CILS2010)

Eng<-V24+V25+V26+V27
Eng2<-ifelse(Eng>15,1,0)

# Least squares linear regression
summary(lm(V60~Eng+V18+V19+V55+V72+V44+V11+V17))

Call:
lm(formula = V60 ~ Eng + V18 + V19 + V55 + V72 + V44 + V11 +
    V17)

Residuals:
    Min      1Q  Median      3Q     Max
-3.4792 -0.3826  0.2338  0.4215  1.4193

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.9532905  0.2894521  13.658  < 2e-16 ***
Eng          0.0790067  0.0092637   8.529  < 2e-16 ***
V18          0.2221707  0.0272793   8.144 5.86e-16 ***
V19         -0.0635706  0.0158466  -4.012 6.20e-05 ***
V55          0.0017498  0.0594642   0.029   0.9765
V72         -0.0013663  0.0007504  -1.821   0.0688 .
V44         -0.0399862  0.0165164  -2.421   0.0155 *
V11          0.0960830  0.0400204   2.401   0.0164 *
V17          0.0193007  0.0382207   0.505   0.6136
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6816 on 2598 degrees of freedom
(2655 observations deleted due to missingness)
Multiple R-squared: 0.07556, Adjusted R-squared: 0.07271
F-statistic: 26.54 on 8 and 2598 DF, p-value: < 2.2e-16


# Logistic regression
Degree<-ifelse(V60>=4,1,0)

summary(glm(Degree~Eng+V18+V19+V55+V72+V44+V11+V17,family="binomial"))

Call:
glm(formula = Degree ~ Eng + V18 + V19 + V55 + V72 + V44 + V11 +
    V17, family = "binomial")

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.8453   0.2487   0.3098   0.3961   1.5044

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.159578   1.586683   1.361 0.173493
Eng         0.269895   0.041050   6.575 4.87e-11 ***

In Anticipation of the Final Examination

Use the comments to this page to suggest topics to include in our review session for the final examination. The review session will be M 12/16. The exam is due M 12/23.

Group Exercise: Logistic Regression 2

Your task is to improve upon the model in the example we discussed (predicting the odds of identifying as a Republican) by adding some attitudinal variables and other demographics. You may need to do some recoding to create binary variables.

Explain your model and discuss the results. Perform the significance test using the anova() function to see if your model fits significantly better than the example. Post your interpretation as a comment to this page.

BONUS point to the group with the best fitting model.

As you may recall, the form for the significance test is:
anova(glm(repub~spending+married+dem_unionhh+black+hispanic+collegegrad+northeast+northcentral+south,family="binomial"), glm(repub~spending+married+dem_unionhh+black+hispanic+collegegrad+northeast+northcentral+south+YOUR_ADDITIONAL_VARIABLES_HERE,family="binomial"), test="Chisq")
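To avoid retyping the whole base model, one option is to store it in an object and extend it with update(). The sketch below runs on simulated data, not the ANES; the dataframe `anes_sim` and the added variable `attends_church` are hypothetical stand-ins for whatever variables your group chooses.

```r
# Simulated stand-in data; none of these values come from the ANES.
set.seed(11)
n <- 400
anes_sim <- data.frame(
  married        = rbinom(n, 1, 0.5),
  collegegrad    = rbinom(n, 1, 0.3),
  attends_church = rbinom(n, 1, 0.4)   # hypothetical additional predictor
)
anes_sim$repub <- rbinom(n, 1,
  plogis(-0.5 + 0.5 * anes_sim$married + 0.8 * anes_sim$attends_church))

# Base model, stored so it can be reused.
base <- glm(repub ~ married + collegegrad, data = anes_sim, family = "binomial")

# update() keeps everything in the base formula (". ~ .") and adds a term.
bigger <- update(base, . ~ . + attends_church)

# Chi-square test on the drop in deviance between the two nested models.
comparison <- anova(base, bigger, test = "Chisq")
print(comparison)
```

Both models need to be fitted to the same cases. If an added variable has missing values that the base model's variables do not, the two fits use different rows and the comparison is not valid, so it may help to subset to complete cases first.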

More on Logistic Regression

# We'll look at another example of logistic regression and
# model specification. This time, we'll use the ANES 2012.

ANES2012<-read.csv("http://www.courseserve.info/files/ANES2012r.csv")
attach(ANES2012)

# First, we'll create a binary vector for party affiliation
# that compares just Democrats and Republicans. With the
# new variable, "repub", we'll be building a model that
# tries to predict the characteristics associated with the
# odds of identifying as a Republican rather than a Democrat.

# Respondents who are neither Democrats (1) nor Republicans (2) are set to NA.
repub<-ifelse(pid_self==1,0,ifelse(pid_self==2,1,NA))

# Now, we'll specify our base model.

# We'll use the index that measures attitude toward federal
# spending that we created earlier. The source code is here:
source("http://www.courseserve.info/files/SOCY7112spending.r")

# Let's create a binary vector for marital status (married, not married):
married=0; married<-ifelse(dem_marital==1,1,0)

# We'll use a pair of binary vectors to measure race. The
# code is here:
source("http://www.courseserve.info/files/SOCY7112race.r")

# We'll use binary vectors for education and region:
source("http://www.courseserve.info/files/SOCY7112degree.r")
source("http://www.courseserve.info/files/SOCY7112anesregion.r")

# Now we can run the base model.

summary(glm(repub~spending+married+dem_unionhh+black+hispanic+collegegrad+northeast+northcentral+south,family="binomial"))

Group Exercise: Logistic Regression

Select a binary dependent variable that you want to explain. Select five or more independent variables. You may need to recode variables before you can include them in your model.

If you use USCounties, you can use the R scripts I've prepared to create binary vectors for EconType04 and Region. Use the source("...") function:
http://www.courseserve.info/files/SOCY7112factors.r -- for EconType04
http://www.courseserve.info/files/SOCY7112regions.r -- for Region

Perform the hypothesis test and interpret the results.

Introduction to Logistic Regression

We use logistic regression when we want to model a dependent variable that is binary. (Ordinary least squares regression requires a numeric dependent variable, as you will recall.)

Let's consider an example where we are trying to account for why some US counties experience persistent poverty. Note that the regression function is glm() rather than lm(). With glm() -- the generalized linear model -- we need to specify which distribution to use for the model; since we are modeling a binary dependent variable, we specify family="binomial". The regression coefficients are expressed in logits -- that is, each coefficient is the change in the log odds of the outcome for a one-unit increase in the predictor. It is useful to convert the coefficients to odds ratios, which we do with exp(coef(glm(...))).
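A minimal sketch of these steps on simulated data, not the USCounties file. The dataframe `counties_sim` and the variables `poverty`, `unemp`, and `metro` are hypothetical stand-ins.

```r
# Simulated stand-in data; all variable names here are hypothetical.
set.seed(42)
n <- 500
counties_sim <- data.frame(
  unemp = rnorm(n, mean = 6, sd = 2),   # numeric predictor
  metro = rbinom(n, 1, 0.4)             # binary predictor (1 = metropolitan)
)
# Generate a binary outcome whose log odds depend on the predictors.
log_odds <- -3 + 0.4 * counties_sim$unemp - 0.8 * counties_sim$metro
counties_sim$poverty <- rbinom(n, 1, plogis(log_odds))

# Logistic regression: glm() with family="binomial" uses the logit link.
fit <- glm(poverty ~ unemp + metro, data = counties_sim, family = "binomial")
print(summary(fit))

# The coefficients are in logits; exponentiating them gives odds ratios.
odds_ratios <- exp(coef(fit))
print(round(odds_ratios, 3))
```

An odds ratio above 1 means the odds of the outcome rise as the predictor increases; below 1 means they fall.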

The goodness of fit is expressed by deviance (or, -2LL); a smaller deviance means a better fit. We use the anova() function to test for the significance of the model. The empty model, using only the intercept and no predictors, serves as the baseline; we then test whether the full model reduces the deviance by more than we would expect by chance.

When it makes sense to do so, we can test a partial model (instead of the empty model) and the full model. In this case, we are testing for the increase in explanatory power when we add the region factor.
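The two comparisons can be sketched as follows, again on simulated data. The dataframe `d` and the variables `poverty`, `unemp`, and `region` are hypothetical, and the session file linked below remains the reference for the actual example.

```r
# Simulated stand-in data; variable names are hypothetical.
set.seed(7)
n <- 600
d <- data.frame(
  unemp  = rnorm(n, mean = 6, sd = 2),
  region = factor(sample(c("Northeast", "Midwest", "South", "West"),
                         n, replace = TRUE))
)
d$poverty <- rbinom(n, 1,
  plogis(-3 + 0.4 * d$unemp + 0.9 * (d$region == "South")))

empty   <- glm(poverty ~ 1,              data = d, family = "binomial")
partial <- glm(poverty ~ unemp,          data = d, family = "binomial")
full    <- glm(poverty ~ unemp + region, data = d, family = "binomial")

# Empty model vs. full model: does the model as a whole fit better
# than the intercept alone?
test_full <- anova(empty, full, test = "Chisq")
print(test_full)

# Partial model vs. full model: does adding the region factor help?
test_region <- anova(partial, full, test = "Chisq")
print(test_region)

# The test statistic is simply the drop in deviance between the models.
deviance_drop <- deviance(partial) - deviance(full)
print(deviance_drop)
```

The degrees of freedom for the region test equal the number of dummy variables the factor adds (three for a four-category factor).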

http://www.courseserve.info/files/SOCY7112logistic.session

Group Exercise: Comparing Regression Models

Identify a numeric dependent variable. Select at least five potential independent variables. You may need to create an index or create binary variables. Identify the independent variable that you think is the most important causal factor.

First, calculate the regression model without the variable you selected as most important. Note the goodness of fit of the incomplete model. Next, calculate the model with the most important variable included (along with the others). Note the change in goodness of fit. This is the causal contribution of the most important variable (as you identified it). Interpret the results, if appropriate, of the second model.
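A short sketch of this two-step comparison, using simulated data. The variables `y`, `x1`, `x2`, and `key` are hypothetical, with `key` standing in for the variable you picked as most important.

```r
# Simulated stand-in data; variable names are hypothetical.
set.seed(3)
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), key = rnorm(n))
d$y <- 1 + 0.5 * d$x1 - 0.3 * d$x2 + 0.8 * d$key + rnorm(n)

# Step 1: the model without the variable of interest.
reduced <- lm(y ~ x1 + x2, data = d)
# Step 2: the same model with the variable added.
full <- lm(y ~ x1 + x2 + key, data = d)

# Goodness of fit for each model and the change between them.
r2_reduced <- summary(reduced)$r.squared
r2_full    <- summary(full)$r.squared
r2_change  <- r2_full - r2_reduced
print(c(reduced = r2_reduced, full = r2_full, change = r2_change))

# F test for whether the change in fit is statistically significant.
f_test <- anova(reduced, full)
print(f_test)
```

Because the models are nested, R-squared cannot go down when a variable is added; the F test indicates whether the increase is larger than chance would produce.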

Post your results as a comment to this page. Include everyone's name in your group.

Recoding and Indexing

# We'll create a simple index variable as a sum of a series of related
# variables. This makes sense if all the variables are on the same
# answer set and if we can identify an order to the answer set -- that
# is, the level of measurement is ordinal.
#
# It will be necessary to recode the variables in this example prior
# to creating the sum score.

ANES2012<-read.csv("http://www.courseserve.info/files/ANES2012r.csv")
attach(ANES2012)

# We'll use a series of questions measuring attitude toward federal
# spending. The original variables are measured on a three point answer
# set, where 1=increase, 2=decrease, and 3=keep same. We can consider
# this an ordinal scale, from liberal to conservative (with regard to
# the role of government), if we recode to put the 'keep same' answer
# in the middle.
# Each vector starts as all NA, so any answer other than 1, 2, or 3 stays missing.
spend1<-rep(NA,length(fedspend_ss)); spend1[fedspend_ss==1]<-2; spend1[fedspend_ss==2]<-0; spend1[fedspend_ss==3]<-1
spend2<-rep(NA,length(fedspend_schools)); spend2[fedspend_schools==1]<-2; spend2[fedspend_schools==2]<-0; spend2[fedspend_schools==3]<-1
spend3<-rep(NA,length(fedspend_scitech)); spend3[fedspend_scitech==1]<-2; spend3[fedspend_scitech==2]<-0; spend3[fedspend_scitech==3]<-1
spend4<-rep(NA,length(fedspend_crime)); spend4[fedspend_crime==1]<-2; spend4[fedspend_crime==2]<-0; spend4[fedspend_crime==3]<-1
spend5<-rep(NA,length(fedspend_welfare)); spend5[fedspend_welfare==1]<-2; spend5[fedspend_welfare==2]<-0; spend5[fedspend_welfare==3]<-1
spend6<-rep(NA,length(fedspend_child)); spend6[fedspend_child==1]<-2; spend6[fedspend_child==2]<-0; spend6[fedspend_child==3]<-1
spend7<-rep(NA,length(fedspend_poor)); spend7[fedspend_poor==1]<-2; spend7[fedspend_poor==2]<-0; spend7[fedspend_poor==3]<-1
spend8<-rep(NA,length(fedspend_enviro)); spend8[fedspend_enviro==1]<-2; spend8[fedspend_enviro==2]<-0; spend8[fedspend_enviro==3]<-1

# Now we can create an index variable by summing the new spendX variables.
spending<-spend1+spend2+spend3+spend4+spend5+spend6+spend7+spend8
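The same recode-then-sum pattern can be written more compactly with a small function. This is only a sketch on a toy dataframe; `recode_spend`, `toy`, `item_a`, and `item_b` are hypothetical names, not part of the course scripts.

```r
# Hypothetical helper: recode 1=increase, 2=decrease, 3=keep same
# into 2=increase, 1=keep same, 0=decrease. Any other value becomes NA.
recode_spend <- function(x) {
  out <- rep(NA_real_, length(x))
  out[x %in% 1] <- 2
  out[x %in% 3] <- 1
  out[x %in% 2] <- 0
  out
}

# Toy data with a missing answer (NA) and an invalid code (-9).
toy <- data.frame(item_a = c(1, 2, 3, NA, 1),
                  item_b = c(3, 3, 2, 1, -9))

# Apply the recode to every column, then sum across columns.
recoded <- sapply(toy, recode_spend)
index <- rowSums(recoded)   # NA whenever any item is missing
print(index)                # 3 1 1 NA NA
```

Leaving the index as NA when an item is missing is one design choice; rowSums(recoded, na.rm=TRUE) would instead treat missing items as zero, which changes the meaning of the score.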

# Let's look at another example of recoding.
married=0; married<-ifelse(dem_marital==1,1,0)

# Now let's build a least squares regression model to predict attitude toward spending.

Group Exercise: Multiple Least Squares Regression

Select a DV related to party affiliation or political view. (Remember, your DV must be numeric.) Identify at least five IVs that you believe will predict your DV. (Remember, your IVs must be numeric or binary.) Perform the hypothesis test and interpret the results. Discuss your model specification.

Post your work as a comment to this page. Include the names of all group members in your post.

An Example of Correlation and Regression

I've set up the most recent edition of the ANES. Let's look at an example.

> ANES2012<-read.csv("http://www.courseserve.info/files/ANES2012r.csv")
> attach(ANES2012)
> summary(ftgr_tea)
> plot(ftgr_rich,ftgr_tea)
> abline(lm(ftgr_tea~ftgr_rich))
> cor.test(ftgr_tea,ftgr_rich,method="pearson")
> summary(lm(ftgr_tea~ftgr_rich))
> summary(lm(ftgr_tea~ftgr_rich+ft_rep+ft_dem+ftgr_xfund+ftgr_bigbus+ftgr_fedgov+gender_respondent))

It is clear that linear regression is more informative than correlations, and that multiple least squares regression is a sophisticated modeling tool. We'll practice with some additional examples.

Some things to consider:

  • The DV must be numeric
  • The IVs (predictors) can be numeric or binary
  • Predictors should not be highly correlated (above 0.9, for example)
  • Model specification is very important

    I've created an R script to create a series of binary variables for race. To run the script, at the prompt, type:

    source("http://www.courseserve.info/files/SOCY7112racevectors.r")

    and hit ENTER.
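The caution above about highly correlated predictors can be checked before fitting a model. The sketch below uses simulated predictors (`x1`, `x2`, `x3` are hypothetical) and flags any pair whose correlation exceeds 0.9 in absolute value.

```r
# Simulated predictors; x2 is built to be nearly a copy of x1.
set.seed(5)
n <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)
x3 <- rnorm(n)
preds <- data.frame(x1, x2, x3)

# Correlation matrix of the predictors, ignoring missing values pairwise.
r <- cor(preds, use = "pairwise.complete.obs")
print(round(r, 2))

# Keep only the upper triangle so each pair is reported once.
high <- which(abs(r) > 0.9 & upper.tri(r), arr.ind = TRUE)
pairs_flagged <- data.frame(
  var1 = rownames(r)[high[, "row"]],
  var2 = colnames(r)[high[, "col"]],
  r    = r[high]
)
print(pairs_flagged)
```

When a pair is flagged, a common remedy is to drop one of the two variables or to combine them into an index.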

    Update to Examination 2

    I have deleted question 16 (about the table in Langford and Mackinnon) from the assignment. I should have moved the reading to the next section, where it would make more sense. I won't grade that question, so you can leave it blank.

    Group Exercise: Factorial Analysis of Variance

    We'll use the ANES for this exercise. Find a new dependent variable (numeric) to test for the effects of gender (V081101) and region (v081204). Perform the hypothesis test. Interpret the results.
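A sketch of a two-factor analysis of variance with an interaction term, run on simulated data. The variables `therm`, `gender`, and `region` are hypothetical stand-ins for the ANES variables, and the factors are assumed to be categorical (wrap numeric codes in factor() first).

```r
# Simulated stand-in data; not the ANES.
set.seed(9)
n <- 400
d <- data.frame(
  gender = factor(sample(c("male", "female"), n, replace = TRUE)),
  region = factor(sample(c("Northeast", "North Central", "South", "West"),
                         n, replace = TRUE))
)
d$therm <- 50 + 5 * (d$gender == "female") + 4 * (d$region == "South") +
  rnorm(n, sd = 20)

# gender * region expands to both main effects plus their interaction.
fit <- aov(therm ~ gender * region, data = d)
anova_table <- summary(fit)[[1]]
print(anova_table)

# Cell means help with interpreting any significant effects.
print(tapply(d$therm, list(d$gender, d$region), mean))
```

Each row of the table tests one effect: the two main effects and the gender-by-region interaction.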

    Post your response as a comment to this page. Remember to include the names of everyone in your group.

    Group Exercise: Analysis of Variance

    Pick a numeric DV (one of the Feeling Thermometers) and compare by Party Affiliation (V083097). Perform the hypothesis test.

    Group Exercise: Hypothesis Testing with Means

    Perform a one group or two group hypothesis test with means. State the research and null hypotheses. Produce the table with R and discuss the results.
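As a sketch of the two group case, the example below compares simulated scores for two hypothetical groups, `A` and `B`. The null hypothesis is that the two population means are equal; the research hypothesis is that they differ.

```r
# Simulated stand-in data; group labels and scores are hypothetical.
set.seed(21)
group <- factor(rep(c("A", "B"), each = 60))
score <- c(rnorm(60, mean = 50, sd = 10), rnorm(60, mean = 55, sd = 10))

# Group means, for the table of results.
group_means <- tapply(score, group, mean)
print(group_means)

# Two group t test; by default R does not assume equal variances (Welch test).
result <- t.test(score ~ group)
print(result)
```

For a one group test, supply a single variable and a population value instead, as in t.test(score, mu=50).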

    What would be a good follow-up hypothesis?

    Post your work as a comment to this page. Remember to include everyone's name.

    Group Exercise: Confidence Interval

    We can estimate the confidence interval for a mean using the t.test function. You need, of course, a numeric variable to calculate a mean. Identify the variable and a population value, mu, to test for (which should be derived from theory or make sense at face value). In the World95 data, we can use the percent population change variable to test for a population increase.

    > t.test(pop_incr, mu=0)

    This results in:

    One Sample t-test

    data: pop_incr
    t = 14.667, df = 108, p-value < 2.2e-16
    alternative hypothesis: true mean is not equal to 0
    95 percent confidence interval:
    1.455019 1.909752
    sample estimates:
    mean of x
    1.682385
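To see where the interval comes from, it can be rebuilt by hand as the mean plus or minus the critical t value times the standard error. The sketch below uses a simulated variable `x` (hypothetical, not the World95 data) and checks the result against t.test.

```r
# Simulated stand-in for a numeric variable.
set.seed(1)
x <- rnorm(109, mean = 1.7, sd = 1.2)

n      <- length(x)
se     <- sd(x) / sqrt(n)           # standard error of the mean
t_crit <- qt(0.975, df = n - 1)     # critical value for a 95% interval

# Confidence interval by hand: mean +/- t_crit * se
ci_manual <- mean(x) + c(-1, 1) * t_crit * se
print(ci_manual)

# The same interval as reported by t.test().
ci_ttest <- t.test(x, mu = 0)$conf.int
print(ci_ttest)
```

If the interval does not contain mu, the test rejects the null hypothesis at the 0.05 level, which is the case in the World95 output above.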

    We can also test for a population proportion, using the prop.test function. In this example, we'll use the variable Q2 from the CBS2011 dataframe. (This variable asks respondents if they think the country is going in the right direction, 1, or the wrong direction, 2.)

    We can see the table with the CrossTable function.

    > CrossTable(Q2)

    Which gives us:
    Cell Contents
    |-------------------------|
    | N |
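A minimal sketch of the prop.test step, using a simulated variable `q2` in place of Q2 from CBS2011. The coding (1 = right direction, 2 = wrong direction) follows the description above; the counts and the comparison value p=0.5 are assumptions for illustration.

```r
# Simulated stand-in for Q2; the real counts come from CBS2011.
set.seed(8)
q2 <- sample(c(1, 2), 900, replace = TRUE, prob = c(0.3, 0.7))

# Base R frequency table (CrossTable gives a more detailed version).
print(table(q2))

# Number answering "right direction" and the total number of answers.
n_right <- sum(q2 == 1)
n_total <- length(q2)

# Test the proportion against 0.5 and get a 95% confidence interval.
result <- prop.test(n_right, n_total, p = 0.5)
print(result)
print(result$conf.int)
```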

    Group Exercise: Odds Ratio

    Identify two binary variables in one of our dataframes. (You want to be able to test a causal relationship, so identify one as the potential cause and the other as the effect.) Calculate the odds ratio and Fisher test and interpret the results.
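A sketch of both calculations on a hypothetical 2x2 table. The counts are made up for illustration; with real data you would build the table with table(cause, effect).

```r
# Hypothetical counts: rows are the cause, columns are the effect.
tab <- matrix(c(30, 20,
                10, 40),
              nrow = 2, byrow = TRUE,
              dimnames = list(cause = c("yes", "no"),
                              effect = c("yes", "no")))
print(tab)

# Odds of the effect when the cause is present: 30/20 = 1.5
# Odds of the effect when the cause is absent:  10/40 = 0.25
odds_ratio <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])
print(odds_ratio)   # 6

# Fisher's exact test for the association in the table.
result <- fisher.test(tab)
print(result)
```

The odds ratio printed by fisher.test is a conditional maximum likelihood estimate, so it will usually differ slightly from the cross-product ratio computed by hand.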

    Codebooks:
    http://www.courseserve.info/files/ABC2010.pdf
    http://www.courseserve.info/files/CBS2011.pdf
    http://www.courseserve.info/files/ANES2008_Codebook.pdf

    Post your work as a comment to this page. Remember to include the names of everyone in your group.

    Group Exercise: Central Tendency and Variability

    Identify one numeric variable in one of our dataframes. Calculate the measures of central tendency and dispersion and interpret the results.
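The usual measures can be sketched on a small made-up vector; `x` here is hypothetical, so substitute a variable from your dataframe.

```r
# Hypothetical values, including one large outlier.
x <- c(2, 4, 4, 5, 7, 9, 30)

# Central tendency
x_mean   <- mean(x)     # 61/7, about 8.71
x_median <- median(x)   # 5

# Dispersion
x_sd    <- sd(x)
x_var   <- var(x)
x_range <- range(x)
x_iqr   <- IQR(x)

print(c(mean = x_mean, median = x_median, sd = x_sd, IQR = x_iqr))
print(x_range)
print(summary(x))
```

The gap between the mean and the median here comes from the single outlier, which is the kind of thing worth mentioning in an interpretation. With real data, add na.rm=TRUE to skip missing values.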

    Codebooks:
    http://www.courseserve.info/files/ABC2010.pdf
    http://www.courseserve.info/files/CBS2011.pdf
    http://www.courseserve.info/files/ANES2008_Codebook.pdf
    http://www.courseserve.info/files/world95.pdf
    http://www.courseserve.info/files/PUMS2000.pdf
    http://www.courseserve.info/files/USCounties.pdf

    Post your work as a comment to this page. Remember to include the names of everyone in your group.

    Group Exercise: Graphing

    Produce a scatterplot of two numeric variables and describe the relationship. (Bonus for adding the regression line to the graph. Attach your graph as a PNG or JPG file to your comment.)
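One way to produce the scatterplot, add the regression line, and save the result as a PNG is sketched below. The data are simulated, and the variable names and output path are hypothetical; this assumes your R installation supports the png() device.

```r
# Simulated stand-in data.
set.seed(4)
x <- rnorm(100, mean = 50, sd = 15)
y <- 10 + 0.6 * x + rnorm(100, sd = 10)

# Fit the line first so it can be drawn on the plot.
fit <- lm(y ~ x)

# Open a PNG file, draw the plot, then close the file.
out_file <- file.path(tempdir(), "scatter.png")
png(out_file, width = 600, height = 450)
plot(x, y, xlab = "Predictor", ylab = "Outcome",
     main = "Scatterplot with regression line")
abline(fit, col = "red")
invisible(dev.off())

print(out_file)
```

The file is written to a temporary folder here; replace out_file with a path of your choice to keep the image.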

    Codebooks:
    http://www.courseserve.info/files/ABC2010.pdf
    http://www.courseserve.info/files/CBS2011.pdf
    http://www.courseserve.info/files/ANES2008_Codebook.pdf
    http://www.courseserve.info/files/world95.pdf

    Post your work as a comment to this page. Remember to include the names of everyone in your group.

    Group Exercise: Percentage Tables

    To reinforce our discussion of levels of measurement and percentage tables, and to give you some more practice with R, we'll look one more time at crosstabulations.

    Use the Measurement script to guide your R commands:
    http://www.courseserve.info/files/SOCY7112meas.r

    The dataframe is:
    http://www.courseserve.info/files/CBS2011r.csv

    Codebook:
    http://www.courseserve.info/files/CBS2011.pdf

    In this exercise, I want you to produce two percentage tables. First identify a dependent variable that you want to explain. You must select a categorical variable, since we are using crosstabs. Produce a table in R for that variable and interpret the results in a sentence or two. Next, find an independent variable that you think might influence your dependent variable and run the two-way percentage table. Interpret the results.
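The two tables can be sketched in base R as follows. The Measurement script remains the guide for the actual commands; the dataframe `d` and the variables `party` and `direction` below are hypothetical.

```r
# Hypothetical categorical data.
d <- data.frame(
  party     = c("Dem", "Dem", "Rep", "Rep", "Dem", "Rep", "Dem", "Rep"),
  direction = c("right", "right", "wrong", "wrong",
                "right", "wrong", "wrong", "right")
)

# One-way percentage table for the dependent variable.
one_way <- round(100 * prop.table(table(d$direction)), 1)
print(one_way)

# Two-way table: dependent variable in the rows, independent variable in
# the columns, percentaged within each column (margin = 2).
two_way <- round(100 * prop.table(table(d$direction, d$party), margin = 2), 1)
print(two_way)
```

Percentaging within categories of the independent variable lets you compare across columns: here 75 percent of one group answers "right" against 25 percent of the other.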

    What can you say sociologically about the effect of your independent variable on your dependent variable?

    Post your response as a comment to this page. Include the names of all group members.
