Sunday, November 16, 2014

Maximum likelihood and other parameter estimations

First, a basic definition: A parameter is an unknown, fixed value that describes a characteristic of a population. For example, the mean describes the average value over a population. The true mean is usually not known, but rather estimated from the data.

When we fit a model to our data, we get parameters such as the regression coefficients (β's). In spatial stats, we use a (semi)variogram function to estimate the parameters range, sill, and nugget effect. An introduction to the semivariogram and its parameters may be found here.

A few common methods of parameter estimation used in spatial stats are least squares (ordinary least squares, OLS, or more commonly weighted least squares, WLS) and the likelihood-based methods (maximum likelihood, MLE, or restricted maximum likelihood, REML).

Least squares methods fit a model by minimizing the distance between the observed data and the fitted line. Likelihood-based methods use the observed data to estimate the population parameters under an assumed distribution. When you code a likelihood estimation, you supply starting values for the parameters and an underlying model. For example, for a spatial stats dataset, you would first investigate the semivariogram to get a starting value for the nugget effect parameter and to choose a model form (e.g. exponential or linear), and then fit the data via MLE.

In R, the likfit function in the package geoR fits models by likelihood-based methods (ML or REML). In the same package, variofit fits them by least squares (OLS or WLS).

Weighted least squares

... or Why can't we just be ordinary squares?

In fitting your linear model, you may be interested in generating a prediction line that describes the relationship between your predictor(s) and your outcome. If you have constant variance in the errors (homoskedasticity), an ordinary least squares (OLS) approach is used to fit the model to the data and generate a best fit line. A best fit line essentially minimizes the distance between the observed data and the predictions made by the model. If your data show constant variance in the errors AND the errors are normally distributed, then OLS is the maximum likelihood estimator.
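A quick base-R sketch of that last claim, using toy data I made up: fitting the line by maximizing the normal likelihood with optim() lands on essentially the same coefficients as lm().

```r
# Toy data with normal, constant-variance errors
set.seed(1)
x <- 1:20
y <- 2 + 3 * x + rnorm(20)

ols <- coef(lm(y ~ x))  # ordinary least squares fit

# Negative log-likelihood of a normal linear model;
# parameters are intercept, slope, and log(sd)
negloglik <- function(b) {
  mu <- b[1] + b[2] * x
  -sum(dnorm(y, mean = mu, sd = exp(b[3]), log = TRUE))
}

start <- c(mean(y), 0, log(sd(y)))  # crude starting values
mle <- optim(start, negloglik, method = "BFGS")$par

ols       # intercept and slope from lm()
mle[1:2]  # intercept and slope from the likelihood fit agree closely
```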

However, in spatial statistics (the analysis of data with a spatial component that considers spatial dependency) we often use data that violate the assumption of constant error variance (heteroskedasticity). In this case, we use weighted least squares (WLS) to fit the model to the data and generate a best fit line. In WLS, the error assumptions are that errors are normally distributed with mean vector 0 and nonconstant variance-covariance matrix σ²W, where W is a diagonal matrix. See this post from Penn State for a short intro to the nonconstant variance-covariance matrix.
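A minimal sketch in R, with simulated data and a variance structure I invented for illustration: when the error variance grows with the predictor, lm()'s weights argument fits WLS with weights proportional to the inverse of the error variance.

```r
# Simulated heteroskedastic data: error sd grows with x
set.seed(2)
x <- 1:50
y <- 5 + 2 * x + rnorm(50, sd = x / 5)

fit_ols <- lm(y ~ x)                     # assumes constant variance
fit_wls <- lm(y ~ x, weights = 1 / x^2)  # weights = 1/Var(error), up to a constant

coef(fit_ols)
coef(fit_wls)  # WLS downweights the noisy high-x points
```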


Interpreting linear models in R

If you're new to R and stats, check out this awesome post over at the yhat blog. It walks you through everything, from the code to the analysis, in simple, straightforward language with code output.

On residuals:
If our residuals are normally distributed, this indicates the mean of the difference between our predictions and the actual values is close to 0 (good) and that when we miss, we're missing both short and long of the actual value, and the likelihood of a miss being far from the actual value gets smaller as the distance from the actual value gets larger.

On variable p-values:
Probability the variable is NOT relevant. You want this number to be as small as possible. If the number is really small, R will display it in scientific notation. In our example, 2e-16 means that the odds that parent is meaningless are about 1 in 15000000000000000.

On R-squared:
Metric for evaluating the goodness of fit of your model. Higher is better with 1 being the best. Corresponds with the amount of variability in what you're predicting that is explained by the model.

On the F-test and resulting F-stat:
This takes the parameters of our model (in our case we only have 1) and compares it to a model that has fewer parameters (sic). In theory the model with more parameters should fit better. If the model with more parameters (your model) doesn't perform better than the model with fewer parameters, the F-test will have a high p-value (probability NOT significant boost). If the model with more parameters is better than the model with fewer parameters, you will have a lower p-value.
All quoted text from post "Fitting & Interpreting Linear Models in R" by yhat, published May 18, 2013 at http://blog.yhathq.com/posts/r-lm-summary.html
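To see all of those pieces in one place, here's a small example using R's built-in mtcars data (my choice of dataset, not yhat's):

```r
fit <- lm(mpg ~ wt, data = mtcars)  # regress fuel economy on car weight
s <- summary(fit)

s$coefficients  # estimates, standard errors, t values, and p-values
s$r.squared     # proportion of variability in mpg explained by wt
s$fstatistic    # F-stat comparing this model to the intercept-only model
```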

Thursday, April 11, 2013

Changing the aspect ratio of a scatter plot in STATA

This took me forever to figure out, because I kept searching for terms like "change width" which directs you to line width code. Finally I found a snippet of code for xsize and searching that led me to this handy tutorial on aspect ratios and other fun scatter plot techniques.

My code:

twoway (scatter mathach newid, ylabel(-10(5)35)) (scatter meanmathach newid, connect(1) clcolor(red) sort), ysize(3) xsize(5)

Per the ysize(3) and xsize(5) options at the end of the code, the resulting graph will be drawn with a 3:5 (height to width) aspect ratio.

Tuesday, March 5, 2013

Fun with GLM links

I can never keep this straight: When to use link(log) vs. link(logit) in generalized linear models:

Use link(log) when you are after relative risks.
Use link(logit) when you are after odds ratios.
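The same distinction shows up in R's glm() via the family/link arguments. A hedged sketch with a simulated cohort (the exposure and risk numbers are made up): exponentiating the coefficient gives an odds ratio under the logit link and a relative risk under the log link.

```r
# Simulated cohort: 30% risk of disease if exposed, 15% if not
set.seed(3)
exposed <- rbinom(500, 1, 0.5)
sick <- rbinom(500, 1, ifelse(exposed == 1, 0.30, 0.15))

or_fit <- glm(sick ~ exposed, family = binomial(link = "logit"))
rr_fit <- glm(sick ~ exposed, family = binomial(link = "log"))

exp(coef(or_fit)[["exposed"]])  # odds ratio
exp(coef(rr_fit)[["exposed"]])  # relative risk (true value here is 2)
```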

Friday, February 8, 2013

Making notes in your code: STATA vs R

I'm used to coding in R at this point, so it took me a second to figure this out:

# This is a note in R code

vs.

* This is a note in STATA code

Other small things that are hard to get used to: in R, if your code spills into the next line, you just keep typing (I usually indent just to keep it looking clean); but in STATA, you have to include "///" at the end of an unfinished line of code to tell it to continue reading on the next line.

STATA stops running a do-file when it encounters an error, whereas R keeps going, throwing error messages when appropriate but following through to the end.

Tuesday, November 20, 2012

Simple vector analysis in R

First, create a vector:

ClassGrades <- c(94, 95, 91)
#defines a vector called "ClassGrades" by concatenating individual grades on the exam

Then, run your simple analysis

mean(ClassGrades)
#returns the mean on the exam

hist(ClassGrades)
#produces a histogram of the exam grades
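A few more one-liners that work on the same vector, all base R:

```r
ClassGrades <- c(94, 95, 91)

summary(ClassGrades)  # min, quartiles, median, mean, max
sd(ClassGrades)       # standard deviation of the grades
length(ClassGrades)   # how many grades are in the vector
```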

Thursday, August 9, 2012

Paste special and transpose in Excel

There are a few ways to do this, but if you're looking for a non-macro shortcut, try this:

Select the cells you want to copy. Ctrl-C to copy. Select the cells where you want to paste. Holding down Alt, type E S E. Choose the options you want in the paste special box (i.e. check values, transpose). Click OK.

Many thanks to MrExcel.com (John) for this tip! He also spells out the macros if you want to go that route here.

Wednesday, August 8, 2012

Separating a text string to cells, columns in Excel

I received a string of numbers in an email and needed to manipulate them, so I copied them to a text document and imported the text into Excel 2010 (under the "data" tab, click on Get External Data: From Text). I chose the "Delimited" option instead of the default "Fixed width", clicked "Next", chose "Delimiters: Space" instead of the default "Tab" (note that you can choose multiple delimiter options), clicked "Next", set the column formats, and clicked "Finish".

Source

Thursday, January 26, 2012

Pissing off Heidi Klum

"Essentially, all models are wrong, but some are useful"
- George EP Box, Empirical Model-Building and Response Surfaces