Categorical variable impact on success rate

 
henthreery



Joined: 22 Jul 2010
Posts: 3

Posted: Thu Jul 22, 2010 9:18 am    Post subject: Categorical variable impact on success rate

Hello,

First time posting here, and am excited to have discovered this community. I'm hoping to get some advice on a modeling problem I'm working on (not homework).

The goal is to measure the effect of different environments (a categorical variable) on the probability of successfully performing a task.

I have a few thousand subjects who were each initially assigned to one of about 20 environments, where they performed a task a varying number of times, yielding some number of successes. Each subject was then assigned to a different environment and repeated the task. We want to know what effect the environment has on the probability of success.

However, the number of task repetitions for each subject in an environment is not controlled, and ranges from 1 to 600 trials. Furthermore, the same subject performs a different number of trials in the second environment than in the first.

So I have data that looks like (N = trials, X = successes, Rate = X/N):

Code:

Subj#  Env1     N1  X1 Rate1     Env2   N2 X2 Rate2
1         A    100  20   .20        B  200 50   .25
2         A     80   8   .10        C  100 30   .30
3         D    120   6   .05        A   90  9   .10
4         B    400  40   .10        C  300 30   .10
5         C    150  30   .20        A   10  0   .00
[...]


Each subject is believed to have their own true level of skill in
performing the task, so the success rate of each subject in both
environments is correlated. But we don't really care about the
subject's true skill level, we're just interested in the effect of the
environment on the success rate.

The range of true skill levels in the population is believed to correspond roughly to a 0% to 20% success rate. It is not guaranteed that there will be subjects assigned to every possible pair of environments.

The logistic(-ish) regression model I'm trying to fit is:

logodds (Rate1) = logodds(Rate2) + Dj1 - Dj2

Where Dj is the difficulty rating of environment j. One environment
is considered to be the reference point (environment A in the data
above) and assigned a difficulty of 0, so the others' difficulties are
expressed relative to the reference point.

Questions:

1) How to handle observations with 0 successes, such as with subject
#5's second environment above, which you can't take the log of.

2) I think I would want to give more weight to subjects with more
trials (N) than others. But because of the differences between N1 and
N2, I'm not sure how much weight to give each subject. I've tried
weighting by the lesser of the N1 and N2, or by the average of the
two, but don't have a good justification for either one.

3) How should I handle the fact that there is measurement
uncertainty in both the dependent and independent variables (observed
success rates are a sample from true ability given the environment).

4) Is there a better way to try to model this? Pointers to
packages/functions in R that would be useful in dealing with this
model would be appreciated.

Thank you,
Henthreery
alethephant



Joined: 06 Sep 2006
Posts: 200
Location: Virginia Beach

Posted: Thu Jul 22, 2010 10:56 am    Post subject: Categorical impact on success rate

With respect to your dataset:

1. You have ~ 20 "environments" labeled "A", "B", etc.

2. You have many subjects, labeled by number.

3. Each subject is assigned to a first environment and repeats an unspecified task a number of times. The number of trials is denoted N, the number of successes X, and the "Rate" is X/N.

4. The subject is then assigned to a second environment, and repeats the same task again. You are using "1" and "2" to denote the two environments, with the order "1" first and "2" second in a crossover trial for that subject.

Is all of this correct?

If so, how did you assign environments to subjects and the crossovers?

So far your problem appears to be a standard two-arm crossover trial experiment with a binary outcome variable measured in replicate.

A key question is the carryover effect. Is there a learning curve for the activity you are measuring? I.e., do you expect better performance after completing the first set on the first environment, so that a better score will be obtained on the second environment, everything else being equal?

If a subset of subjects had been measured on the same environment twice, you could estimate the carryover effect from them.
henthreery



Joined: 22 Jul 2010
Posts: 3

Posted: Thu Jul 22, 2010 11:53 am

Hi, and thanks for the quick reply.

You are correct in your description of the data.

This was not a controlled experiment, and the data were simply collected "in the wild" from historical observations. Thus, the assignment of the second environment was determined by factors beyond my control, rather than with an experimental design in mind.

To a first approximation, the assignment could be considered non-uniformly random. I say non-uniformly because there are "clusters" of assignment pairs in the data: a subject in environment B might get moved to environment C 80% of the time, environment D 20% of the time, and environment A or E 0% of the time. I think this affects the balance of the data, hence my concerns about how to weight subjects.
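
One way to see these clusters is to cross-tabulate the first and second environments. Here is a minimal sketch in R that uses only the five example rows from my first post; the names `wide` and `pair_counts` are placeholders I made up for illustration.

```r
# The five example rows from the first post; `wide` is a placeholder name.
wide <- data.frame(
  Subj = 1:5,
  Env1 = c("A", "A", "D", "B", "C"),
  Env2 = c("B", "C", "A", "C", "A"),
  stringsAsFactors = FALSE
)

# Give both columns the same level set so the table is square.
envs <- sort(unique(c(wide$Env1, wide$Env2)))
pair_counts <- table(
  Env1 = factor(wide$Env1, levels = envs),
  Env2 = factor(wide$Env2, levels = envs)
)
print(pair_counts)

# Row proportions: where subjects starting in each environment go next.
print(round(prop.table(pair_counts, margin = 1), 2))
```

Cells with a zero count are pairs that never occur in the data. As far as I understand it, two environments can only be compared if they are linked, directly or through a chain of other environments, by subjects who were observed in both.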

We can assume the carryover effect & learning curve are washed out between assignments.

Thanks again for any advice or pointers on how to proceed.

Henthreery
alethephant



Joined: 06 Sep 2006
Posts: 200
Location: Virginia Beach

Posted: Thu Jul 22, 2010 5:05 pm    Post subject: Categorical impact on success rate

What you have is something similar to an incomplete blocks experiment. Subject is similar to "blocks", and is a random nuisance factor.

The optimal way to analyze the data is a generalized linear mixed model, i.e., logistic regression with a random Subjects factor.
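
Written out, one way to state such a model is the following. The notation is mine, not taken from your post: i indexes subjects, k = 1, 2 indexes the first and second assignment, j(i,k) is the environment subject i was in during assignment k, and the effect of the reference environment A is fixed at zero.

```latex
\[
X_{ik} \sim \mathrm{Binomial}(N_{ik},\, p_{ik}), \qquad
\operatorname{logit}(p_{ik}) = \beta_0 + \beta_{j(i,k)} + u_i, \qquad
u_i \sim \mathcal{N}(0, \sigma_u^2), \qquad \beta_A = 0,
\]
\[
\operatorname{logit}(p_{i1}) - \operatorname{logit}(p_{i2})
  = \beta_{j(i,1)} - \beta_{j(i,2)} .
\]
```

The second line follows because the intercept and the subject term u_i cancel when the two assignments of the same subject are subtracted. It has the same form as your logodds equation with Dj = beta_j, except that it is stated for the true probabilities rather than the observed rates. With the signs as you wrote them, a larger Dj corresponds to higher log-odds of success.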

You could use a fixed effect model, but

fit1 <- glm(cbind(X, N - X) ~ factor(Subj) + Env, data = yourdata, family = binomial(link = 'logit'))

will result in "Subj" with thousands of levels to be fitted. "Env" is okay as a fixed factor with 20+ levels. Note that the grouped data will be properly weighted in the fit, so you don't have to worry about it. You don't need to worry about X = 0 either.
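
This fit assumes one row per subject-environment pair with columns Subj, Env, N and X, whereas the table you posted has one row per subject. Here is a minimal sketch of the reshaping, using only your five example rows. The names `wide` and `long` are placeholders, and with so few rows the estimates themselves mean nothing; the point is only the data layout and that the X = 0 row goes through.

```r
# The five example rows from the original post, one row per subject.
wide <- data.frame(
  Subj = 1:5,
  Env1 = c("A", "A", "D", "B", "C"),
  N1   = c(100, 80, 120, 400, 150),
  X1   = c(20, 8, 6, 40, 30),
  Env2 = c("B", "C", "A", "C", "A"),
  N2   = c(200, 100, 90, 300, 10),
  X2   = c(50, 30, 9, 30, 0),
  stringsAsFactors = FALSE
)

# Stack the two assignments: one row per subject-environment pair.
long <- data.frame(
  Subj = factor(rep(wide$Subj, times = 2)),
  Env  = factor(c(wide$Env1, wide$Env2)),  # "A" sorts first, so it is the reference level
  N    = c(wide$N1, wide$N2),
  X    = c(wide$X1, wide$X2)
)

# Fixed-effect fit on the counts. The response is (successes, failures),
# so the row with X = 0 is just another binomial observation.
fit1 <- glm(cbind(X, N - X) ~ Subj + Env, data = long,
            family = binomial(link = "logit"))

# Environment coefficients: log-odds differences from environment A.
print(coef(fit1)[c("EnvB", "EnvC", "EnvD")])
```

Because the likelihood is built from the counts rather than from log(X/N), nothing ever takes the log of an observed zero rate, and each row contributes in proportion to its number of trials.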

Modeling "Subj" as a random factor is much, much better in this circumstance, but requires a GLMM program. In R,

library(MASS)   # provides glmmPQL
library(nlme)   # mixed-model engine used by glmmPQL
fit2 <- glmmPQL(cbind(X, N - X) ~ Env, random = ~ 1 | Subj, data = yourdata, family = binomial(link = 'logit'))

You should be aware that fitting mixed models, and generalized linear mixed models in particular, to large datasets is an area of current research, so you may run into software problems when you do your fits. In particular, expect that the fit may take a very long time to converge on the optimum solution.

The information you are interested in is the coefficient set of "Env", which will be differences from the Env = A effect. The "Subj" effect will be reported as a standard deviation of the effects of a population of subjects.
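
To show where those numbers live in the fitted object, here is a sketch on simulated data, since I don't have yours. Everything about the simulation is made up for illustration: 200 subjects, four environments, the effect sizes in `true_env`, a baseline log-odds of -2, and a subject standard deviation of 0.5.

```r
library(MASS)   # glmmPQL
library(nlme)   # fixef, VarCorr

set.seed(1)
n_subj   <- 200
envs     <- c("A", "B", "C", "D")
# Hypothetical log-odds shifts relative to environment A.
true_env <- c(A = 0, B = 0.3, C = -0.4, D = 0.6)

# Two different environments per subject, with unequal trial counts.
sim <- do.call(rbind, lapply(seq_len(n_subj), function(i) {
  data.frame(Subj = i, Env = sample(envs, 2), N = sample(20:300, 2),
             stringsAsFactors = FALSE)
}))
skill    <- rnorm(n_subj, mean = 0, sd = 0.5)   # subject random intercepts
eta      <- -2 + true_env[sim$Env] + skill[sim$Subj]
sim$X    <- rbinom(nrow(sim), size = sim$N, prob = plogis(eta))
sim$Subj <- factor(sim$Subj)
sim$Env  <- factor(sim$Env, levels = envs)

fit2 <- glmmPQL(cbind(X, N - X) ~ Env, random = ~ 1 | Subj,
                data = sim, family = binomial(link = "logit"),
                verbose = FALSE)

print(fixef(fit2))    # intercept (Env A) and differences from A, log-odds scale
print(VarCorr(fit2))  # the "(Intercept)" row holds the subject standard deviation
```

On your real data the same two calls, `fixef()` and `VarCorr()`, should give the environment differences and the subject standard deviation, provided Env is a factor with A as its first level.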

If you need more detailed help with the actual dataset,

1. Contact a local statistician.
2. Enroll in the Mixed and Hierarchical Models course at statistics.com and ask your questions of the instructor.
3. Buy some consulting at statistics.com.

Good luck!
henthreery



Joined: 22 Jul 2010
Posts: 3

Posted: Sat Jul 24, 2010 12:12 pm

This is great, and just what I needed! Thank you for a fantastic, detailed reply.
kevin84johnson



Joined: 25 Aug 2010
Posts: 3

Posted: Wed Aug 25, 2010 5:44 am

Thank you so much for the very informative information.. Nice post.. Thanks guys..