GPT-4 simulates people well enough to replicate social science experiments

Demo: Predicting social science experimental results using LLMs

Luke Hewitt*, Ashwini Ashokkumar*, Isaias Ghezae, Robb Willer

This demo accompanies the paper Prediction of Social Science Experimental Results Using Large Language Models and can be used to predict experimental treatment effects on U.S. adults. To manage the costs of hosting this demo publicly, it uses GPT-4o-mini rather than GPT-4.

FAQs

What does this tool do?

This tool uses Large Language Models (LLMs) to predict experimental treatment effects on survey outcomes for U.S. adult samples. Users select a dependent variable and one or more text-based treatment messages. Once you click Submit, the tool uses an LLM to simulate the responses of American participants in a randomized controlled trial (RCT), then displays the predicted treatment effect for each treatment. Note that this is a technical demo, not a substitute for conducting experiments with real human participants: it only predicts experimental results, and should be used as a complement to, rather than a replacement for, research with human participants.
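
To make the simulation step concrete, here is a minimal sketch of how one might simulate a two-condition experiment with an LLM, assuming the OpenAI Python SDK. The prompt wording, the PROFILES list, and the simulate_response and predicted_effect helpers are illustrative assumptions, not the demo's actual implementation.

    # Minimal sketch: simulate a two-condition experiment with an LLM.
    # Prompt wording and helpers are illustrative, not the demo's code.
    import random
    import re
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    PROFILES = [  # in practice, sampled to match U.S. adult demographics
        "a 34-year-old woman from Ohio who leans Democratic",
        "a 61-year-old man from Texas who leans Republican",
    ]

    def simulate_response(profile, message, question):
        """Ask the LLM to answer as one simulated participant (1-7 rating)."""
        stimulus = f'You just read this message: "{message}"\n\n' if message else ""
        prompt = (f"Imagine you are {profile}.\n\n{stimulus}"
                  f"{question}\nAnswer with a single number from 1 to 7.")
        reply = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        found = re.search(r"[1-7]", reply)
        return int(found.group()) if found else 4  # fall back to scale midpoint

    def predicted_effect(message, question, n=200):
        """Mean simulated outcome with the message minus mean without it."""
        treated = [simulate_response(random.choice(PROFILES), message, question)
                   for _ in range(n)]
        control = [simulate_response(random.choice(PROFILES), None, question)
                   for _ in range(n)]
        return sum(treated) / n - sum(control) / n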

How accurate are the predictions?

We have conducted a series of large-scale assessments of the accuracy of predictions generated using LLMs, details of which are provided in the paper. Briefly:

  • For survey experiments (experiments with text-based treatments and measures), we found that our approach was approximately 70-80% accurate in predicting the direction of a contrast between two experimental conditions. In our assessment of survey experiments, predicted effect sizes were strongly correlated with actual effect sizes (r = .85).
  • For large, many-treatment survey experiments, we found that the accuracy of the LLM's predictions (r = .37) surpassed that of expert human forecasters (r = .25). (A toy illustration of both accuracy metrics follows this list.)
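
As a toy illustration (with invented numbers, not the paper's data), the two metrics above can be computed like this:

    # Directional accuracy (sign agreement) and Pearson r between predicted
    # and observed effect sizes; the numbers are made up for illustration.
    import statistics

    predicted = [0.21, -0.05, 0.40, 0.10, -0.12]  # LLM-predicted effects
    observed  = [0.30,  0.02, 0.35, 0.05, -0.20]  # effects from real RCTs

    direction_acc = sum((p > 0) == (o > 0)
                        for p, o in zip(predicted, observed)) / len(predicted)
    r = statistics.correlation(predicted, observed)  # Pearson r (Python 3.10+)
    print(f"directional accuracy = {direction_acc:.0%}, r = {r:.2f}")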

What can I use this for?

We believe that LLM-based simulation of experiments may have applications in several areas:

  • Intervention design. As LLMs can evaluate many treatment messages in very little time, they may help optimize the development of effective message-based interventions (e.g., to promote public health behaviors) by helping researchers narrow the field of messages to test in an RCT (a hypothetical screening loop is sketched after this list).
  • Minimizing harm to human participants. For research that involves potential risk to human participants (such as exposing subjects to misinformation in order to subsequently test the impact of an intervention), LLMs may be used to conduct a simulated test of an intervention before exposing any human participants.
  • Pilot testing of study materials. LLMs may help researchers pilot test study materials prior to launching experiments, thereby informing decisions about which materials to use.
  • Predicting subgroup effects. Our assessment did not reveal substantive differences in the model's predictive accuracy across racial, gender, and political subgroups in the US. However, the archive of experiments we used for testing did not include many significant heterogeneous treatment effects, making it difficult to rule out the possibility of biases in prediction. Further research is required to test whether biases in LLM predictions emerge where heterogeneous effects are present.
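
As a sketch of the intervention-design use case, one could rank many candidate messages by predicted effect and carry only the strongest forward to a real RCT. This assumes the hypothetical predicted_effect helper from the earlier sketch:

    # Hypothetical screening loop: rank candidate messages by predicted effect
    # and shortlist the strongest for a human-subjects RCT.
    candidates = [
        "Handwashing protects the people you love.",
        "9 out of 10 doctors recommend washing your hands.",
        "Washing your hands takes only 20 seconds.",
    ]
    question = "How likely are you to wash your hands more often this week?"

    ranked = sorted(candidates,
                    key=lambda msg: predicted_effect(msg, question),
                    reverse=True)
    shortlist = ranked[:2]  # test these two with real participants
    print(shortlist)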

Note that this is a technical demo, based on our evaluation of LLMs’ predictions for experiments conducted in the US. Studies are beginning to evaluate the strengths and limitations of using LLMs to simulate participants, including concerns about bias, risks of over-reliance, and misuse. For discussion, see [1][2].

Why are there no confidence intervals?

Because it is easy to generate extremely large samples of simulated participants, a confidence interval on the simulated treatment effects would be extremely narrow, yet such an interval would not capture the model's error in predicting human responses.
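
To make that concrete, here is the toy arithmetic (assuming, for illustration, an outcome SD of 1.0): the standard error of a simulated treatment effect shrinks as 1/sqrt(n), so the interval collapses as the simulated sample grows, while the model's error in predicting real humans stays fixed.

    # Toy arithmetic: the CI on a simulated effect collapses with sample size,
    # but says nothing about whether the LLM matches real human responses.
    import math

    sd = 1.0  # assumed SD of the outcome (illustrative)
    for n in (100, 10_000, 1_000_000):  # simulated participants per condition
        se_diff = sd * math.sqrt(2 / n)  # SE of a difference between two means
        print(f"n={n:>9,}: 95% CI half-width ~ {1.96 * se_diff:.4f}")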

Can I make predictions on demographic subpopulations?

Yes, this is accessible via the advanced menu. Note that this feature is still restricted to U.S. adults.

We recommend extra caution in interpreting estimated subpopulation effects. We find that predictions generated for members of racial, gender, and partisan groups in the US are similarly accurate. However, other work focused on simulating survey responses finds that LLMs’ responses are often biased against groups with less access to the internet or which are historically underrepresented or misrepresented in news or other media (see e.g. [1][3]).

Can I compare multiple treatments at once?

Yes, this feature is accessible via the advanced menu. You can type in multiple treatments or upload a .txt file with one treatment per line (up to 10 treatments).
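
For example, an uploaded file (a hypothetical treatments.txt) might contain:

    Handwashing protects the people you love.
    9 out of 10 doctors recommend washing your hands.
    Washing your hands takes only 20 seconds.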

What control group does the demo use?

All predicted treatment effects are relative to a “control group” that received no message. That is, we compare LLM-predicted outcome scores for simulated participants who read a treatment message versus those who read no message. For example, if simulated participants average 5.1 on a 7-point outcome after reading a message and 4.6 with no message, the predicted treatment effect is 0.5 scale points. If you wish to use a different control group, simply add it as an additional treatment (see the previous FAQ answer).

Are there usage restrictions?

This tool contains guardrails to prevent misuse. Our goal with these guardrails is to support scientific research uses, while minimizing the possibility that the tool can be used for socially harmful purposes (such as optimizing misinformation).

You can view the specific guardrails currently implemented here.

Can I use dependent variables other than the ones shown?

Yes, you may create your own dependent variables by selecting the Custom option in the topic dropdown. Note that the upper end of the scale should correspond to the intended direction of the treatment (for example, if a message is meant to increase support for a policy, the scale should run from opposition at the low end to support at the high end).
