This is the abstract for Ambady, Nalini, and Robert Rosenthal. "Half a Minute: Predicting Teacher Evaluations From Thin Slices of Nonverbal Behavior and Physical Attractiveness." Journal of Personality and Social Psychology 64, no. 3 (1993): 431–441. https://doi.org/10.1037/0022-3514.64.3.431.
The accuracy of strangers' consensual judgments of personality based on "thin slices" of targets' nonverbal behavior were examined in relation to an ecologically valid criterion variable. In the 1st study, consensual judgments of college teachers' molar nonverbal behavior based on very brief (under 30 s) silent video clips significantly predicted global end-of-semester student evaluations of teachers. In the 2nd study, similar judgments predicted a principal's ratings of high school teachers. In the 3rd study, ratings of even thinner slices (6-s and 15-s clips) were strongly related to the criterion variables. Ratings of specific micrononverbal behaviors and ratings of teachers' physical attractiveness were not as strongly related to the criterion variable. These findings have important implications for the areas of personality judgment, impression formation, and nonverbal behavior.
The paper is accessible in multiple places on the web, e.g. here. The paper extended Nalini Ambady's dissertation research at Harvard, where Rosenthal was her advisor.
The careful study design is resistant to many of the criticisms to which "thin slice" research has been subjected since the work of Todorov and the "Blink" popularization. The work builds up over three incremental studies, specifically checks for attractiveness bias, and uses not only student evaluations but also a principal's evaluations to check for comparability. It does not attempt to thin the slices until the third study.
Ambady was bitten, however, by the assumption that student evaluations are an informative signal.
The criterion was end-of-the-semester student evaluations. Although student achievement (adjusted for student ability) might be the best possible criterion of effective teaching, it is very difficult to obtain such data. Teacher effectiveness in the real world is often evaluated solely on the basis of ratings of supervisors and students. Therefore, we used end-of-the-semester student ratings of teachers as a measure of teacher effectiveness. Considerable evidence supports the validity of student evaluations: Student ratings are consistent over time and across raters; correlate positively with expert, colleague, and administrator ratings; are independent of extraneous characteristics or characteristics of the students themselves; correlate significantly with how much students actually learn; and, last, do not change appreciably with greater age of the student rater and reflection by the student (Abrami, d'Apollonia, & Cohen, 1990; Centra, 1979; Cohen, 1981; Feldman, 1989a, 1989b; Howard, Conway, & Maxwell, 1985; Kulik & Kulik, 1974; Leventhal, Perry, & Abrami, 1977; Marsh, 1984; McKeachie, 1979; Trent & Cohen, 1973). Thus, student evaluations seem to be a valid means of evaluating teacher effectiveness. (432)
That stance was seriously challenged by the research of Boring, Ottoboni & Stark (2016), which used substantial data sets to argue the following:
Student evaluations of teaching (SET) are widely used in academic personnel decisions as a measure of teaching effectiveness. We [i.e. Boring, Ottoboni & Stark] show:
SET are biased against female instructors by an amount that is large and statistically significant.
The bias affects how students rate even putatively objective aspects of teaching, such as how promptly assignments are graded.
The bias varies by discipline and by student gender, among other things.
It is not possible to adjust for the bias, because it depends on so many factors.
SET are more sensitive to students’ gender bias and grade expectations than they are to teaching effectiveness.
Gender biases can be large enough to cause more effective instructors to get lower SET than less effective instructors.
These findings are based on nonparametric statistical tests applied to two datasets: 23,001 SET of 379 instructors by 4,423 students in six mandatory first-year courses in a five-year natural experiment at a French university, and 43 SET for four sections of an online course in a randomized, controlled, blind experiment at a US university.
This research appeared three years after Ambady's death in 2013.
Ambady and Rosenthal's work was also hampered by the winner's curse, which their threefold repetition of the study could not fully escape. (Note that only some of the concerns that Ioannidis expresses in his 2005 paper apply to Ambady's work, and that the suggestion to use larger studies is silly for a PhD undertaking in any case.)
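To make the winner's curse concrete, here is a minimal simulation sketch. It is not a model of Ambady and Rosenthal's data: the true effect, the per-study sample size, and the function name `simulate_winners_curse` are all hypothetical choices made for illustration. It assumes each small study yields a normally distributed estimate of one fixed true effect, and it compares the average of all estimates with the average of only those that cross a conventional one-sided significance threshold.

```python
import random
import statistics


def simulate_winners_curse(true_effect=0.2, n_per_study=15,
                           n_studies=20000, seed=0):
    """Simulate many small studies of one fixed true effect.

    All parameter values are illustrative assumptions, not figures
    taken from any of the papers discussed here.
    """
    rng = random.Random(seed)
    # Standard error of a mean of n unit-variance observations.
    se = 1 / n_per_study ** 0.5
    # Each study's estimate is the true effect plus sampling noise.
    estimates = [rng.gauss(true_effect, se) for _ in range(n_studies)]
    # Keep only the studies whose z-score exceeds 1.96 ("significant").
    significant = [e for e in estimates if e / se > 1.96]
    mean_all = statistics.mean(estimates)
    mean_significant = statistics.mean(significant)
    share_significant = len(significant) / n_studies
    return mean_all, mean_significant, share_significant


if __name__ == "__main__":
    mean_all, mean_sig, share = simulate_winners_curse()
    print(f"mean of all estimates:         {mean_all:.3f}")
    print(f"mean of significant estimates: {mean_sig:.3f}")
    print(f"share reaching significance:   {share:.3f}")
```

Under these assumptions, the average over all simulated studies sits close to the true effect, while the average over the significant studies is necessarily larger, because a small study can only clear the threshold when its estimate happens to be well above the true effect. Repeating a small study a few times reduces this selection effect but does not remove it.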