Overview

An online calculator was created to help college faculty and K-12 teachers discern the adequacy of a sample size and/or response rate when interpreting student evaluation of teaching (SET) results. The online calculator can be accessed here: http://go.ncsu.edu/cvm-moe-calculator.

About the calculator

One of the most common questions consumers of course and instructor evaluations (also known as "Student Evaluations of Teaching") ask pertains to the adequacy of a sample size and response rate. Arbitrary guidelines (e.g., 50%, 70%, etc.) that underpin most interpretive frameworks are misleading and not based on empirical science. In truth, the sample size necessary to discern statistically stable measures depends on a number of factors, not the least of which is the degree to which scores deviate on average (standard deviation). As a general rule, scores that vary less (i.e., smaller standard deviations) will require a smaller sample size (and lower response rate) than scores that vary more (i.e., larger standard deviations). Traditional margin of error (MOE) formulas do not account for this detail; this MOE calculator is unique in that it computes an MOE with score variation taken into consideration. Other details about the formula also differ from traditional MOE computations (e.g., use of a t-statistic as opposed to a z-statistic, etc.) to make the formula more robust for educational scenarios in which smaller samples are often the norm. This MOE calculator is intended to help consumers of course and instructor evaluations make more informed decisions about the statistical stability of a score. It is important to clarify that the MOE calculator can only speak to issues relating to sampling quality; it cannot speak to other types of errors (e.g., measurement error stemming from instrument quality, etc.) or biases (e.g., non-response bias, etc.).

Persons interested in learning more about the MOE formula, and researchers reporting MOE estimates obtained from the calculator, should read/cite the following papers:

James, D. E., Schraw, G., & Kuch, F. (2015). Using the sampling margin of error to assess the interpretative validity of student evaluations of teaching. Assessment & Evaluation in Higher Education, 40(8), 1123-1141. doi:10.1080/02602938.2014.972338

Royal, K. D. (2016). A guide for assessing the interpretive validity of student evaluations of teaching in medical schools. Medical Science Educator, 26(4), 711-717. doi:10.1007/s40670-016-0325-9

Royal, K. D. (2017). A guide for making valid interpretations of student evaluations of teaching (SET) results. Journal of Veterinary Medical Education, 44(2), 316-322. doi:10.3138/jvme.1215-201r

Interpretation guide for course and instructor evaluation results

Suppose a course consists of 100 students (population size), but only 35 students (sample size) complete the course (or instructor) evaluation, resulting in a 35% response rate. The mean rating for the evaluation item "Overall quality of course" was 3.0 with a standard deviation (SD) of 0.5. Upon entering the relevant values into the MOE calculator, we see this would result in an MOE of 0.1385 when alpha is set to .05 (95% confidence level). In order to use this information, we need to do two things. First, include the MOE value as a ± value in relation to the mean. Using the example above, we can say with 95% confidence that the mean of 3.0 could be as low as 2.8615 or as high as 3.1385 for the item "Overall quality of course".
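The calculator's exact formula is documented in the papers cited above. As a rough illustration only, the sketch below shows one way an MOE of this kind could be computed, assuming a t-statistic with n - 1 degrees of freedom and a finite population correction; the function name margin_of_error is a hypothetical helper, not part of the calculator. Under those assumptions the sketch reproduces the values used in this worked example (0.1385 for an SD of 0.5, and 0.2769 for an SD of 1.0, discussed below), but the calculator itself should be treated as the authoritative implementation.

```python
# Illustrative sketch (not the calculator's official code): a t-based margin
# of error with a finite population correction, per the assumptions stated above.
from math import sqrt
from scipy.stats import t


def margin_of_error(sd, n, pop_size, alpha=0.05):
    """Margin of error for a mean rating from n respondents out of pop_size students."""
    t_crit = t.ppf(1 - alpha / 2, df=n - 1)   # t-statistic with n - 1 degrees of freedom
    std_err = sd / sqrt(n)                    # standard error of the mean
    fpc = sqrt(1 - n / pop_size)              # finite population correction
    return t_crit * std_err * fpc


print(round(margin_of_error(sd=0.5, n=35, pop_size=100), 4))  # 0.1385
print(round(margin_of_error(sd=1.0, n=35, pop_size=100), 4))  # 0.2769
```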
Next, in order to understand the MOE percentage, we must first identify the length of the rating scale and its relation to the MOE size. For example, if using a 4-point scale we would use an inclusive range of 1-4, where the actual length of the scale is 3 units (i.e., the distances from 1 to 2, 2 to 3, and 3 to 4). So, a 3% MOE would equate to 0.09 (3 category units x 3.00% = 0.09). Similarly, a 5-point scale would use an inclusive range of 1-5, where the actual length of the scale is 4 units. In this case, a 3% MOE would equate to 0.12 (4 category units x 3.00% = 0.12). Finally, we would refer to the interpretation guide (below) to make an inference about the interpretive validity of the score.

In the above example the MOE for the item "Overall quality of course" was 0.1385. If we are using a 4-point scale, this value falls between 0.09 and 0.15, which corresponds to 3-5% of the scale (this is good!). So, we could infer that the 35 students who completed the evaluation (the sample) constitute a sufficient sample, from a course consisting of 100 students (the population), to yield a statistically stable result for the item "Overall quality of course", as the margin of error falls within ± 3-5%.

Note: It is important to keep in mind that 35 students are adequate in this specific example because the scores deviated on average (standard deviation) by 0.5. If the standard deviation for the item was, say, 1.0, then 35 students would have yielded an MOE of 0.2769. This value greatly exceeds 0.15, indicating the MOE is larger than 5%, and would call into question the statistical stability of the score in this scenario.

For a 4-point rating scale:
Margin of Error | Margin of Error (%) | Interpretive Validity*
Less than 0.09 | Less than ± 3% | Excellent interpretive validity
Between 0.09 and 0.15 | Between ± 3% and 5% | Good interpretive validity
Greater than 0.15 | Greater than ± 5% | Questionable interpretive validity; values should be interpreted with caution

For a 5-point rating scale:
Margin of Error | Margin of Error (%) | Interpretive Validity*
Less than 0.12 | Less than ± 3% | Excellent interpretive validity
Between 0.12 and 0.20 | Between ± 3% and 5% | Good interpretive validity
Greater than 0.20 | Greater than ± 5% | Questionable interpretive validity; values should be interpreted with caution

*Please note the interpretation guide does not consist of rigid rules, but merely reasonable recommendations.

Example at NC State University:
Rakotz and colleagues (2017) recently published a paper describing a blood pressure (BP) challenge presented to 159 medical students representing 37 states at the American Medical Association's House of Delegates Meeting in June 2015. The challenge consisted of correctly performing all 11 elements involved in a BP assessment using simulated patients. Alarmingly, only 1 of the 159 (0.63%) medical students correctly performed all 11 elements. According to professional guidelines (Bickley & Szilagyi, 2013; Pickering et al., 2005), the 11 steps involved in a proper BP assessment are:

1) allowing the patient to rest for 5 minutes before taking the measurement;
2) ensuring the patient's legs are uncrossed;
3) ensuring the patient's feet are flat on the floor;
4) ensuring the patient's arm is supported;
5) ensuring the sphygmomanometer's cuff size is correct;
6) properly positioning the cuff over a bare arm;
7) no talking;
8) ensuring the patient does not use his/her cell phone during the reading;
9) taking BP measurements in both arms;
10) identifying the arm with the higher reading as being clinically more important; and
11) identifying the correct arm to use when performing future BP assessments (the one with the higher measurement).

All medical students involved in the study confirmed that they had previously received training in measuring blood pressure during medical school. Further, because additional skills are necessary when using a manual sphygmomanometer, the authors of the study elected to provide all students with an automated device so that skill with the auscultatory method would not be a factor in the testing process. The authors reported the average number of elements correctly performed was 4.1 (no SD was reported).

While the results from this study will likely raise concern among the general public, scholars and practitioners of measurement may also find these results particularly troubling. There currently exists an enormous literature regarding blood pressure measurement. In fact, there are academic journals devoted entirely to the study of blood pressure measurements (e.g., Blood Pressure Monitoring), and numerous medical journals devoted to the study of blood pressure (e.g., Blood Pressure, Hypertension, Integrated Blood Pressure Control, Kidney & Blood Pressure Research, High Blood Pressure & Cardiovascular Prevention, etc.). Further, a considerable body of literature discusses the many BP instruments and methods available for collecting readings, and the various statistical algorithms used to improve the precision of BP measurements. Yet, despite all the technological advances and sophisticated instruments available, these tools are likely of only limited utility unless health care professionals use them correctly. Inappropriate inferences about BP readings could result in unintended consequences that jeopardize a patient's health. In fact, research (Chobanian et al., 2003) indicates most human errors when measuring BP result in higher readings. These costly errors may therefore result in misclassifying prehypertension as stage 1 hypertension and beginning a treatment program that may be both unnecessary and harmful to a patient. The problem is further exacerbated when physicians put a patient on high blood pressure medication, as most physicians are extremely reluctant to take a patient off the medication because the risks associated with stopping are extremely high.
Further, continued use of poor BP measurement techniques could cause patients whose blood pressure is under control to appear uncontrolled, prompting escalated therapy that could further harm the patient. Until physicians can obtain accurate BP measurements, it is unlikely they can accurately differentiate those individuals who may need treatment from those who do not.

So, I wish to ask the measurement community: how might we assist healthcare professionals (and those responsible for their training) in correctly practicing proper blood pressure measurement techniques? What lessons from psychometrics can be parlayed into the everyday practice of healthcare providers? Contributing practical solutions to this problem could go a long way toward directly improving patient health and outcomes.

References

Pickering T, Hall JE, Appel LJ, et al. Recommendations for blood pressure measurement in humans and experimental animals part 1: blood pressure measurement in humans - a statement for professionals from the Subcommittee of Professional and Public Education of the American Heart Association Council on High Blood Pressure Research. Hypertension. 2005;45:142-161.

Bickley LS, Szilagyi PG. Beginning the physical examination: general survey, vital signs and pain. In: Bickley LS, Szilagyi PG, eds. Bates' Guide to Physical Examination and History Taking. 11th ed. Philadelphia, PA: Wolters Kluwer Health/Lippincott Williams and Wilkins; 2013:119-134.

Chobanian AV, Bakris GL, Black HR, et al. Seventh report of the Joint National Committee on prevention, detection, evaluation and treatment of high blood pressure. Hypertension. 2003;42:1206-1252.

Rakotz MK, Townsend RR, Yang J, et al. Medical students and measuring blood pressure: Results from the American Medical Association Blood Pressure Check Challenge. Journal of Clinical Hypertension. 2017;19:614-619.
Kenneth D. Royal and Melanie Lybarger

The topic of automation replacing human jobs has been receiving a great deal of media attention in recent months. In January, the McKinsey Global Institute (Manyika et al., 2017) published a report stating 51% of job tasks (not jobs) could be automated with current technologies. The topic of 'big data' and algorithms was also briefly discussed on the Rasch listserv last year and offered a great deal of food for thought regarding the future of psychometrics in particular. Several individuals noted that a number of automated scoring procedures are being developed and fine-tuned, each of which offers a great deal of promise. Multiple commenters noted the potential benefits of machine scoring using sophisticated algorithms, such as power, precision, and reliability. Some comments even predicted humans will become mostly obsolete in the future of psychometrics. Certainly, there is much to get excited about when thinking about the possibilities. However, there remain some issues that should encourage us to proceed with extreme caution.

The Good

For many years now, algorithms have played a significant role in our everyday lives. For example, if you visit an online retailer's website and click to view a product, you will likely be presented with a number of recommendations for related products based on your presumed interests. In fact, years ago Amazon employed a number of individuals whose job was to critique books and provide recommendations to customers. After the company developed an algorithm that analyzed data about what customers had purchased, sales increased dramatically. Although some humans were (unfortunately) replaced with computers, the 'good' was that sales skyrocketed for both the immediate and foreseeable long-term future and the company was able to employ many more people. Similarly, many dating websites now use information about their subscribers to predict matches that are likely to be compatible. In some respects, this alleviates the need for friends and acquaintances to make what are oftentimes awkward introductions between two parties, and to feel guilty if the recommendation turns out to be a bad one. The 'good', in this case, is the ability to relieve people who maintain relationships with both parties of the uncomfortable responsibility of playing matchmaker.

While the aforementioned algorithms are generally innocuous, there are a number of developments that futurists predict will change almost everything about our lives. For example, in recent years Google's self-driving cars have gained considerable attention. Futurists imagine a world in which computerized cars will completely replace the need for humans to know how to drive. These cars will be better drivers than humans: they will have better reflexes, enjoy greater awareness of other vehicles, and operate distraction-free (Marcus, 2012). Further, these cars will be able to drive closer together, at faster speeds, and will even be able to drop you off at work while they park themselves. Certainly, there is much to look forward to when things go as planned, but there is much to fear when things do not.

The Bad

Some examples of algorithmic failures are easy to measure in terms of costs. In 2010, the 'flash crash' occurred when a single mass sell order from a firm in Kansas, executed algorithmically, triggered a series of events that sent the Dow Jones Industrial Average into a tailspin. Within minutes, nearly $9 trillion in shareholder value was lost (Baumann, 2013).
Although stocks rebounded later that day, the episode generated enormous anxiety, fear, and confusion. Another example involving economics also incorporates psychosocial elements. Several years ago, individuals from numerous countries won lawsuits against Google after its autocomplete feature linked libelous and unflattering information to their names when those names were entered into the Google search engine. Lawyers representing Google stated, "We believe that Google should not be held liable for terms that appear in autocomplete as these are predicted by computer algorithms based on searches from previous users, not by Google itself" (Solomon, 2011). Courts, however, sided with the plaintiffs and required Google to manually change the search suggestions.

Another example involves measures that are more abstract, and often undetectable for long periods of time. Consider 'aggregator' websites that collect content from other sources and reproduce it for further proliferation. News media sites are some of the most common examples of aggregators. The problem is that media organizations have long faced allegations of bias. Cass Sunstein, Director of Harvard Law School's program on Behavioral Economics and Public Policy, has long discussed the problem of 'echo chambers', a phenomenon that occurs when people consume only the information that reinforces their views (Sunstein, 2009). This typically results in extreme views, and when like-minded people get together, they tend to exhibit extreme behaviors. The present political landscapes in the United States (e.g., Democrats vs. Republicans) and Great Britain (e.g., "Brexit", Britain leaving the European Union) highlight some of the consequences that result from echo chambers. Although algorithms may not be directly responsible for divisive political views throughout the U.S. (and beyond), their mass proliferation of biased information and perspectives certainly contributes to group polarization that may ultimately leave members of a society at odds with one another. Some might argue these costs are among the most significant of all.

The Scary

Gary Marcus, a professor of cognitive science at NYU, has published a number of pieces in The New Yorker discussing what the future may hold if (and when) computers and robots reign supreme. In a 2012 article he presents the following scenario:

Your car is speeding along a bridge at fifty miles per hour when an errant school bus carrying forty innocent children crosses its path. Should your car swerve, possibly risking the life of its owner (you), in order to save the children, or keep going, putting all forty kids at risk? If the decision must be made in milliseconds, the computer will have to make the call.

Marcus' example underscores a very serious problem regarding algorithms and computer judgments. That is, when we outsource our control, we are also outsourcing our moral and ethical judgment. Let us consider another example.
The Impermium corporation, which was acquired by Google in 2014, was essentially an anti-spam company whose software purported to automatically "identify not only spam and malicious links, but all kinds of harmful content—such as violence, racism, flagrant profanity, and hate speech—and allows site owners to act on it in real-time, before it reaches readers." As Marcus (2015) points out, how does one "translate the concept of harm into the language of zeroes and ones?" Even if such a technical operation were possible, there remains the problem that morality and ethics are hardly a universally agreed-upon set of ideals. Morality and ethics are, at best, a work in progress for humans, as cultural differences and a host of contextual circumstances present an incredibly complex array of confounding variables. These types of programming decisions could have an enormous impact on the world. For example, algorithms that censor free speech in democratic countries could spark civil unrest among people already suspicious of their government; an individual flagged as being in violation of an offense could have his or her reputation irreparably damaged, be terminated by an employer, and/or be charged with a crime. When we defer to computers and algorithms to make our decisions for us, we are trusting that they have all the 'right' answers. This is a very scary proposition given that the answers fed to machines come from data, which are often messy, out-of-date, subjective, and lacking in context.

An additional concern involves the potential to program evil into code. While it is certainly possible that someone could program evil as part of an intentional, malicious act (e.g., terrorism), we are referring to evil in the sense of thoughtless actions that affect others. Melissa Orlie (1997), expanding on the idea of "ethical trespassing" originally introduced by political theorist Hannah Arendt, discusses the notion of 'ordinary evil'. Orlie argues that despite our best intentions, humans inevitably trespass on others by failing to predict every possible way in which our decisions might impact them. Thoughtless actions and unintended consequences must, therefore, be measured, included, and accounted for in our calculations and predictions. That said, the ability to do this perfectly can never be achieved in most contexts, so it would seem each day would present a new potential to open Pandora's box.

Extensions to Psychometrics

Some believe the 'big data' movement and advances in techniques designed to handle big data will, for the most part, make psychometricians obsolete. No one knows for sure what the future holds, but at present that seems a somewhat unlikely proposition. First, members of the psychometric community are known for being extraordinarily meticulous with respect to not only the accuracy of information, but also the inferences made and the ways in which results are used. Further, it is apparent that the greatest lessons learned from previous algorithmic failures pertain to the unintended consequences, whether economic, social, cultural, political, or legal, that may result (e.g., glitches that cause stock market plunges, legal liability for mistakes, increased divisions in political attitudes, etc.). Competing validity conceptualizations aside, making earnest efforts to minimize unintended consequences is something most psychometricians take very seriously and already do.
If anything, it seems a future in which algorithms are used exclusively could only be complemented by psychometricians who perform algorithmic audits (Morozov, 2013) and think meticulously about identifying various 'ordinary evils'. Perhaps instead of debating whether robots are becoming more human or if humans are becoming more robotic, we would be better off simply appreciating and leveraging the strengths of both?

References

Baumann, N. (2013). Too fast to fail: How high-speed trading fuels Wall Street disasters. Mother Jones. Available at: http://www.motherjones.com/politics/2013/02/high-frequency-trading-danger-risk-wall-street

Manyika, J., Chui, M., Miremadi, M., Bughin, J., George, K., Willmott, P., & Dewhurst, M. (2017). A future that works: Automation, employment, and productivity. The McKinsey Global Institute. Available at: http://www.mckinsey.com/global-themes/digital-disruption/harnessing-automation-for-a-future-that-works

Marcus, G. (2012). Moral machines. The New Yorker. Available at: http://www.newyorker.com/news/news-desk/moral-machines

Marcus, G. (2015). Teaching robots to be moral. The New Yorker. Available at: http://www.newyorker.com/tech/elements/teaching-robots-to-be-moral

Morozov, E. (2013). To Save Everything, Click Here: The Folly of Technological Solutionism. PublicAffairs Publishing, New York, NY.

Orlie, M. (1997). Living Ethically, Acting Politically. Cornell University Press, Ithaca, NY.

Solomon, K. (2011). Google loses autocomplete lawsuit. Techradar. Available at: http://www.techradar.com/news/internet/google-loses-autocomplete-lawsuit-941498

Sunstein, C. R. (2009). Republic.com 2.0. Princeton University Press, Princeton, NJ.