On Thursday, October 18, 2018, the Broadway League released its latest report on the demographics of the Broadway audience during the 2017-2018 season. One number in the report and press release caught many people’s attention: “The average annual household income of the Broadway theatregoer was $222,120.” The New York Times said that this made the audience “quite affluent.” One commenter on Deadline’s news story said that this meant Broadway attendance “has become an elitist activity.” Several people on Twitter were shocked at this number.

It’s not hard to see why this number raised eyebrows. A household income of $222,120 exceeds the annual income of 95.5% of households in the United States. That would mean that the average Broadway audience member isn’t a member of “the 1%,” but they’re almost that rich.

But a closer look at the Broadway League’s numbers shows that the picture is a little more complex than that. 56.6% of audience members have a household income less than $150,000. By my estimate, about 66.3% earn less than $200,000. This would mean that the average income is higher than the income of two-thirds of the Broadway audience. To the untrained eye, something strange is happening in the data. What is it?

Pie chart showing the household income distribution of Broadway audience members in the 2017-2018 season. Source: The Demographics of the Broadway Audience, 2017-2018, p. 30.

The answer is in the distribution of income levels. The Broadway League reported that distribution as a pie chart, pictured at left. This shows that the Broadway audience is certainly more affluent than the American public at large (where 50% of households earn $56,000 or less annually). But if over half the audience earns less than $150,000 (as indicated by adding up the percentages of the bottom five income groups), it might be confusing to learn that the Broadway League believes that their audience members’ average household income is $222,120.

A chart that I created—called a histogram and shown below—clarifies the issue. Like a bar graph, each bar on the histogram represents the percentage of people who are in a category. On this histogram, the categories are income groups. You will notice that some income groups in the histogram don’t have bars represented in them. This is because the Broadway League’s income categories on the survey do not represent equal intervals of income. Notice how, in the legend in the pie chart, the first four categories span income groups of $25,000 each, but then the ranges of income for each category are larger for higher income levels.

Histogram showing the household income distribution of Broadway audience members in the 2017-2018 season. Data derived from The Demographics of the Broadway Audience, 2017-2018, p. 30.

The histogram, though, spaces out categories in $50,000 increments. This shows how most audience members have income levels bunched up at the lower end of the distribution (over one-third earn less than $50,000 annually), while a small percentage occupies the upper echelons of income. Statisticians call this pattern of bars on a histogram positive skewness.

Positive skewness distorts averages because sample members who are very far from the typical sample member have a disproportionate influence on the average and pull it up towards them. That top 5% of Broadway audience members (who earn $1 million or more per year), and to a lesser extent the other high-income groups, have so much mathematical influence on the average that the final estimate for the average ($222,120) ends up much higher than the typical audience member’s household income. The fact that skewness distorts averages is a well-known artifact of skewed data, and I explain it in my introductory statistics textbook.
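A small numerical sketch makes the pull of the upper tail concrete. The ten incomes below are invented purely for illustration (they are not the Broadway League’s data), and the variable names are my own; the point is only that a couple of very large values can drag the mean far above the median.

```python
from statistics import mean, median

# Hypothetical household incomes in dollars (invented, NOT survey data).
# Most values are bunched at the low end; two sit far out in the right tail.
incomes = [40_000, 60_000, 75_000, 90_000, 110_000, 130_000, 160_000,
           250_000, 1_000_000, 3_000_000]

avg = mean(incomes)    # pulled upward by the two largest values
mid = median(incomes)  # middle of the sorted list, unaffected by the tail

# Count how many households fall below the mean.
below_avg = sum(1 for x in incomes if x < avg)

print(f"mean:   ${avg:,.0f}")
print(f"median: ${mid:,.0f}")
print(f"{below_avg} of {len(incomes)} households earn less than the mean")
```

In this toy sample the mean lands above eight of the ten households, which is the same kind of pattern seen in the Broadway League’s income figures.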

When data are skewed, a better measure of a “typical” sample member is the median, which you may remember from high school math as being the middle score. In the Broadway League’s income data, the median household income is somewhere between $100,000 and $150,000. By making a few statistical assumptions, I can estimate that the median household income is about $131,461. This corresponds to a household at the 84th percentile (i.e., top 16%) of household incomes. That’s well-off, but not obscenely wealthy.

Similar problems with skewed data are also apparent in the Broadway League’s data on Broadway show attendance. The average number of days that people purchased their tickets in advance was 43, but the median is about 14 days. Again, the average is distorted because a minority of people (mostly tourists and other people who have to plan in advance) purchase their tickets much earlier than the typical Broadway audience member. I can’t prove it, but I suspect a similar statistical phenomenon occurs when comparing average and median ticket prices. The average paid ticket price was $146 (for a ticket with face value of $123.07). But a minority of very expensive tickets is probably dragging that average up.

There are a few lessons in this example. The first is to be careful with averages because they may be distorted if the data are heavily skewed. Second, I wish that journalists would examine data reports, instead of reproducing a bullet point from a press release. Press releases are often oversimplified, and that can lead to misunderstandings. (Trust me, I’ve written some about my professional research, so I know how press releases sometimes condense a little too much.) More specifically, Broadway attendance is clearly within the reach of many Americans. While it would be nice for ticket prices to go down, they are not yet at the point where only the rich can afford to attend Broadway.

Methodological notes: The Broadway League does not describe exactly how they calculated an average income for their sample. What is most likely is that they took the midpoint of each income category, multiplied it by the number of respondents in that category, summed these products, and divided the sum by the total number of respondents. When I used a similar method (with percentages, instead of respondents, because I did not have exact respondent counts for each income category), I got an average income of $220,950. This is slightly lower than the Broadway League’s estimate, but the difference is almost certainly due to rounding error in the percentages. The Broadway League used this exact methodology for estimating the average number of Broadway performances that respondents had attended in the previous year. Therefore, it seems likely that this was how they calculated the average household income.
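The midpoint method described above is just a weighted average, and it can be sketched in a few lines. The function name `grouped_mean` and the three-category toy numbers are my own inventions for illustration; they are not the survey’s categories or percentages, and this is only a guess at the general approach, not the Broadway League’s documented procedure.

```python
def grouped_mean(midpoints, shares):
    """Weighted average of category midpoints.

    `shares` can be respondent counts or percentages; dividing by their
    sum means the weights do not need to add up to exactly 100.
    """
    total = sum(shares)
    return sum(m * s for m, s in zip(midpoints, shares)) / total

# Toy three-category example (invented numbers, NOT the survey's figures).
# The open-ended top category is represented by its lower bound.
toy_midpoints = [25_000, 75_000, 1_000_000]
toy_shares = [50.0, 45.0, 5.0]  # percent of respondents in each category

toy_mean = grouped_mean(toy_midpoints, toy_shares)
print(f"grouped mean: ${toy_mean:,.0f}")
```

Because the function divides by the sum of the weights, small rounding errors in published percentages shift the result slightly rather than breaking the calculation.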

For my median household income estimate of $131,461, I assumed that incomes within the $100,000-$149,999 category were evenly distributed. 38.8% of respondents were in lower income categories, and 17.8% were in the $100,000-$149,999 income category. To find the estimated median (which would be where 50% of respondents were equal to or below that value), it is necessary to find the point where the next 11.2% of individuals would be in the distribution of incomes (because 38.8% + 11.2% = 50.0%). This is found by dividing 11.2% by 17.8%, which produces a percentage of 62.9%. Thus, the overall median for the entire sample would be about 62.9% of the way into the $100,000-$149,999 income category, which is $131,461.
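The interpolation above can be reproduced with a short calculation. The function name `interpolated_median` is my own, and the sketch treats the category as running from $100,000 to $150,000 (a width of $50,000) with incomes spread evenly inside it, which is the assumption stated above.

```python
def interpolated_median(lower, upper, share_below, share_in):
    """Estimate the median by linear interpolation within one category.

    lower, upper : bounds of the category that contains the median
    share_below  : percent of respondents in all lower categories
    share_in     : percent of respondents in this category
    Assumes incomes are spread evenly between `lower` and `upper`.
    """
    fraction = (50.0 - share_below) / share_in
    return lower + fraction * (upper - lower), fraction

median_income, fraction = interpolated_median(100_000, 150_000, 38.8, 17.8)

print(f"{fraction:.1%} of the way into the category")
print(f"estimated median: ${median_income:,.0f}")
```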

One of the reasons there is some uncertainty with average and median income estimates is that the Broadway League collected categorical income data by asking people to check a box, instead of reporting their exact income. This is typical practice in survey research because most people don’t feel comfortable reporting exact income, even in an anonymous survey. But categorical data cannot be used to calculate averages, especially when categories are not evenly spaced. The methodology described in the previous paragraphs gets around this limitation by assuming that within each category income is distributed perfectly evenly. This is probably not completely realistic, but it probably doesn’t distort the result by much.

A bigger distortion is the fact that the top income category is labeled “$1,000,000 or More”. The Broadway League’s calculation methodology almost certainly assumes that all of these people’s households earn exactly $1 million annually. This is called censoring the data, and it leads to an underestimate of averages because it sets a limit on how much impact outliers can have on the average. It’s impossible to know how much censoring the data distorts the estimate for average income, but uncensored data could plausibly result in an average household income estimate of $320,950 (based on the assumption of a group midpoint of $3 million).
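The $320,950 figure follows from simple arithmetic on my midpoint-based average. In the sketch below, the 5% share for the top category and the $220,950 starting point come from the figures above; the $3 million midpoint is only an assumption, and the variable names are mine.

```python
midpoint_based_mean = 220_950   # my estimate with the top group set to $1 million
top_share = 0.05                # share of respondents in "$1,000,000 or More"
censored_value = 1_000_000      # value the censored calculation assigns to them
assumed_midpoint = 3_000_000    # ASSUMPTION: a plausible midpoint for the top group

# Swapping the top group's value changes the mean by that group's share
# multiplied by the change in the value assigned to it.
adjusted_mean = midpoint_based_mean + top_share * (assumed_midpoint - censored_value)

print(f"adjusted mean: ${adjusted_mean:,.0f}")
```

Choosing a different midpoint for the top group moves the result by 5% of the difference, so the estimate is quite sensitive to that one assumption.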

The Broadway League’s survey has other data collection methodology problems. One is that the question about how far in advance the respondent purchased their ticket does not consist of mutually exclusive categories. For example, both “1-2 weeks ago” and “2-4 weeks ago” are options. But if someone purchased their ticket exactly 2 weeks ago, it is not clear which category they should select. This makes the data hard to analyze. On the other hand, the sampling method is excellent, and with a 56% response rate for the surveys, the quality of the data is better than what is often seen in survey research.