Output by Orbrya
2026-03-31

Two Things Our AI Got Wrong in Episode 3 -- And One Thing It Got Right

Output Episode 3 changed one word in a RAND statistic and invented a shorthand for a researcher's framework. Here is what the data actually says and why catching both matters.

Episode 3 of Output covered metacognitive oversight: the three-part skill of planning, monitoring, and evaluating AI use that the research suggests most students never develop. The episode is accurate on the core framework. But two things are worth correcting before you pass any of this along to another parent.

What the episode said about RAND

Near the end of the episode, one host said this:

A 2025 survey from the RAND Corporation found that over 80% of students say their teachers have never explicitly taught them how to evaluate AI outputs.

The RAND finding is real and the 80% figure is accurate. The word that changed is "evaluate."

RAND's actual wording is: "Over 80 percent of students reported that teachers have not explicitly taught them how to use AI for schoolwork."

Use. Not evaluate.

That is a meaningful difference. RAND measured whether teachers taught students how to use AI tools at all: prompting, basic operation, appropriate contexts. It did not measure whether teachers taught students to evaluate, audit, or interrogate AI outputs specifically. No large-scale study has yet measured that more specific gap, which is one of the reasons Orbrya's curriculum is built around it.

The episode took a real data point and quietly upgraded it to support a more specific claim than the data makes. The statistic still supports the episode's argument: schools are failing to teach students what they need to know about AI. But the precise gap the research documents is about basic usage instruction, not evaluation instruction. Those are related but not the same.

How you would have caught it

The third verification question from Episode 1: what would change if it turned out to be wrong?

In this case, go to the RAND report directly and search for the exact wording around the 80% figure. The report text reads "how to use AI for schoolwork." The word "evaluate" does not appear in that finding. Four minutes of checking catches a one-word substitution that changes the specificity of a claim.
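For readers who like to see the check mechanically, here is a minimal Python sketch of the same idea. It is an illustration, not the process used to review the episode: the helper name word_set and the variable names are made up for this example, and the two strings are the wordings quoted in this post. It lists the words in the episode's paraphrase that never appear in RAND's sentence, so a person knows which words to look at more closely.

```python
import re

def word_set(text):
    """Lowercase the text and return its distinct words and numbers."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

# The claim as spoken in the episode (attribution clause left out).
episode_claim = (
    "over 80% of students say their teachers have never explicitly "
    "taught them how to evaluate AI outputs"
)

# RAND's wording as quoted in this post.
rand_wording = (
    "Over 80 percent of students reported that teachers have not "
    "explicitly taught them how to use AI for schoolwork."
)

# Words in the paraphrase that never appear in the source wording.
# These are candidates for a closer look, not automatic errors.
unsupported = sorted(word_set(episode_claim) - word_set(rand_wording))
print(unsupported)
```

The list it prints includes "evaluate" and "outputs" alongside harmless rewording such as "say" and "never". A script like this cannot tell which differences change the meaning. It only narrows down where to read the source carefully, which is still a human job.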

This is also a demonstration of why primary sources matter more than secondary summaries. Multiple outlets reporting on this RAND finding have paraphrased it in various ways, some of which drift further from the original than the episode did. The only reliable check is the report itself.

The F4R shorthand

The episode also introduced the shorthand "F4R" when describing Chahna Gonsalves' framework from King's College London. Gonsalves herself does not use this shorthand. Her paper identifies two types of critical thinking when working with AI: critical thinking for the assignment and critical thinking toward the AI. The episode's hosts invented F4R as a conversational abbreviation to make the concept more discussable without repeating the full phrase.

This is not an accuracy error in the same category as the RAND word swap. The underlying concept is correct. But if you search for "F4R Gonsalves" expecting to find her original work, you will not find it. The shorthand belongs to this episode, not to the research.

What the episode gets right

The Gonsalves distinction itself is accurate and well-explained. Critical thinking for the assignment versus critical thinking toward the AI is the correct framing from her 2024 paper in the Journal of Marketing Education. The Phung et al. finding about the 8% evaluation rate is presented with its scope caveats intact: 102 students, one Python course, not a universal finding. The Mollick jagged frontier data is correctly scoped to BCG consultants using GPT-4 in 2023. The three questions are practical and grounded in the metacognitive framework the episode describes.

The RAND word swap is a single word in a single sentence of an otherwise solid episode. It is worth noting precisely because the rest of the evidence base is handled carefully. One word matters when the word changes what a nationally representative survey actually measured.

Why we publish these

Output is produced using AI and reviewed by humans before publishing. The RAND word change is an example of the specific error type that human review should catch and sometimes does not: not a fabricated statistic, not a wrong number, but a plausible upgrade of a verified finding to a slightly stronger claim. The AI generated it fluently. It passed a surface reading. It required going to the source to catch.

That process -- going to the source to catch what a surface reading misses -- is what Episode 3 spent eighteen minutes teaching. The correction post is the demonstration.

If you want correction posts delivered alongside new episodes as they publish, the waitlist is at orbrya.com.

Sources

Doss, C.J. et al. (2025). AI Use in Schools Is Quickly Increasing but Guidance Lags Behind. RAND Corporation. https://doi.org/10.7249/RRA4180-1
Gonsalves, C. (2024). Generative AI's Impact on Critical Thinking: Revisiting Bloom's Taxonomy. Journal of Marketing Education. Sage Journals. https://doi.org/10.1177/02734753241305980
Phung, T. et al. (2025). Plan More, Debug Less. AIED Conference. https://arxiv.org/abs/2509.03171
Dell'Acqua, F., McFowland, E., Mollick, E. et al. (2023). Navigating the Jagged Technological Frontier. Harvard Business School Working Paper 24-013. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4573321

Output Episode 3 is available on the Orbrya YouTube channel. The paired blog post on metacognitive oversight is available on the Orbrya blog.