Featured Speaker
- Sarah Quesen, Director of Assessment, Research, and Development, WestEd
- Baron Rodriguez, Vice President of Client Facing Technology and Artificial Intelligence, WestEd
Host
- Danny Torres, Associate Director of Events and Digital Media, WestEd
Danny Torres:
All right, let’s get started. Hello, everyone, and welcome to the 25th session of our Leading Together series. In these 30-minute learning webinars, WestEd experts are sharing research and evidence-based practices that help bridge opportunity gaps, support positive outcomes for children and adults, and help build thriving communities. Today’s topic, “From AI Risk to AI Readiness.” We’ll be sharing a framework for responsible AI integration in public agencies. Our featured speakers today are Sarah Quesen, Director of Assessment, Research, and Development at WestEd, and Baron Rodriguez, Vice President of Client Facing Technology and Artificial Intelligence. Thank you all very much for joining us. My name is Danny Torres. I’m Associate Director of Events and Digital Media for WestEd. I’ll be your host.
Now, before we move into the contents of today’s webinar, I’d like to take a brief moment to introduce WestEd. As a non-partisan research, development, and service agency, WestEd works to promote excellence, improve learning, and increase opportunity for children, youth, and adults. Our staff partner with federal, state, and local agencies, providing a broad range of tailored services, including research and evaluation, professional learning, technical assistance, and policy guidance. Now, I’d like to pass the mic over to Baron. Baron, take it away.
Baron Rodriguez:
Thank you, Danny. Good day, everyone. On behalf of WestEd’s Data Integration Support Center, or DISC, welcome to our session, “From AI Risk to AI Readiness.”
For today’s brief chat, we’ll be discussing organizational readiness for AI. This includes assessing agency capacity, governance, techniques such as prompt engineering, protection of sensitive data, and financial and technical resource considerations.
Before we begin discussing all of the serious considerations with generative AI, let’s talk about the promise of the technology itself. We are already seeing great examples of use cases around operational efficiencies. These include the ability to conduct detailed initial reviews of previously very manual tasks, such as checking school improvement plans against established rubrics. To be clear, all use cases that I am aware of with this particular scenario still have human checks and sign-off, but the majority of the lift, a fairly manual process, has been essentially eliminated, saving hundreds of hours of public employee time. In one example, temp staff were brought in for this initial review; in future years, that step will no longer be part of the process.
Another particularly strong opportunity is around code development. Tools such as Claude Code or GitHub Copilot have had a substantial impact for already resource-constrained government technology teams. As with the school improvement plan use case, human review roles, what we call QA, or quality assurance, have become critical. Common areas to keep an eye on when using these tools for code development are ensuring proper connections to data sources and checking derived data.
One thing I’ve consistently heard from executive leadership groups in the sector is, “We’re getting behind. We need to quickly pivot to this new technology, or we’ll miss an opportunity.” But hold the train. A recent survey of state agencies from the National Association of State Chief Information Officers, or NASCIO, tells a different story, and you can see that up on your screen. Most state agencies are still developing responsible, ethical use policies, inventorying AI software, and evaluating potential AI use cases, as well as building the appropriate governance structures around them.
And if you look here, you can see that there are a variety of different areas that they are focusing on. And you may ask yourselves, “Why? Why not move as fast as the private sector or some local school districts?” The answer centers around a few particular areas: regulatory requirements, transparency, communication, quality concerns, and costs all have to be considered. For example, the free versions of generative AI tools such as Copilot or ChatGPT do not have the same level of consistency, features, or privacy protections as their enterprise counterparts, yet they are usually staff’s first experience with these tools. This can create an unrealistic expectation of whether the tool produces quality results, as well as an unrealistic self-certification of AI expertise. In highly regulated public agencies, the risks are too great to jump into these projects without first assessing the value, safety, and potential perception issues.
Let’s now discuss those risks that need to be considered. And as we go through these, do an honest self-assessment of where your organization is with each of these areas.
Legal implications: states such as Utah and North Carolina have established AI governance councils and executive leadership roles. New York enacted a law that requires state agencies to publish detailed information about their automated decision-making tools.
On the privacy and security side, commonly used agent-based protocols can maintain persistent credentials across multiple systems and may keep persistent data so that the AI agent can remember the individual and simplify future interactions. However, there are no commonly utilized procedures for individuals who don’t want that interaction stored in perpetuity, or who may want that interaction record deleted.
Training is always the biggest bang for your buck when it comes to organizational AI readiness. And remember, your leadership needs training as well. Don’t assume that they understand the risks, implications, and requirements involved. That’s a very common mistake.
Operational readiness, including openness to changing the way your organization operates, or even potentially changing the roles and types of staff needed to support these initiatives, is a consideration as well.
Data readiness is absolutely critical, and it could be its very own webinar topic; that would be an hour of conversation on its own. The short advice is, if you don’t have good-quality data, structured appropriately, with governance structures to support the work, don’t bother using AI. It’ll only get you a wrong answer faster.
Lastly, do you know what tools are being used in your agency? Have you done an inventory? What about the tools introduced by your existing vendors? They’re sticking AI onto pretty much every application now, so do you know what those features do and what the impact is for your agency?
So Sarah, let’s transition to what you are seeing around the strengths and weaknesses of AI capabilities.
Sarah Quesen:
Cool. Thanks, Baron. So let’s talk about what these AI models can do, ’cause I think if we understand what they’re good at and what they’re terrible at, then we’re already ahead of a lot of people trying to use these tools. And I’ll be the first to admit that I am tragically Gen X. I grew up with “Star Trek” and “The Jetsons.” And buying into the hype when these models first came out, I kind of wanted to yell, “Earl Grey, hot,” at the wall and get hot tea back. They can’t do sci-fi yet.
Our current models are great at pattern recognition, they can summarize texts, they can follow pretty complex instructions, and they’re great at giving a structured output. But they’re terrible at knowing what they don’t know. And the biggest problem that I have is, they’ll confidently make up facts, including numbers, math, and sources. I have a friend who was asked by a colleague to provide a paper, and it turned out to be a made-up citation from AI. And my friend is actually someone who knows how to program these local models. They struggle with math reasoning unless you’re very careful. And they lack access to current information; many pre-trained models, like the publicly available ChatGPT models, ended their training at a certain cutoff and are working from old knowledge.
So when we work with these realities rather than against them, and we face what they’re good at and what they’re bad at, we won’t ask AI to do things that it can’t do, and then we won’t get mad when it fails. That’s like being mad at a bicycle for not being a car, right? It’s a tool that does a thing. Let’s leverage the thing that it can do.
So the most overlooked piece, I think, of interacting with AI, and AI means a lot of things, but here I’m talking about public large language models, chatbots like ChatGPT, Claude, Gemini, Copilot, that type of tool. Most of our success with AI comes from how we ask the question. So prompt engineering sounds fancy, it has the word engineering in it, but really, it’s just learning to be specific and clear. It’s an art, really, of telling AI exactly what you want, how you want it, and oftentimes giving it some examples of what a good response would look like.
So when I think of these large language models, I think of them like graduate assistants or interns. You have to give very clear direction and then fact-check the output. But if you give the model good information, and it’s good at the task at hand, it’s extremely helpful and can be a huge time saver.
So, prompt engineering: provide clear instructions, be specific about what you want, and don’t make it guess. Tell it how to behave, what role it’s playing, and what rules to follow, and set the context up front. Oftentimes, I’ll say, “You’re an expert in statistics who is working on R code or some software code to,” dot, dot, dot, and then explain my problem. And then give some examples; show it what something good looks like, if that’s appropriate for your context. Sending examples along with the prompt is often called “few-shot learning”: you give it a few examples along the way. It’s like giving someone a sample before you ask them to do the work. It can shape the tone, the style, and the content, and it gives you better products.
So here’s a prompt engineering example. One of my use cases at WestEd, and I was one of the early pushers for bringing on this technology to solve some problems, was item writing. I work in assessment. Here is a prompt engineering “not good” and “much better.” “Write a fifth-grade science question.” What do you get back? Could be anything. The reading level could be wrong. The content might miss your standard. And oftentimes, the response is kind of trite and very stereotypical. But if we look at the specific version, we tell it, “Make it this type of question, align it with a standard, give me this many response options, make the wrong answers plausible, adjust the reading level. I wanna measure this thing.” It’s the same idea, I’m trying to get the same output, but the second prompt is just much better. It might take an extra minute or two to be specific about your prompting, but it will save you 20 minutes of editing garbage out of it and trying to go back and forth to get what you want.
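To make that concrete, here is a minimal sketch of the vague prompt versus the specific prompt as they might be sent to a chat-style model. The standard code (NGSS 5-PS1-1) and the sample item are illustrative placeholders, not from any state’s actual specifications.

```python
# Hypothetical illustration: a vague item-writing prompt vs. a specific one
# with a role, rules, and a few-shot example. The standard code and sample
# item below are placeholders, not real state specifications.

vague_prompt = "Write a fifth-grade science question."

specific_prompt = """\
You are an experienced K-12 science assessment item writer.

Write one multiple-choice item for grade 5 science.

Rules:
- Align the item to NGSS 5-PS1-1 (matter is made of particles too small to see).
- Provide exactly four options (A-D) with one correct answer.
- Make each wrong answer reflect a plausible student misconception.
- Keep the reading level at or below grade 5.
- End with the answer key and a one-sentence rationale.

Example of a good item (a few-shot example):
Q: Why can you smell bread baking from another room?
A. Tiny particles from the bread travel through the air.
B. The smell is a kind of light that passes through walls.
C. The heat pushes the loaf's scent as one solid piece.
D. Your nose creates the smell when you think about bread.
Answer: A. Smells are particles of matter too small to see.
"""

# Either string becomes the user message for any chat-style LLM API; the
# specific version pins down the standard, the format, and the reading level.
messages = [{"role": "user", "content": specific_prompt}]
print(messages[0]["content"])
```

The few-shot example at the end does a lot of the work: it shows the model the exact shape of a good response.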
So if you’ve gotten prompt engineering down, and there’s still more that you want from this model, remember that models trained on publicly available data obviously don’t have access to your stuff. If you would like a chatbot that can answer your questions or work with your data, the next option down the line for working with these models is called retrieval-augmented generation, or RAG. Don’t let the name scare you; it just means that you are giving the AI access to your specific documents so it can answer your specific questions. When someone asks a question, the system searches your knowledge base, finds the relevant chunks, and injects them into the prompt, so the AI can answer with your information and not just its training data. So for example, imagine a chatbot that can answer questions at your organization using your HR and corporate policies. If someone asks about sick leave, it can pull the actual HR policy and cite it in the answer. A publicly trained LLM could give you some generic advice about how sick leave works, but a RAG system can search your information to provide that answer. In our case, when we’re doing assessment development, I might include the state’s content standards, item writing guides, and specifications. A lot of states have topics to avoid and particular things about their state. And we can provide that to the item writing app to get better responses.
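Here is a minimal sketch of the RAG idea with the sick-leave example. The policy text is invented, and simple word overlap stands in for the vector-embedding search a production system would actually use.

```python
# Toy RAG sketch: retrieve the most relevant policy chunk, then inject it
# into the prompt. Production systems chunk real documents and retrieve by
# vector-embedding similarity; word overlap stands in for that here, and
# the policy text is invented for illustration.

KNOWLEDGE_BASE = [
    "Sick leave: full-time staff accrue 8 hours of sick leave per month.",
    "Travel: out-of-state travel requires director approval two weeks ahead.",
    "Remote work: employees may work remotely up to three days per week.",
]

def retrieve(question: str, chunks: list[str], k: int = 1) -> list[str]:
    """Score each chunk by words shared with the question; return the top k."""
    q_words = set(question.lower().split())
    ranked = sorted(
        chunks,
        key=lambda c: len(q_words & set(c.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(question: str) -> str:
    """Inject the retrieved chunks so the model answers from *your* policy."""
    context = "\n".join(retrieve(question, KNOWLEDGE_BASE))
    return (
        "Answer using ONLY the policy excerpts below, and cite the excerpt. "
        "If the answer is not in the excerpts, say so.\n\n"
        f"Policy excerpts:\n{context}\n\n"
        f"Question: {question}"
    )

print(build_prompt("How much sick leave do I accrue each month?"))
```

The assembled prompt then goes to the LLM, which answers from the injected policy text rather than its general training data.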
And I’ll add one more thing about assessment development before we get to fine-tuning. I’m a firm advocate of the idea that AI is a starting point and not an ending point. So I’m not recommending that you generate test content by just giving it a better prompt and then taking whatever it gives you. You still need a human touch; you just get a much better starting place.
The most advanced thing, but the thing that most people think of when they think, “Let’s spin up a large language model to do our thing,” is fine-tuning. That’s when we take a pre-trained model that knows general stuff, add our own curated training material, and get a specialized model that’s better at our specific task. The pre-trained model is like a skilled generalist, and fine-tuning trains it to become a specialist in exactly the kind of work you need done, and to do it consistently. It learns patterns from your examples and adapts to your style, your requirements, your edge cases. In assessment, you might see this with automated scoring of constructed-response items or essays. We give it a whole bunch of human-scored essays, and then it picks up on how to do the thing. It’s a mimic. This can be an expensive option, both in time and money, but if you need it, it’s well worth it. And just an asterisk here for more advanced folks on the call: this is a simplified version. There are lots of other types of models. I’m really sticking with the GPT-style pre-trained models, but there are local models and lots of different ways to go about this. I’m trying to get just to the basic big picture here.
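As a rough illustration of the essay-scoring example, here is what preparing fine-tuning data might look like, assuming an OpenAI-style chat fine-tuning format (a JSONL file of message lists). The essays, scores, and rubric reference are invented placeholders.

```python
# Hypothetical sketch: turn human-scored essays into fine-tuning examples,
# assuming an OpenAI-style chat format (one JSON object per line). The
# essays, scores, and rubric reference are placeholders.
import json

SCORED_ESSAYS = [
    {"essay": "The water cycle moves water from oceans to clouds to rain, "
              "and the sun's energy drives evaporation.", "score": 4},
    {"essay": "Water goes up and comes down sometimes.", "score": 1},
]

SYSTEM_PROMPT = "You are an essay scorer. Return a score from 1 to 4 per the rubric."

with open("train.jsonl", "w") as f:
    for row in SCORED_ESSAYS:
        example = {
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": row["essay"]},
                {"role": "assistant", "content": str(row["score"])},
            ]
        }
        f.write(json.dumps(example) + "\n")

# The JSONL file is what gets uploaded to the provider's fine-tuning job.
# The resulting model mimics the human scorers, which is why a large set of
# high-quality, human-scored examples matters so much.
```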
So when should you fine-tune? In my opinion, after you’ve exhausted the simpler approaches. A lot of folks come to me saying, “Hey, I wanna do this thing, and I need to train this model.” But really, they’re just terrible at prompting it. So after you’ve tried everything else, if it makes sense, I think fine-tuning is oftentimes a really good solution. In our case, we’re creating secure, standards-aligned reading passages for high-stakes summative assessments. That’s a specialized enough task, the quality bar is very high, and we have a lot of examples, so fine-tuning makes sense.
So here’s a decision tree, and Danny helped me make this look nicer, so thank you. I have it here so that if someone comes to you saying, “Hey, I wanna do this big AI thing,” you can walk them through what to consider before jumping straight to a custom model. Start at the top. Do you need a very specialized thing? Do you need your specific data? If it’s something where you can get there with good prompting, try that. If you need to access your own data, these RAG models are very good at that. And if it’s still not giving you what you need, and you have a repeatable task and a lot of high-quality training data, give fine-tuning a shot.
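The same decision tree can be written as a tiny function. This is a hypothetical sketch of the escalation path described above, not an official rubric.

```python
# Hypothetical sketch of the decision tree: escalate from prompting to RAG
# to fine-tuning only as the simpler option falls short.

def recommend_approach(prompting_suffices: bool,
                       needs_own_data: bool,
                       repeatable_task: bool,
                       has_quality_training_data: bool) -> str:
    if prompting_suffices:
        return "prompt engineering"  # cheapest option; always try it first
    if needs_own_data:
        return "RAG"  # ground answers in your own documents
    if repeatable_task and has_quality_training_data:
        return "fine-tuning"  # specialized and consistent, but costly
    return "reconsider whether AI is the right tool"

# Example: a task that mainly needs access to internal documents.
print(recommend_approach(prompting_suffices=False, needs_own_data=True,
                         repeatable_task=False, has_quality_training_data=False))
# -> RAG
```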
Okay, so that’s a lot about when we can use AI. When shouldn’t we use AI? In my opinion, almost always, in fact, I would say always, but I’m a statistician, so I never say always or never, except I just did, you should have a human overseeing these processes. Don’t rely on it for high-stakes decisions, for legal accountability, or for issues where you need human judgment. It can be biased. If they’re not carefully monitored, these models can perpetuate and amplify stereotypes. A model is trained to give you the most likely response, the most typical response, and sometimes that is not the response you’re looking for. Don’t use it when you can’t admit you’re using it; don’t clandestinely use it. If you can’t tell someone how decisions are being made, you shouldn’t be making them that way. Don’t use it when human relationships are the point, like crisis responses, IEPs, or counseling. Don’t use it with highly sensitive data without a thorough risk assessment. And if it’s prohibited by state or federal regulations, obviously don’t use it. That thorough risk assessment, and when to use it and when not to, is a good segue back to Baron, who’s gonna help us think this through.
Baron Rodriguez:
Thank you, Sarah. As you consider your use cases for AI, here are some pointers and considerations around privacy, ethics, and governance.
As discussed earlier, persistence of data is top of mind lately. Some other areas to consider are your vendor terms of service, review of AI outputs, accountability, and an AI incident plan, including what process will be used to notify appropriate parties in the event of incorrect data or a sensitive data privacy issue. What are your procedures for communicating to the public, especially if that’s required by your state? Does your state have opt-out options? Regardless, do you want to offer that? And with your specific AI implementation, is that even possible, for instance if it’s built into your vendor’s system? Training, staff testing, an output quality assurance plan and the staff resources to implement it, customization of tools or the model, potential limitations on what is called token utilization by tools, processing and license costs, communication strategies, input from impacted individuals, and lessons from failed experiments are all critical inputs to your AI implementation, training, and planning processes.
In many ways, building out the decision-making body’s scope, processes, and representation is your most critical step. I’ve been in conversations where the AI work began before any of the policies, procedures, or data were in place, and those efforts continue to be mediocre at best in organizations that begin with the technology first. The nice part is, all is not lost: you can dual-track some of these tasks, such as defining what problem you’re trying to solve with the technology. With the right set of leaders and experts providing input, you might even discover, after thorough analysis, that you shouldn’t use AI at all, but rather a more proven technology. Make sure you start with some easy wins; don’t start with the most complicated problem. Teams need to spend time getting familiar with how to operate AI tools effectively and developing a strong understanding of the strengths and weaknesses of the tool. Lastly, and very importantly, how will you evaluate the effectiveness of the tool, and by what measures?
The good news for you is that the Data Integration Support Center, DISC, at WestEd does have assistance available: strategic use-case development, assessment of AI capacity and readiness, AI regulatory landscape reviews, training and professional development, and facilitation of executive leadership AI strategy development.
On the next slide, we’ll have our contact information. And if you need assistance from DISC, or wanna follow up with either speaker today, you can take a look at that. But thank you for being here today. And back to you, Danny.
Danny Torres:
All right, well, thank you, Sarah and Baron, for a great session today. And thank you to all our participants for joining us. We really, really appreciate you being here. Please feel free to reach out to Sarah and Baron via email if you have any questions about the work we’ve discussed today. You can reach Sarah at [email protected], or you can reach Baron at [email protected]. To learn more about our Data Integration Support Center, or DISC, at WestEd, visit disc.wested.org, or you can scan the QR code displayed on the screen here. And you can check out recordings of our past Leading Together webinars online. We’ve covered a range of topics, including literacy, assessment, special education, mathematics, and other sessions on artificial intelligence. To access our Leading Together webinar recordings, visit us online at wested.org/leading-together. And finally, if you’re interested in learning more about WestEd and staying connected with us, you can sign up for WestEd’s email newsletter to receive updates on research, free resources, services, and more. Subscribe online at wested.org/subscribe, or scan the QR code displayed on the screen here. You can also follow us on LinkedIn and Bluesky. With that, thank you all very, very much. We’ll see you next time.