Roee Shraga, assistant professor of computer science and data science at Worcester Polytechnic Institute (WPI), has received $175,000 from the National Science Foundation to scrutinize the human aspects of data discovery and integration. The research project aims to explore the critical role of human involvement in data preparation processes to identify and address biases that automated systems may fail to detect.
“We’ll be able to create the future framework that will be better for users,” Shraga said. “It will lead to better data, better data sets, and a better user interface for people searching for the data.”
Shraga said he will study the roles of humans as labelers, prompters, and validators in the rapidly growing AI space, drawing on cognitive psychology literature and technology to understand how humans think during data discovery and to uncover biases. Implicit biases of computer scientists, coders, and others who build AI platforms can influence the technology’s algorithms, resulting in undetected or unintended discrimination.
The two-year study will also look at how the emergence of large language models like ChatGPT may actually require more, not less, human involvement to ensure quality results. The concept, referred to as “human-in-the-loop,” examines how the human perspective fits into machine learning and generative AI systems.
A key focus of the grant is the “table union search,” a way for scientists to expand datasets by finding additional sources online. In healthcare, for example, a researcher may do a table union search for more aggregated or de-identified patient data to get more robust, reliable results.
These methods are not new, but the process often lacks follow-up to determine whether the additional data actually benefitted the user. A better process that combines “human-in-the-loop” interaction and artificial intelligence could give researchers more data they can use, Shraga said.
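One way to picture table union search is a matcher that scores candidate tables by how many of a query table’s columns they share, then appends the rows of strong matches. The Python sketch below is purely illustrative, assuming pandas DataFrames and using simple column-name overlap as the unionability signal; the names, threshold, and sample data are invented for the example, and the techniques studied in the project are far more sophisticated.

```python
# Minimal sketch of a naive "table union search" over pandas DataFrames.
# Illustrative only: real systems rely on richer semantic matching than
# the column-name overlap used here.
import pandas as pd


def unionability_score(query: pd.DataFrame, candidate: pd.DataFrame) -> float:
    """Fraction of the query table's columns that the candidate also has."""
    query_cols = set(query.columns)
    shared = query_cols & set(candidate.columns)
    return len(shared) / len(query_cols) if query_cols else 0.0


def table_union_search(query: pd.DataFrame,
                       candidates: dict[str, pd.DataFrame],
                       min_score: float = 0.5) -> pd.DataFrame:
    """Append rows from sufficiently unionable candidate tables to the query."""
    result = query.copy()
    for name, cand in candidates.items():
        score = unionability_score(query, cand)
        if score >= min_score:
            # Keep only the columns the query table expects; missing ones become NaN.
            aligned = cand.reindex(columns=query.columns)
            result = pd.concat([result, aligned], ignore_index=True)
            print(f"unioned '{name}' (score={score:.2f}, +{len(cand)} rows)")
    return result


# Example: expanding a small de-identified patient table with an external source.
patients = pd.DataFrame({"age": [54, 61], "diagnosis": ["A", "B"]})
external = {"registry": pd.DataFrame({"age": [47], "diagnosis": ["A"], "site": ["X"]})}
expanded = table_union_search(patients, external)
```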
Shraga said his study is also looking at how quirks of large language models, like their tendency to “hallucinate” or produce incorrect but plausible data, can actually be used to researchers’ advantage. The ability to generate realistic but not real data tables is important in applications where real data could have privacy implications.
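To illustrate what “realistic but not real” means in practice, the sketch below fabricates a small patient table from random values. Simple random sampling stands in here for the LLM-driven generation the article describes, and the column names and value ranges are assumptions made for the example.

```python
# Minimal sketch of synthetic-table generation: records that look plausible
# but correspond to no real patient, so the table can be shared more freely.
import random

import pandas as pd


def synthesize_patient_table(n_rows: int, seed: int = 0) -> pd.DataFrame:
    """Produce a realistic-looking but entirely fabricated patient table."""
    rng = random.Random(seed)
    diagnoses = ["hypertension", "diabetes", "asthma"]
    rows = [
        {
            "age": rng.randint(18, 90),
            "diagnosis": rng.choice(diagnoses),
            "systolic_bp": rng.randint(95, 180),
        }
        for _ in range(n_rows)
    ]
    return pd.DataFrame(rows)


synthetic = synthesize_patient_table(5)
print(synthetic)  # no row describes a real person
```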