FA2024 DS/CS/AI Seminar ft. Prof. Athanassoulis | Oct 24 @ 11am, Higgins House | AI for Robust Data Systems
11:00 am to 12:00 pm
Sent on behalf of the FA2024 DS/CS/AI Seminar Series in partnership with the DS & AI Council
The FA2024 DS/CS/AI Seminar Series, in partnership with “Sips, Snacks, and Data Chats,” will take place on Thursday, October 24th, from 11:00 AM to 12:00 PM in the Great Hall of Higgins House. This week's seminar features Prof. Manos Athanassoulis (Boston University); please see the details of his presentation below. We would love for you to join us!
Title: AI for Robust Data Systems
Abstract: ML and AI components are being introduced into data systems, replacing or augmenting their traditional counterparts, for example in query optimization, indexing, and query evaluation. A common theme across such efforts is to use specific information about the workload and/or the execution environment to quickly tailor the system to it. In this work, we take a different spin on ML-augmented data systems. Specifically, we ask the question: "what if the workload or the setup information comes with some uncertainty?" In other words, we investigate the benefits we can achieve if the workload or the underlying setup turns out to be different from what we originally expected. We focus our attention on two main modules of a data system. First, we investigate tuning a storage engine that uses log-structured merge (LSM) trees as its core data structure. LSM-based storage engines are widely used today for write-intensive applications. We show that nominal tuning (which assumes accurate knowledge of the workload) leaves performance benefits on the table, and we propose a new framework for "robust tuning" that leads to much better results in the presence of uncertainty. We further discuss new research avenues opened by this work, including percentile optimization and learned cost models. Second, we investigate one of the most common operators: joins. We observe that state-of-the-art hash join algorithms are not designed to use information about key multiplicity (how many matches each key has with the other table) and thus work under the assumption that the best partitioning strategy is to create partitions of equal size. We show that, when key multiplicity is known, the optimal partitioning is not equi-sized, and we develop a practical algorithm that outperforms the state of the art for both in-memory and on-disk execution across various "shapes" of key multiplicity. If time permits, we will also briefly discuss the struggles of update-aware learned indexes with data sortedness.
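To give a flavor of the partitioning idea in the abstract, here is a toy Python sketch. It is an illustrative assumption, not the algorithm presented in the talk: it balances estimated join *work* (build keys weighted by their probe-side multiplicity) rather than raw partition size, using a simple greedy least-loaded heuristic. The function name and heuristic are both hypothetical.

```python
from collections import defaultdict

def multiplicity_aware_partitions(build_keys, probe_multiplicity, num_partitions):
    """Assign build-side keys to partitions so that estimated join output
    (rows weighted by probe-side multiplicity) is balanced, rather than
    balancing raw input size as an equi-sized partitioning scheme would.
    Greedy heuristic for illustration only."""
    # Place the most "expensive" keys first (highest match count).
    weighted = sorted(build_keys,
                      key=lambda k: probe_multiplicity.get(k, 0),
                      reverse=True)
    loads = [0] * num_partitions          # estimated work per partition
    assignment = defaultdict(list)        # partition id -> list of keys
    for k in weighted:
        p = loads.index(min(loads))       # least-loaded partition so far
        assignment[p].append(k)
        loads[p] += probe_multiplicity.get(k, 0)
    return assignment, loads

# With one very hot key, balanced *work* yields very unequal *sizes*:
mult = {0: 100, **{k: 1 for k in range(1, 10)}}
assignment, loads = multiplicity_aware_partitions(list(range(10)), mult, 2)
# The hot key sits alone in one partition; the other nine keys share the second.
```

Note how the resulting partitions have very different numbers of keys, which is exactly the departure from equi-sized partitioning that the abstract describes.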
Bio: Manos Athanassoulis is an Assistant Professor of Computer Science at Boston University, Director and Founder of the BU Data-intensive Systems and Computing Laboratory, and co-director of the BU Massive Data Algorithms and Systems Group. He also spent the past summer as Visiting Faculty at Meta. His research is in the area of data management, focusing on building data systems that efficiently exploit modern hardware (computing units, storage, and memory), are deployed in the cloud, and can adapt to the workload both at setup time and dynamically at runtime. Before joining Boston University, Manos was a postdoc at Harvard University. Earlier, he obtained his PhD from EPFL, Switzerland, and spent one summer at IBM Research, Watson. Manos’ work has been recognized by awards such as “Best of SIGMOD” in 2016, “Best of VLDB” in 2010 and 2017, “Most Reproducible Paper” at SIGMOD 2017, “Best Demo” at VLDB 2023, and “Distinguished PC Member” at SIGMOD 2018, 2023, and 2024. His research has been supported by multiple NSF grants, including an NSF CRII and an NSF CAREER award, as well as industry funds including a Facebook Faculty Research Award, multiple Red Hat Research Incubation Awards, and gifts from Cisco, Red Hat, and Meta.
Research Interests: data systems, LSM trees, storage, indexes, tuning data systems, SW/HW co-design for data-intensive systems, learned cost models