DS Ph.D. Dissertation Proposal | Dongyu Zhang | Tuesday, Nov. 28th @ 10:00AM
MA
United States
DATA SCIENCE
Ph.D. Dissertation Proposal
Dongyu Zhang, Ph.D. Candidate
Tuesday, Nov. 28th, 2023 | 10:00AM - 12:00PM EST
Location: Campus Center, Mid-Century Room
Dissertation Committee:
Dr. Elke A. Rundensteiner, Worcester Polytechnic Institute, Advisor
Dr. Xiangnan Kong, Worcester Polytechnic Institute
Dr. Nima Kordzadeh, Worcester Polytechnic Institute
Dr. Liang Wang, Visa Research
Title: Learning with Incomplete, Inaccurate, and Multi-level Labeled Data
Abstract:
Deep learning models excel at a wide range of tasks but require large amounts of accurately labeled data. Unfortunately, acquiring high-quality labels is costly and requires domain expertise, so datasets tend to have missing or noisy labels. Additionally, data may be labeled at multiple levels. For instance, when detecting foodborne illness incidents from tweets, the tweet-level task is to predict whether a tweet indicates an illness, while the word-level task is to identify relevant slots such as location or food group. Both levels may have missing and noisy labels, and label quality and completeness can vary across levels.
This dissertation explores three directions for handling incomplete, noisy, and multi-level labeled data. Direction 1 aims to learn from two-level datasets in which one task has complete labels and the other has incomplete labels. We propose a novel solution that learns the tasks at both levels jointly and balances the fully labeled task against the incompletely labeled one. Direction 2 focuses on learning from data with noisy labels. We propose a method that uses the Local Intrinsic Dimensionality (LID) score to detect and correct noisy labels. Direction 3 aims to learn from two-level labeled data that exhibits both incomplete and noisy labels. We plan to exploit the relationship between the tasks and to integrate weak labels obtained from Large Language Models (LLMs) to improve performance.
To validate the effectiveness of our proposed methods, we have conducted preliminary experimental studies on real-world domains, comparing them with state-of-the-art methods. The results show that our proposed methods outperform state-of-the-art methods across these label-related challenges.
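For readers unfamiliar with the Local Intrinsic Dimensionality (LID) score named in Direction 2, the following is a minimal sketch of the commonly used maximum-likelihood LID estimator, computed from nearest-neighbor distances. It is an illustration only: the abstract does not say how the proposed method computes or applies LID, so the function name `lid_mle`, the neighborhood size `k`, and the toy data are all assumptions, and the detection and correction steps are not shown.

```python
import numpy as np


def lid_mle(points, k=20, eps=1e-12):
    """Estimate the LID of every row of `points` from its k nearest neighbours.

    Uses the maximum-likelihood estimator
        LID(x) = -1 / mean_i( log(r_i(x) / r_k(x)) ),  i = 1..k,
    where r_i(x) is the distance from x to its i-th nearest neighbour.
    """
    points = np.asarray(points, dtype=float)
    n = points.shape[0]
    if not 2 <= k < n:
        raise ValueError("k must satisfy 2 <= k < number of points")
    # Pairwise Euclidean distances, shape (n, n).
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    # Sort each row; column 0 is a point's zero distance to itself, so skip it.
    dists.sort(axis=1)
    r = np.maximum(dists[:, 1:k + 1], eps)
    mean_log_ratio = np.log(r / r[:, -1:]).mean(axis=1)
    # The mean is <= 0; clamp it away from 0 to avoid division by zero.
    return -1.0 / np.minimum(mean_log_ratio, -eps)


# Toy data (hypothetical): a 2-D plane embedded in 10-D versus a full 10-D cloud.
rng = np.random.default_rng(0)
flat = np.zeros((400, 10))
flat[:, :2] = rng.uniform(size=(400, 2))
full = rng.normal(size=(400, 10))

lid_flat = lid_mle(flat, k=20)
lid_full = lid_mle(full, k=20)
print("mean LID, 2-D plane in 10-D:", lid_flat.mean())
print("mean LID, 10-D Gaussian:   ", lid_full.mean())
```

On this toy data the points lying on the embedded plane should receive lower LID estimates than the points filling all ten dimensions, which is the kind of signal a LID-based method could monitor in a learned representation.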