Data science teams are shifting their focus from model development to dataset development in order to deliver Machine Learning (ML) and Artificial Intelligence (AI) initiatives that are more performant, differentiated and aligned with business goals. This and other findings are available in the first Label Studio Community Survey, where data scientists, ML engineers and researchers from the global open source community shared insights into the state of ML and AI.
Key Findings in the Label Studio Community Survey
- Machine Learning and AI are becoming increasingly strategic: 73% of respondents said their organizations will invest more in ML/AI initiatives in the coming year.
- Data poses the biggest challenge to putting ML/AI models into production: 80% of respondents named accurately labeled data as one of the biggest challenges to getting ML/AI models into production (the top response), while 46% cited a lack of data (the second most common response).
- Data science teams now spend the majority of their time on dataset preparation, management and iteration, known as dataset development: 72% of respondents reported spending 50% or more of their time on data preparation, management and iteration, while more than one-third (34%) said they spend 75% or more of their time on data.
- Data preparation and labeling are becoming increasingly cross-functional: while most respondents hold the traditional roles of data scientist and data engineer, responsibility for data labeling is broad, requiring engagement across the organization, from interns to executives and business leaders. Notably, 20% reported that a mix of roles held responsibility for data preparation, including subject matter experts, who accounted for 5% of responses, and business analysts, who accounted for 3%.
Successful ML and AI applications rely on models trained on high-quality data. The 2022 Label Studio Community Survey explores the current state of the ML/AI ecosystem, with a focus on how teams approach data labeling, preparation and management as a key part of the pipeline.
The Label Studio Community Survey also dives into popular technology choices, finding that ML/AI workloads are primarily hosted on cloud offerings, while Hugging Face is the most popular source for pre-trained models.