Engineering Metrics Now Include 'Artifact Evaluation'
Engineering productivity measurement is expanding to include more rigorous validation of outcomes. Major software engineering conferences now decouple artifact evaluation from paper acceptance and judge artifacts on reproducibility and real-world usability. Platform teams are following a similar path, moving beyond DORA (DevOps Research and Assessment) metrics to evaluate the quality and impact of engineering work more directly.
- The practice of artifact evaluation was first introduced by the software engineering community at the ESEC/FSE conference in 2011 to validate research claims and encourage reusable artifacts.
- Conferences like ETAPS 2026 now award badges based on ACM guidelines, such as "Artifacts Evaluated," "Artifacts Available," and "Results Validated," to recognize the quality and accessibility of the accompanying materials.
- The move beyond DORA is driven by its limitations, as the metrics can lack business context, oversimplify complex work, and encourage a focus on speed over quality and customer value.
- To get a more holistic view of engineering productivity, organizations are adopting frameworks like SPACE, which stands for Satisfaction and Well-being, Performance, Activity, Communication and Collaboration, and Efficiency and Flow.
- AI is significantly enhancing SRE and DevOps workflows by reducing alert noise by 40-60% and cutting Mean Time to Recovery (MTTR) by 50-70% through intelligent monitoring and faster root cause analysis.
- AI agents are shifting operations from being reactive to proactive by learning from historical incidents, detecting anomalies before thresholds are breached, and suggesting automated remediations.
- The adoption of artifact disclosure in top-tier software engineering conferences has seen a significant increase, with the percentage of publications including artifacts growing from approximately 60% in 2017 to over 81% in 2022.
- A 2023 study on the reproducibility of software defect artifacts found that over 62% of them broke at least once over a 13-month period, highlighting the challenges of maintaining long-term usability and the importance of formal evaluation.