By Charalambos (Charis) Poullis in data leakage — Apr 13, 2023

Data Leakage and Duplication in Benchmark Dataset

In a recent comprehensive analysis, we uncovered substantial data leakage, duplication, and annotation discrepancies in the popular CrowdAI Mapping Challenge benchmark dataset, which has been used extensively to develop algorithms for semantic segmentation and building footprint extraction from satellite imagery. These findings may call into question the validity of previous work that used this dataset for model development and evaluation.

To give a transparent overview of our findings, we have developed a user-friendly web platform that lets users visualize the extent of data leakage (93%) and duplication (90%) in this dataset. We invite you to explore the issue for yourself at: https://datainspector.app/

TL;DR: The widely used CrowdAI Mapping Challenge benchmark dataset has 90% duplication within the training split and 93% overlap between the training and validation splits.
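
To make the two percentages concrete, here is a minimal Python sketch of how rates like these could be computed. It is not the method used in our analysis: it assumes exact byte-level matching via SHA-256 as a simple stand-in, and the function names (`content_hash`, `duplication_rate`, `leakage_rate`) and the toy data are hypothetical. It also assumes that "duplication" means the fraction of training images that repeat content already seen in the training split, and that "leakage" means the fraction of validation images whose content also appears in the training split.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Hash raw image bytes; identical files yield identical hashes."""
    return hashlib.sha256(data).hexdigest()


def duplication_rate(train: list[bytes]) -> float:
    """Fraction of training items that repeat content seen elsewhere in train.

    Assumed definition: 1 - (unique items / total items).
    """
    if not train:
        return 0.0
    unique = len({content_hash(d) for d in train})
    return 1.0 - unique / len(train)


def leakage_rate(train: list[bytes], val: list[bytes]) -> float:
    """Fraction of validation items whose content also appears in train."""
    if not val:
        return 0.0
    train_hashes = {content_hash(d) for d in train}
    leaked = sum(content_hash(d) in train_hashes for d in val)
    return leaked / len(val)


# Toy stand-ins for image files (hypothetical data, not from the dataset).
toy_train = [b"img-a", b"img-a", b"img-a", b"img-b"]
toy_val = [b"img-a", b"img-c"]

print(f"duplication: {duplication_rate(toy_train):.0%}")
print(f"leakage: {leakage_rate(toy_train, toy_val):.0%}")
```

Exact hashing only catches byte-identical copies. Images that were re-encoded, cropped, or augmented would need a fuzzier comparison, such as perceptual hashing or annotation matching, which this sketch does not attempt.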