Update paper reference

mlcommons · Mar 6, 2024 · 2036c07
1 parent 271a205
Showing 1 changed file with 3 additions and 1 deletion.
diff --git a/docs/croissant-rai-spec.md b/docs/croissant-rai-spec.md
@@ -288,7 +288,7 @@ This section provides an overview of the various use cases that would be served
 
 ## Use case 1: The data life cycle (level: dataset)
 
-Key stages of the dataset life cycle include **motivation, composition, collection process, preprocessing/cleaning/labeling, uses, distribution, and maintenance**. Documenting RAI-related properties of the dataset can encourage its creators to reflect on the process and improve understanding for users. 
+Key stages of the dataset life cycle include **motivation, composition, collection process, preprocessing/cleaning/labeling, uses, distribution, and maintenance** [5]. Documenting RAI-related properties of the dataset can encourage its creators to reflect on the process and improve understanding for users. 
 
 Information generated throughout the cycle addresses different aspects requiring consideration for responsible data usage, including (1) information regarding who created the dataset for which purpose, (2) information when the dataset was created, (3) which data sources were used, (4) information on the versioning of the dataset with timestamps for each version (5) how the data is composed and if it contains noise, redundancies, privacy-critical information, etc. (6) how data was processed (e.g. also containing information on crowdsourcing - see use case 2), (7) how the data is intended to be used, (8) how the dataset will be maintained. In conjunction, properties for documenting the provenance and lineage of the datasets that are derived from revision, modification or extension of existing datasets are also relevant for this use case.
 
@@ -770,3 +770,5 @@ As the size of language models continues to increase, there is a growing demand
 [3] Aroyo, Lora, Alex Taylor, Mark Diaz, Christopher Homan, Alicia Parrish, Gregory Serapio-García, Vinodkumar Prabhakaran, and Ding Wang. "Dices dataset: Diversity in conversational ai evaluation for safety." Advances in Neural Information Processing Systems 36 (2024).
 
 [4] Laurençon, Hugo, Lucile Saulnier, Thomas Wang, Christopher Akiki, Albert Villanova del Moral, Teven Le Scao, Leandro Von Werra et al. "The bigscience roots corpus: A 1.6 tb composite multilingual dataset." Advances in Neural Information Processing Systems 35 (2022): 31809-31826.
+
+[5] Gebru, Timnit, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé Iii, and Kate Crawford. "Datasheets for datasets." Communications of the ACM 64, no. 12 (2021): 86-92.