The RTBF Corpus: a dataset of 750,000 Belgian French news articles published between 2008 and 2021

International Conference on Corpus Linguistics (JLC), Grenoble, France, 5 July 2023

Files

JLC_escouflaire_2023.pdf
  • Open Access
  • Adobe PDF
  • 1.08 MB

Details

Authors
Escouflaire, L.; Bogaert, J.; Descampe, A.; Fairon, C.
Abstract
In this paper, we introduce the RTBF Corpus, a large diachronic corpus of 767,204 Belgian French news articles published between 2008 and 2021 by the Belgian public service media organisation RTBF. We present the contents and structure of the corpus, along with the different layers of metadata available for each text. We also describe the three versions of each article available in the corpus, which differ in the cleaning and preprocessing steps applied to the text. The RTBF Corpus is freely available online in CSV format (https://dataverse.uclouvain.be/dataset.xhtml?persistentId=doi:10.14428/DVN/PEVSSI), for research and teaching purposes only.

Citations

Escouflaire, L., Bogaert, J., Descampe, A., & Fairon, C. (2023). The RTBF Corpus: a dataset of 750,000 Belgian French news articles published between 2008 and 2021. Actes des 11èmes Journées Internationales de la Linguistique de Corpus, 155-159. https://hdl.handle.net/2078.5/269192