Files

tipecs_jlc.pdf
  • Open Access
  • Adobe PDF
  • 500.21 KB

Details

Authors
Bogaert, J.; Escouflaire, L.; de Marneffe, M.-C.; Descampe, A.; Standaert, F.-X.; Fairon, C.
Abstract
We present TIPECS ("Train, Infer Predictions, Explain, Clean, Start again"), a corpus cleaning method that combines machine learning with manual analysis. The approach aims to remove tokens or segments that a classification model, trained on a given dataset for a given task, treats as discriminative features but that do not generalize to other similar tasks or datasets.
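
The loop named in the acronym can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the classifier, the explanation method, or the stopping rule, so everything below is an assumption made for illustration. The model is a toy smoothed log-odds scorer over tokens, the explanation is the summed contribution of each token to the predicted class, and the manual analysis step is stood in for by a hypothetical `is_spurious` callback. The example corpus and labels are invented.

```python
import math
from collections import Counter


def train(docs, labels):
    """Train: per-token log-odds between two classes (add-one smoothing)."""
    classes = sorted(set(labels))
    assert len(classes) == 2, "this sketch handles binary tasks only"
    counts = {c: Counter() for c in classes}
    for tokens, label in zip(docs, labels):
        counts[label].update(tokens)
    vocab = set(counts[classes[0]]) | set(counts[classes[1]])
    totals = {c: sum(counts[c].values()) + len(vocab) for c in classes}
    weights = {
        tok: math.log((counts[classes[1]][tok] + 1) / totals[classes[1]])
        - math.log((counts[classes[0]][tok] + 1) / totals[classes[0]])
        for tok in vocab
    }
    return classes, weights


def predict(model, tokens):
    """Infer predictions: a positive summed weight selects the second class."""
    classes, weights = model
    score = sum(weights.get(t, 0.0) for t in tokens)
    return classes[1] if score > 0 else classes[0]


def explain(model, docs, predictions, top_k):
    """Explain: rank tokens by total contribution toward the predicted class."""
    classes, weights = model
    contributions = Counter()
    for tokens, pred in zip(docs, predictions):
        sign = 1.0 if pred == classes[1] else -1.0
        for t in tokens:
            contributions[t] += sign * weights.get(t, 0.0)
    ranked = sorted(contributions.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tok for tok, _ in ranked[:top_k]]


def tipecs_loop(docs, labels, is_spurious, top_k=3, max_rounds=5):
    """Train, infer, explain, clean, start again (assumed stopping rule:
    stop when the reviewer flags none of the top-ranked tokens)."""
    removed = set()
    for _ in range(max_rounds):
        model = train(docs, labels)
        predictions = [predict(model, d) for d in docs]
        top_tokens = explain(model, docs, predictions, top_k)
        # Stand-in for manual analysis: a reviewer decides which features
        # are discriminative here but would not generalize.
        flagged = {t for t in top_tokens if is_spurious(t)}
        if not flagged:
            break
        removed |= flagged
        # Clean: drop the flagged tokens from every document.
        docs = [[t for t in d if t not in flagged] for d in docs]
    return docs, removed


# Invented toy corpus: the source name leaks the label.
corpus = [
    ["reuters", "market", "rises"],
    ["reuters", "stocks", "fall"],
    ["afp", "team", "wins"],
    ["afp", "match", "lost"],
]
topics = ["econ", "econ", "sport", "sport"]
cleaned, removed = tipecs_loop(corpus, topics, lambda t: t in {"reuters", "afp"})
```

In this toy run the source names are the strongest features for both classes, so the simulated reviewer flags them in the first round and the second round ends the loop. In practice the reviewing step is a qualitative judgment by a human analyst, which a callback can only approximate.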

Citations

Bogaert, J., Escouflaire, L., de Marneffe, M.-C., Descampe, A., Standaert, F.-X., & Fairon, C. (2023). TIPECS: A corpus cleaning method using machine learning and qualitative analysis. Actes des 11èmes Journées Internationales de la Linguistique de Corpus, pp. 160-164. https://hdl.handle.net/2078.5/269398