The ROBOT-TALK corpus for recognising the robotic origin of Spanish texts

Lara Alonso Simón; Ana María Fernández-Pampillón Cesteros

doi:10.21071/arf.v37i.18687

Published: Feb 26, 2026

Keywords:

Spanish text corpus, Comparable monitor corpus, Large language models, Authorship attribution

Lara Alonso Simón

Universidad Complutense de Madrid

Ana María Fernández-Pampillón Cesteros

Universidad Complutense de Madrid

https://orcid.org/0000-0002-6606-0159

Abstract

ROBOT-TALK is a comparable corpus of Spanish texts written by humans and their counterparts generated by large language models (LLMs). Its objective is to enable the study of linguistic features that may differentiate automatically generated texts from those written by humans. The corpus is a Spanish language resource for recognising human vs. “robotic” authorship of texts, and it is designed to (1) enable contrastive linguistic studies between LLMs and humans or between LLMs, (2) study the linguistic evolution of LLMs, and (3) support the creation of linguistic methods and computational tools for attributing human or automatic authorship. It contains written texts of three genres (scientific articles, news articles, and reviews). The two texts in each pair are of similar length and deal with the same topic, so that the two types of writing can be compared and the discursive characteristics of the texts can be reliably analysed. Samples are collected from gpt-3, text-davinci-003, babbage-002, curie, gpt-3.5-turbo, gpt-4, bloom, bard, gemini-2.0-flash, gemini-2.5-flash, falcon-180B-chat, Mixtral-8x7B-Instruct-v0.1, claude-3-5-sonnet-20240620, claude-3-7-sonnet-20250219 and DeepSeek-V3. The XML tagging of the texts in the corpus allows them to be queried with any text analysis tool that supports this markup standard. ROBOT-TALK has been used with the Sketch Engine tool to perform (1) a linguistic analysis to find the most salient features that characterise the texts generated by LLMs; (2) a statistical analysis of linguistic features specific to LLMs compared to a possible general human style in Spanish; (3) a forensic linguistic analysis to verify the reliability of authorship attribution; and (4) the construction of automatic binary and multi-class classifiers based on machine learning to distinguish between robotic and human texts.

How to Cite

Alonso Simón, L., & Fernández-Pampillón Cesteros, A. M. (2026). The ROBOT-TALK corpus for recognising the robotic origin of Spanish texts. Alfinge. Revista De Filología, 37, pp. 9–32. https://doi.org/10.21071/arf.v37i.18687

Issue

Vol. 37 (2025): Num. 37

Section

Monographs

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Proposed policy for journals that offer open access. Authors who publish in this journal accept the following terms:

Authors retain their copyright and grant the journal the right of first publication of their work, which is simultaneously subject to the Creative Commons Attribution License; this licence allows third parties to share the work provided that its author and its first publication in this journal are indicated.
Authors may enter into other non-exclusive licence agreements for distributing the published version of the work (e.g. depositing it in an institutional electronic archive or publishing it in a monographic volume), provided that its initial publication in this journal is indicated.
Authors are permitted and encouraged to disseminate their work online (e.g. in institutional electronic archives or on their website) before and during the submission process, as this can lead to productive exchanges and increase citations of the published work. (See The Effect of Open Access.)
