Alejandro Mosquera | Computer Scientist

About

Alejandro Mosquera is a Computer Scientist and online safety expert developing systems and methods for automatic Threat Hunting: malicious network traffic analysis, malware analysis, unknown threat categorization, messaging abuse filters, APT detection and attack chain inference based on Machine Learning.

Research

I enjoy developing software with great colleagues, and I've been fortunate to have worked with many wonderful and talented people. As a researcher, my job usually involves:

Applying quantitative methods to solve complex data problems involving risk scoring, entity and user behaviour analysis, etc.
Translating product ideas into data science problems, and solving them.
Prototyping tools and data pipelines to extract meaningful insight from innovative sources of data.

Other research areas of interest are Natural Language Processing, procedural generation and Trustworthy AI.

In particular:

Identifying and investigating failure modes for AI systems, and building solutions to address them.
Conducting empirical or theoretical research into technical safety and security mechanisms for AI systems.
Evaluating AutoNLP techniques for the accurate and efficient detection of unsafe content.

Personal

Lover of coffee, Earl Grey and LEGO (in alphabetical order). Sometimes I blog. You can also find me participating in competitive Machine Learning challenges during my spare time.

Highlights

2024 - 🏆3rd place (out of 50 teams) at SemEval-2024 Task 6: SHROOM - a Shared-task on Hallucinations and Related Observable Overgeneration Mistakes

2023 - 🥇1st place at CSCML CTF: The International Symposium on Cyber Security, Cryptology and Machine Learning

2023 - 19th place (out of 2681 teams) at HackAPrompt (AICrowd, FlanT5-XXL only): A prompt hacking competition to outsmart LLMs and evade prompt injection defenses

2023 - Top 10% at Stable Diffusion - Image to Prompts (Kaggle): Evaluating Prompt Stealing Attacks Against Text-to-Image Generation Models

2023 - 15th place (out of 84 teams) at SemEval-2023 Task 10: Pretrained Models with Adversarial Training for Online Sexism Detection (EDOS)

2022 - 11th place (out of 18 teams, beating the MNTD baseline with a 43% higher AUC) at Trojan Detection Challenge @ NeurIPS 2022

2022 - 56th place / 2nd best score (out of 676 teams) at AI Village CTF @ DEFCON 30

2022 - 🥇1st place (out of 10 teams) at KONVENS-2022 Task 1: Tackling Data Drift with Adversarial Validation: An Application for German Text Complexity Estimation

2022 - 🏆3rd place (out of 20 teams) at IberLEF-2022: Towards Robust Spanish Author Profiling and Lessons Learned from Adversarial Attacks

2021 - 🏆2nd place (in both defender and attacker tracks) at MLSEC-2021: Thwarting Adversarial Malware Evasion with a Defense-in-Depth

2021 - 6th place (out of 31 teams) at IberLEF-2021 Task 1: Deep Learning Approaches to Toxicity Detection in Spanish Social Media Texts

2021 - Granted USPTO anti-ransomware patent: Detecting and protecting against computing breaches based on lateral movement of a computer file within an enterprise

2021 - 🏆3rd place (out of 48 teams) at SemEval-2021 Task 1: Exploring Sentence and Word Features for Lexical Complexity Prediction

2020 - 45th place (out of 6351 teams) at Kaggle IEEE-CIS Fraud Detection

2020 - 10th place (out of 82 teams) at SemEval-2020 Task 12: Offensive Language Detection Using Neural Networks and Anti-adversarial Features

Software

Spanish Metaphone

Metaphone is a phonetic algorithm published in 1990 for indexing words by their English pronunciation. It fundamentally improves on the Soundex algorithm by using information about variations and inconsistencies in English spelling and pronunciation to produce a more accurate encoding, which does a better job of matching words and names that sound similar. As with Soundex, similar-sounding words should share the same keys.

This is an adaptation for the Spanish language, implemented in Python.

For example (input word, metaphone):

waterpolo -> UTRPL
aquino -> AKN
rebosar -> RVSR
rebozar -> RVZR
grajea -> GRJ
gragea -> GRJ
encima -> ENZM
enzima -> ENZM
alhamar -> ALAMR
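As a rough illustration of how such keys can be used, the sketch below groups the example words above by their key. It does not reimplement the encoder: the KEYS table is simply copied from the list above, and group_by_key is a hypothetical helper name, not part of the library.

```python
from collections import defaultdict

# Keys copied from the examples above; the encoder itself is not reproduced here.
KEYS = {
    "waterpolo": "UTRPL",
    "aquino": "AKN",
    "rebosar": "RVSR",
    "rebozar": "RVZR",
    "grajea": "GRJ",
    "gragea": "GRJ",
    "encima": "ENZM",
    "enzima": "ENZM",
    "alhamar": "ALAMR",
}

def group_by_key(keys):
    """Index words by phonetic key so that homophones land in the same bucket."""
    groups = defaultdict(list)
    for word, key in keys.items():
        groups[key].append(word)
    return dict(groups)

groups = group_by_key(KEYS)
# Words that share a key are candidate phonetic matches.
matches = {key: words for key, words in groups.items() if len(words) > 1}
print(matches)
```

In this sample, grajea/gragea and encima/enzima end up in shared buckets, while rebosar and rebozar keep distinct keys.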

NaiveSumm

NaiveSumm is a naive summarization approach based on Luhn's 1958 paper "The Automatic Creation of Literature Abstracts". It uses the frequencies of words in the document to score and extract the sentences that include the most frequent words.
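The idea can be sketched in a few lines of Python. This is a simplified illustration rather than NaiveSumm's actual code: the stopword list, the regex-based sentence splitter, and the average-frequency scoring are all assumptions made for this example (Luhn's original method scores clusters of significant words).

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not NaiveSumm's own list).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "that", "on", "for"}

def split_sentences(text):
    """Split on sentence-ending punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Lowercase the sentence and keep alphabetic tokens that are not stopwords."""
    return [w for w in re.findall(r"[a-záéíóúüñ]+", sentence.lower()) if w not in STOPWORDS]

def summarize(text, max_sentences=2):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    freqs = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        # Average document frequency of the sentence's content words.
        words = tokenize(sentence)
        return sum(freqs[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return [sentences[i] for i in keep]

text = "Cats chase mice. Cats sleep a lot. Dogs bark."
print(summarize(text, max_sentences=2))
```

Averaging by sentence length keeps long sentences from winning on word count alone; a raw sum would be an equally valid choice for a naive baseline.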

Selected publications

Alejandro Mosquera, Elena Lloret, and Paloma Moreda. Towards facilitating the accessibility of Web 2.0 texts through text normalisation. In Proceedings of the LREC workshop: Natural Language Processing for Improving Textual Accessibility (NLP4ITA), Istanbul, Turkey, pages 9--14, 2012. [ bib ] [ pdf ]

Alejandro Mosquera and Paloma Moreda. SMILE: An informality classification tool for helping to assess quality and credibility in Web 2.0 texts. In Proceedings of the ICWSM workshop: Real-Time Analysis and Mining of Social Streams (RAMSS), 2012. [ bib ] [ pdf ]

Alejandro Mosquera and Paloma Moreda. TENOR: A lexical normalisation tool for Spanish Web 2.0 texts. In Text, Speech and Dialogue - 15th International Conference (TSD 2012). Springer, 2012. [ bib ]

Estela Saquete, Sonia Vázquez, Elena Lloret, Fernando Llopis, J. M. Gómez-Soriano, and Alejandro Mosquera. Improving reading comprehension of educational texts by simplification of language barriers. In Proceedings of the 5th International Conference on Education and New Learning Technologies (EduLearn 2013), 2013. [ bib ]

Lamine Aouad, Alejandro Mosquera, Slawomir Grzonkowski, and Dylan Morss. SMS spam: A holistic view. In Proceedings of SECRYPT 2014 - The International Conference on Security and Cryptography, 2014. [ bib ]

Slawomir Grzonkowski, Alejandro Mosquera, Lamine Aouad, and Dylan Morss. Smartphone security: An overview of emerging threats. IEEE Consumer Electronics Magazine, vol. 3, no. 4, October 2014. [ bib ]

Alejandro Mosquera, Lamine Aouad, Slawomir Grzonkowski, and Dylan Morss. On detecting messaging abuse in short text messages using linguistic and behavioral patterns. http://arxiv.org/pdf/1408.3934v1, 2014. [ bib ] [ pdf ]

Ryan R. Curtin, Andrew B. Gardner, Slawomir Grzonkowski, Alexey Kleymenov, and Alejandro Mosquera. Detecting DGA domains with recurrent neural networks and side information. In Proceedings of the 14th International Conference on Availability, Reliability and Security, pages 1--10, 2019. [ bib ] [ pdf ]

Alejandro Mosquera. Amsqr at SemEval-2020 task 12: Offensive language detection using neural networks and anti-adversarial features. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, pages 1898--1905, 2020. [ bib ] [ pdf ]

Alejandro Mosquera. Alejandro Mosquera at SemEval-2021 task 1: Exploring sentence and word features for lexical complexity prediction. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), pages 554--559, 2021. [ bib ] [ pdf ]

Alejandro Mosquera López. Alejandro Mosquera at DETOXIS 2021: Deep learning approaches to toxicity detection in Spanish social media texts. In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021) co-located with the Conference of the Spanish Society for Natural Language Processing (SEPLN 2021), XXXVII International Conference, volume 2943, pages 573--579. CEUR-WS.org, 2021. [ bib ] [ pdf ]

Alejandro Mosquera. Amsqr at MLSEC-2021: Thwarting adversarial malware evasion with a defense-in-depth, 2021. [ bib ] [ pdf ]

Alejandro Mosquera. Amsqr at SemEval-2022 task 4: Towards AutoNLP via meta-learning and adversarial data augmentation for PCL detection. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), 2022. [ bib ] [ pdf ]

Alejandro Mosquera. Alejandro Mosquera at PoliticEs 2022: Towards robust Spanish author profiling and lessons learned from adversarial attacks. In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2022) co-located with the Conference of the Spanish Society for Natural Language Processing (SEPLN 2022), XXXVIII International Conference, 2022. [ bib ] [ pdf ]

Alejandro Mosquera. Tackling data drift with adversarial validation: An application for German text complexity estimation. In Proceedings of the 18th Conference on Natural Language Processing (KONVENS 2022), 2022. [ bib ] [ pdf ]

Presentations

2014 - 5th Workshop on Language Analysis for Social Media (LASM), Sweden: Mining Lexical Variants from Microblogs: An Unsupervised Multilingual Approach

2013 - Conference and Labs of the Evaluation Forum (CLEF), Spain: DLSI-Volvam at RepLab 2013: Polarity Classification on Twitter Data

2013 - Tweet Normalization Workshop co-located with 29th Conference of the Spanish Society for Natural Language Processing (SEPLN), Spain: DLSI en Tweet-Norm 2013: Normalización de Tweets en Español

2012 - TSD: Text, Speech and Dialogue, Czech Republic: TENOR: A Lexical Normalisation Tool for Spanish Web 2.0 Texts

2012 - NLDB: Natural Language Processing and Information Systems, Holland: The Study of Informality as a Framework for Evaluating the Normalisation of Web 2.0 Texts

2012 - Real-Time Analysis and Mining of Social Media Streams (RAMSS), Ireland: SMILE: An Informality Classification Tool for Helping to Assess Quality and Credibility in Web 2.0 Texts

2012 - Natural Language Processing for Improving Textual Accessibility (NLP4ITA), Turkey: Towards Facilitating the Accessibility of Web 2.0 Texts through Text Normalisation

2012 - @NLP can u tag #user_generated_content? (NLP4UGC), Turkey: A Qualitative Analysis of Informality Levels In Web 2.0 Texts: The Facebook Case Study

2011 - Symposium in Information and Human Language Technology (STIL), Brazil: The Use of Metrics for Measuring Informality Levels in Web 2.0 Texts

2011 - 3rd Language Technology Conference (LTC), Poland: Enhancing the Discovery of Informality Levels in Web 2.0 texts

Media mentions

2023 - Recognized as a Webometrics ambassador: A ranking of Spanish researchers working abroad according to their Google Scholar Citations public profiles

2021 - 🏆Winner interview for the 3rd Machine Learning Security Evasion Competition (MLSEC) sponsored by CUJO AI, Microsoft, VM-Ray, MRG Effitas and NVIDIA. As a 2x prize winner I had the opportunity to publish my findings about the Adversarial Threat Landscape for Artificial-Intelligence Systems

2016 - 🏆Kaggle Winner's interview for 3rd place at the Allen AI Science Challenge. I was also interviewed by Cade Metz for Wired magazine and mentioned in the Allen AI final report: Moving Beyond the Turing Test with the Allen AI Science Challenge

2014 - Coverage of our work defending SMS networks while working at Symantec: Security rEsrchRs find nu way 2 spot TXT spam

Procedural generation

JS1k submission using L-systems.

Chromanin.js, a procedural texture generation library.

Procedural audio using L-systems based on Ville-Matias Heikkilä (2011). Discovering novel computer music techniques by exploring the space of short computer programs.

SuperShapes, SuperFormula based on Johan Gielis (2003). A generic geometric transformation that unifies a wide range of natural and abstract shapes.
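For readers unfamiliar with the SuperFormula, here is a minimal Python sketch of Gielis's polar equation, r(phi) = (|cos(m*phi/4)/a|^n2 + |sin(m*phi/4)/b|^n3)^(-1/n1). It is not the code behind the demo above; the function names and the sampling helper are assumptions made for illustration.

```python
import math

def superformula(phi, m, n1, n2, n3, a=1.0, b=1.0):
    """Radius of the supershape at angle phi (radians)."""
    t1 = abs(math.cos(m * phi / 4.0) / a) ** n2
    t2 = abs(math.sin(m * phi / 4.0) / b) ** n3
    return (t1 + t2) ** (-1.0 / n1)

def supershape_points(m, n1, n2, n3, samples=360):
    """Sample the closed curve as a list of (x, y) points."""
    points = []
    for i in range(samples):
        phi = 2.0 * math.pi * i / samples
        r = superformula(phi, m, n1, n2, n3)
        points.append((r * math.cos(phi), r * math.sin(phi)))
    return points

# With m=4 and n1=n2=n3=2 the formula reduces to the unit circle,
# because cos^2 + sin^2 = 1; other parameters bend it into stars and petals.
circle = supershape_points(m=4, n1=2, n2=2, n3=2, samples=8)
star = supershape_points(m=5, n1=0.3, n2=0.3, n3=0.3, samples=360)
```

The parameter m controls the rotational symmetry, while n1, n2 and n3 control how pinched or rounded the lobes are.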

Contact

My social accounts are linked below: