Publications | Sara Papi

2026

ICLR
Instruction Following

MCIF: Multimodal crosslingual instruction-following benchmark from scientific talks

Sara Papi, Maike Züfle, Marco Gaido, and 5 more authors

In The Fourteenth International Conference on Learning Representations, 2026

HF🤗 arXiv Code

2025

IWSLT Instruction Following

Findings of the IWSLT 2025 Evaluation Campaign

Idris Abdulmumin, Victor Agostinelli, Tanel Alumäe, and 49 more authors

In Proceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025), Jul 2025

DOI PDF
IWSLT Foundation Model

The Warmup Dilemma: How Learning Rate Strategies Impact Speech-to-Text Model Convergence

Marco Gaido^*, Sara Papi^*, Luisa Bentivogli, and 6 more authors

In Proceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025), Jul 2025

DOI arXiv PDF Code Slides
IWSLT Summarization

NUTSHELL: A Dataset for Abstract Generation from Scientific Talks

Maike Züfle, Sara Papi, Beatrice Savoldi, and 3 more authors

In Proceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025), Jul 2025

HF🤗 DOI arXiv PDF
NAACL
Speech Processing

Prepending or Cross-Attention for Speech-to-Text? An Empirical Comparison

Tsz Kin Lam, Marco Gaido, Sara Papi, and 2 more authors

In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Apr 2025

DOI arXiv PDF Code Poster
TACL Speech Translation

How “Real” is Your Real-Time Simultaneous Speech-to-Text Translation System?

Sara Papi, Peter Polák, Dominik Macháček, and 1 more author

Transactions of the Association for Computational Linguistics, Apr 2025

DOI arXiv PDF Video Poster Slides
INTERSPEECH
Dataset

Granary: Speech Recognition and Translation Dataset in 25 European Languages

Nithin Rao Koluguri, Monica Sekoyan, George Zelenfroynd, and 12 more authors

In Interspeech 2025, Apr 2025

HF🤗 DOI arXiv PDF Code
INTERSPEECH
SpeechLLM

How to Connect Speech Foundation Models and Large Language Models? What Matters and What Does Not

Francesco Verdini, Pierfrancesco Melucci, Stefano Perna, and 9 more authors

In Interspeech 2025, Apr 2025

DOI arXiv PDF

2024

EMNLP
Dataset

MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages

Marco Gaido^*, Sara Papi^*, Luisa Bentivogli, and 6 more authors

In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Nov 2024

HF🤗 arXiv PDF Video Code Poster Slides
EMNLP
Human-Centered AI

What the Harm? Quantifying the Tangible Impact of Gender Bias in Machine Translation with a Human-centered Study

Beatrice Savoldi, Sara Papi, Matteo Negri, and 2 more authors

In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (Social Impact Paper Award), Nov 2024

Awarded arXiv PDF Poster

EMNLP 2024 Social Impact Paper Award
IWSLT Foundation Model

SimulSeamless: FBK at IWSLT 2024 Simultaneous Speech Translation

Sara Papi, Marco Gaido, Matteo Negri, and 1 more author

In Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024), Aug 2024

arXiv PDF Code Poster
IWSLT Automatic Subtitling

Automatic Subtitling and Subtitle Compression: FBK at the IWSLT 2024 Subtitling track

Marco Gaido, Sara Papi, Mauro Cettolo, and 4 more authors

In Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024), Aug 2024

PDF Poster
IWSLT

Findings of the IWSLT 2024 Evaluation Campaign

Ibrahim Said Ahmad, Antonios Anastasopoulos, Ondřej Bojar, and 41 more authors

In Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024), Aug 2024

DOI PDF
ACL
Speech Translation

StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection

Sara Papi, Marco Gaido, Matteo Negri, and 1 more author

In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Aug 2024

arXiv PDF Video Code Poster Slides
ACL
Speech Translation

Speech Translation with Speech Foundation Models and Large Language Models: What is There and What is Missing?

Marco Gaido, Sara Papi, Matteo Negri, and 1 more author

In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Outstanding Paper and SAC Award), Aug 2024

Awarded arXiv PDF Poster Slides

ACL 2024 Outstanding Paper and Senior Area Chair Award
ACL
Automatic Subtitling

SBAAM! Eliminating Transcript Dependency in Automatic Subtitling

Marco Gaido, Sara Papi, Matteo Negri, and 2 more authors

In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Aug 2024

Subtitling plays a crucial role in enhancing the accessibility of audiovisual content and encompasses three primary subtasks: translating spoken dialogue, segmenting translations into concise textual units, and estimating timestamps that govern their on-screen duration. Past attempts to automate this process rely, to varying degrees, on automatic transcripts, employed diversely for the three subtasks. In response to the acknowledged limitations associated with this reliance on transcripts, recent research has shifted towards transcription-free solutions for translation and segmentation, leaving the direct generation of timestamps as uncharted territory. To fill this gap, we introduce the first direct model capable of producing automatic subtitles, entirely eliminating any dependence on intermediate transcripts also for timestamp prediction. Experimental results, backed by manual evaluation, showcase our solution’s new state-of-the-art performance across multiple language pairs and diverse conditions.
ACL

When Good and Reproducible Results are a Giant with Feet of Clay: The Importance of Software Quality in NLP

Sara Papi^*, Marco Gaido^*, Andrea Pilzer, and 1 more author

In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Aug 2024

Despite its crucial role in research experiments, code correctness is often presumed solely based on the perceived quality of results. This assumption, however, comes with the risk of erroneous outcomes and, in turn, potentially misleading findings. To mitigate this risk, we posit that the current focus on reproducibility should go hand in hand with the emphasis on software quality. We support our arguments with a case study in which we identify and fix three bugs in widely used implementations of the state-of-the-art Conformer architecture. Through experiments on speech recognition and translation in various languages, we demonstrate that the presence of bugs does not prevent the achievement of good and reproducible results, which however can lead to incorrect conclusions that potentially misguide future research. As countermeasures, we release pangoliNN, a library dedicated to testing neural models, and propose a Code-quality Checklist, with the goal of promoting coding best practices and improving software quality within the NLP community.
ICASSP
Speech Translation, Speech Recognition

Leveraging Timestamp Information for Serialized Joint Streaming Recognition and Translation

Sara Papi, Peidong Wang, Junkun Chen, and 4 more authors

In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Aug 2024

DOI
How do Hyenas deal with Human Speech? Speech Recognition and Translation with ConfHyena

Marco Gaido, Sara Papi, Matteo Negri, and 1 more author

Aug 2024

2023

Integrating Language Models into Direct Speech Translation: An Inference-Time Solution to Control Gender Inflection

Dennis Fucci, Marco Gaido, Sara Papi, and 3 more authors

In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Dec 2023

When translating words referring to the speaker, speech translation (ST) systems should not resort to default masculine generics nor rely on potentially misleading vocal traits. Rather, they should assign gender according to the speakers’ preference. The existing solutions to do so, though effective, are hardly feasible in practice as they involve dedicated model re-training on gender-labeled ST data. To overcome these limitations, we propose the first inference-time solution to control speaker-related gender inflections in ST. Our approach partially replaces the (biased) internal language model (LM) implicitly learned by the ST decoder with gender-specific external LMs. Experiments on en→es/fr/it show that our solution outperforms the base models and the best training-time mitigation strategy by up to 31.0 and 1.6 points in gender accuracy, respectively, for feminine forms. The gains are even larger (up to 32.0 and 3.4) in the challenging condition where speakers’ vocal traits conflict with their gender.
TACL Automatic Subtitling

Direct Speech Translation for Automatic Subtitling

Sara Papi, Marco Gaido, Alina Karakanta, and 3 more authors

Transactions of the Association for Computational Linguistics, Nov 2023

DOI
Joint Speech Translation and Named Entity Recognition

Marco Gaido, Sara Papi, Matteo Negri, and 1 more author

In INTERSPEECH 2023, Aug 2023

DOI
When Good and Reproducible Results are a Giant with Feet of Clay: The Importance of Software Quality in NLP

Sara Papi, Marco Gaido, Andrea Pilzer, and 1 more author

Aug 2023

arXiv:2303.16166 [cs]

DOI

Despite its crucial role in research experiments, code correctness is often presumed only on the basis of the perceived quality of results. This assumption comes with the risk of erroneous outcomes and potentially misleading findings. To address this issue, we posit that the current focus on reproducibility should go hand in hand with the emphasis on software quality. We present a case study in which we identify and fix three bugs in widely used implementations of the state-of-the-art Conformer architecture. Through experiments on speech recognition and translation in various languages, we demonstrate that the presence of bugs does not prevent the achievement of good and reproducible results, which however can lead to incorrect conclusions that potentially misguide future research. As a countermeasure, we propose a Code-quality Checklist and release pangoliNN, a library dedicated to testing neural models, with the goal of promoting coding best practices and improving research software quality within the NLP community.
AlignAtt: Using Attention-based Audio-Translation Alignments as a Guide for Simultaneous Speech Translation

Sara Papi, Marco Turchi, and Matteo Negri

In INTERSPEECH 2023, Aug 2023

DOI
Token-Level Serialized Output Training for Joint Streaming ASR and ST Leveraging Textual Alignments

Sara Papi, Peidong Wang, Junkun Chen, and 3 more authors

In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Aug 2023

DOI
Attention as a Guide for Simultaneous Speech Translation

Sara Papi, Matteo Negri, and Marco Turchi

In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Aug 2023

DOI
Direct Models for Simultaneous Translation and Automatic Subtitling: FBK@IWSLT2023

Sara Papi, Marco Gaido, and Matteo Negri

In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023), Aug 2023

DOI

2022

Dodging the Data Bottleneck: Automatic Subtitling with Automatically Segmented ST Corpora

Sara Papi, Alina Karakanta, Matteo Negri, and 1 more author

In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Nov 2022

Speech translation for subtitling (SubST) is the task of automatically translating speech data into well-formed subtitles by inserting subtitle breaks compliant to specific displaying guidelines. Similar to speech translation (ST), model training requires parallel data comprising audio inputs paired with their textual translations. In SubST, however, the text has to be also annotated with subtitle breaks. So far, this requirement has represented a bottleneck for system development, as confirmed by the dearth of publicly available SubST corpora. To fill this gap, we propose a method to convert existing ST corpora into SubST resources without human intervention. We build a segmenter model that automatically segments texts into proper subtitles by exploiting audio and text in a multimodal fashion, achieving high segmentation quality in zero-shot conditions. Comparative experiments with SubST systems respectively trained on manual and automatic segmentations result in similar performance, showing the effectiveness of our approach.
Does Simultaneous Speech Translation need Simultaneous Models?

Sara Papi, Marco Gaido, Matteo Negri, and 1 more author

In Findings of the Association for Computational Linguistics: EMNLP 2022, Nov 2022

DOI
Efficient yet Competitive Speech Translation: FBK@IWSLT2022

Marco Gaido, Sara Papi, Dennis Fucci, and 3 more authors

In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), Nov 2022

DOI
Over-Generation Cannot Be Rewarded: Length-Adaptive Average Lagging for Simultaneous Speech Translation

Sara Papi, Marco Gaido, Matteo Negri, and 1 more author

In Proceedings of the Third Workshop on Automatic Simultaneous Translation, Nov 2022

DOI

2021

Speechformer: Reducing Information Loss in Direct Speech Translation

Sara Papi, Marco Gaido, Matteo Negri, and 1 more author

In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Nov 2021

DOI

Transformer-based models have gained increasing popularity achieving state-of-the-art performance in many research fields including speech translation. However, Transformer’s quadratic complexity with respect to the input sequence length prevents its adoption as is with audio signals, which are typically represented by long sequences. Current solutions resort to an initial sub-optimal compression based on a fixed sampling of raw audio features. Therefore, potentially useful linguistic information is not accessible to higher-level layers in the architecture. To solve this issue, we propose Speechformer, an architecture that, thanks to reduced memory usage in the attention layers, avoids the initial lossy compression and aggregates information only at a higher level according to more informed linguistic criteria. Experiments on three language pairs (en→de/es/nl) show the efficacy of our solution, with gains of up to 0.8 BLEU on the standard MuST-C corpus and of up to 4.0 BLEU in a low resource scenario.
Simultaneous Speech Translation for Live Subtitling: from Delay to Display

Alina Karakanta, Sara Papi, Matteo Negri, and 1 more author

In Proceedings of the 1st Workshop on Automatic Spoken Language Translation in Real-World Settings (ASLTRW), Aug 2021

With the increased audiovisualisation of communication, the need for live subtitles in multilingual events is more relevant than ever. In an attempt to automatise the process, we aim at exploring the feasibility of simultaneous speech translation (SimulST) for live subtitling. However, the word-for-word rate of generation of SimulST systems is not optimal for displaying the subtitles in a comprehensible and readable way. In this work, we adapt SimulST systems to predict subtitle breaks along with the translation. We then propose a display mode that exploits the predicted break structure by presenting the subtitles in scrolling lines. We compare our proposed mode with a display 1) word-for-word and 2) in blocks, in terms of reading speed and delay. Experiments on three language pairs (en→it, de, fr) show that scrolling lines is the only mode achieving an acceptable reading speed while keeping delay close to a 4-second threshold. We argue that simultaneous translation for readable live subtitles still faces challenges, the main one being poor translation quality, and propose directions for steering future research.
Dealing with training and test segmentation mismatch: FBK@IWSLT2021

Sara Papi, Marco Gaido, Matteo Negri, and 1 more author

In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), Aug 2021

DOI

This paper describes FBK’s system submission to the IWSLT 2021 Offline Speech Translation task. We participated with a direct model, which is a Transformer-based architecture trained to translate English speech audio data into German texts. The training pipeline is characterized by knowledge distillation and a two-step fine-tuning procedure. Both knowledge distillation and the first fine-tuning step are carried out on manually segmented real and synthetic data, the latter being generated with an MT system trained on the available corpora. Differently, the second fine-tuning step is carried out on a random segmentation of the MuST-C v2 En-De dataset. Its main goal is to reduce the performance drops occurring when a speech translation model trained on manually segmented data (i.e. an ideal, sentence-like segmentation) is evaluated on automatically segmented audio (i.e. actual, more realistic testing conditions). For the same purpose, a custom hybrid segmentation procedure that accounts for both audio content (pauses) and for the length of the produced segments is applied to the test data before passing them to the system. At inference time, we compared this procedure with a baseline segmentation method based on Voice Activity Detection (VAD). Our results indicate the effectiveness of the proposed hybrid approach, shown by a reduction of the gap with manual segmentation from 8.3 to 1.4 BLEU points.

2020

Mixtures of Deep Neural Experts for Automated Speech Scoring

Sara Papi, Edmondo Trentin, Roberto Gretter, and 2 more authors

In Proc. Interspeech 2020, Aug 2020

DOI