SP10 | A panorama of the use of technology in the European context of productive skills assessment | Vincent Folny
-
The Special Interest Group on Technology of the Association of Language Testers in Europe (ALTE) sent out several questionnaires to its members in 2024 and 2025. Thirty-four universities, public bodies and private companies involved in language testing replied to these questionnaires. This study gives a unique panorama of the use of technology by language testers in Europe, covering institutions that deal with millions of candidates per year as well as less widely tested languages. The diversity of ALTE members, and of the languages they represent, allows this study to produce significant results. This study will serve as the starting point of the presentation, showing where well-established language test providers stand in 2025 with respect to the assessment of productive skills.
After presenting the results of this study, we will explain the procedures currently used by language testers to assess candidates' productive skills while ensuring invariant measurement and reproducibility (Engelhard, 2013), feasibility, economic viability and validity. We will present the diversity of solutions found by test providers (number of raters, number of tasks, use of the Rasch model, automation, use of artificial intelligence, etc.). Since Page (1966), automatic scoring has been announced, developed, criticised, rebuilt and improved (Williamson et al., 2012; Klebanov & Madnani, 2022). The language model BERT (Devlin et al., 2018) introduced a new era for classification tasks based on artificial intelligence, and also for automated rating. Since 2019, the automation of productive skills rating has evolved rapidly and improved significantly. Many papers and books have introduced new knowledge and procedures (Jiao & Lissitz, 2020; Klebanov & Madnani, 2022; Sadeghi & Douglas, 2023; Yan et al., 2020; Yaneva & Von Davier, 2023).
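To make the link between rater variation and invariant measurement concrete, the idea behind the many-facet extension of the Rasch model can be sketched as follows. This is a minimal illustrative example, not any provider's operational model: the logit combines candidate ability, task difficulty and rater severity, and the parameter values are invented for the demonstration.

```python
import math

def mfr_probability(ability, task_difficulty, rater_severity):
    """Probability that a candidate reaches a score threshold under a
    toy many-facet Rasch formulation (illustrative only):
    logit = ability - task difficulty - rater severity."""
    logit = ability - task_difficulty - rater_severity
    return 1.0 / (1.0 + math.exp(-logit))

# Same candidate, same task, two raters of different severity
# (hypothetical values): the severe rater lowers the probability,
# which is exactly the facet the model estimates and removes.
p_severe = mfr_probability(ability=1.0, task_difficulty=0.5, rater_severity=0.8)
p_lenient = mfr_probability(ability=1.0, task_difficulty=0.5, rater_severity=-0.8)
print(round(p_severe, 3), round(p_lenient, 3))  # → 0.426 0.786
```

Estimating and adjusting for the rater-severity facet is what allows candidate measures to remain comparable across raters, tasks and sessions.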
In 2025, the use of AI for evaluation is seen as efficient or promising, but also as a magic tool, and it comes with significant caveats. There is a need to define what the language tester actually needs, which kind of API, LLM or model should be used, and to distinguish clearly between the respective assets of the machine and the human. Opportunities should be analysed, but so should the limits of generative artificial intelligence and the necessity of distinguishing between an encoder and a decoder.
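The encoder/decoder distinction mentioned above is structural, and can be sketched with a toy attention mask (a simplified illustration, not production model code): an encoder such as BERT attends bidirectionally over the whole text, which suits classification-style rating, while a decoder such as a generative LLM attends only to preceding tokens, which suits text generation.

```python
import numpy as np

def attention_mask(seq_len: int, causal: bool) -> np.ndarray:
    """Toy self-attention mask: 1 means position i may attend to position j.
    Encoder (BERT-style): full bidirectional mask, every token sees every other.
    Decoder (generative LLM-style): causal lower-triangular mask, no look-ahead."""
    full = np.ones((seq_len, seq_len), dtype=int)
    return np.tril(full) if causal else full

encoder_mask = attention_mask(4, causal=False)  # all 16 positions visible
decoder_mask = attention_mask(4, causal=True)   # only past and present visible
print(decoder_mask)
```

For automated rating, this is why encoder-based classifiers and generative decoders are not interchangeable: the former score a complete response as a whole, while the latter are optimised to produce text token by token.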
These very technical procedures should not obscure the need to consider the ethical dimension of the use of technology. There is a need to revisit the dimensions of diversity, equity and inclusion in the context of automated essay scoring for language testing and the rating of productive skills. This revision will be an opportunity to discuss good practices to be developed by language testers.
At the end of the presentation, we will begin a prospective exercise to imagine what the assessment of productive skills will look like for language test providers, considering the diversity of ALTE members' profiles, the diversity of European languages and the resources available. We will work through different scenarios to better understand where the stakes might lie over the next 10 to 15 years.