Author: Olga S. Shablykina studies NLP at IDMC, Université de Lorraine (Nancy). In 2024 she graduated from the National Research University Higher School of Economics (Moscow) with a major in Languages and a minor in Data Science.
While qualitative approaches to metaphor studies are well established, many researchers also incorporate quantitative methods. Even simply counting instances of linguistic metaphors has yielded valuable insights into the persuasiveness of political speeches (Mio et al., 2005; Sun et al., 2021; Van Stee, 2018) and the intensity of economic crises (Landtsheer, 2015). However, manually extracting metaphor-related words from large volumes of text is time-consuming and requires annotator expertise.
This study investigates the feasibility of using a conversational AI agent to (semi-)automate the task of metaphor detection. The Command R+ model (Cohere For AI, 2024) on the HuggingChat platform was selected for its accessibility, as it offered better availability than OpenAI alternatives. To create a custom agent, a system prompt was written using prompting techniques such as role assignment and chain-of-thought (Schulhoff et al., 2024).
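To make the setup concrete, the sketch below shows how a system prompt combining role assignment and chain-of-thought cues for MIPVU-style metaphor annotation could be assembled. It is an illustrative assumption only: the prompt wording, the build_messages helper, and the example sentence are hypothetical and do not reproduce the prompt actually used in the study.

```python
# Hypothetical sketch of a system prompt combining role assignment and
# chain-of-thought cues for metaphor detection (not the prompt used in the study).

SYSTEM_PROMPT = (
    "You are an expert linguist annotating metaphor-related words "        # role assignment
    "following the MIPVU procedure. For every content word, first state "  # chain-of-thought:
    "its contextual meaning, then its most basic meaning, then decide "    # reason step by step
    "whether the contextual meaning contrasts with, yet can be understood "
    "in comparison with, the basic meaning. Finally, list the words you "
    "marked as metaphor-related."
)

def build_messages(sentence: str) -> list[dict]:
    """Package the system prompt and one sentence into a chat-style message list."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Annotate this sentence: {sentence}"},
    ]

if __name__ == "__main__":
    # In the study the prompt was pasted into a HuggingChat custom assistant;
    # here we only print the message structure such an agent would receive.
    for message in build_messages("Our war on nature must end."):
        print(message["role"].upper(), "->", message["content"])
```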
As a proof of concept, António Guterres' UN speech on priorities for 2023 (Guterres, 2023) was chosen for analysis. First, speechwriters tend to employ metaphors to achieve a larger impact on the audience (Scotto di Carlo, 2023). Second, the high quality of UN translations (Šoša, 2022) allows the retrieval of semantically close material in multiple languages, namely English and French. Finally, the text in question had not previously been annotated for metaphors, which safeguards against data leakage into the pretraining set (Golchin & Surdeanu, 2024) that could otherwise distort the assessment of the tool's or model's generalization ability.
The AI chat output was compared to a gold standard of human annotation following the MIP(VU) protocol (Reijnierse, 2019; Steen et al., 2010), which revealed 319 and 272 metaphor-related words in the English and French texts, respectively. The conversational agent's results were also compared to FrameBERT (Li et al., 2023), one of the best-performing models on the VUA-20 shared task (Leong et al., 2020), with an F1 score of 0.73.
Precision, recall, and F1 scores were used as performance metrics because of the unequal distribution of labels: in each text, less than 8% of words were labeled as metaphor-related. Correctly identified metaphoric and non-metaphoric words were counted as true positives (TP) and true negatives (TN), respectively.
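As a minimal illustration of how these token-level metrics are computed, the sketch below compares a toy gold annotation with toy system predictions; the label sequences are invented for the example and do not reproduce the study's data.

```python
# Minimal sketch of token-level precision, recall, and F1 for metaphor labels.
# 1 = metaphor-related word, 0 = non-metaphoric word. The toy sequences below
# are invented for illustration and mimic the strong class imbalance in the data.

gold = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0]  # human (MIPVU) annotation
pred = [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]  # system output

tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)  # correct metaphor labels
fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)  # spurious metaphor labels
fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)  # missed metaphors
tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)  # correct non-metaphor labels

precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

# With heavy class imbalance, accuracy would be dominated by the many TNs,
# which is why precision, recall, and F1 are reported instead.
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```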
As regards the results, the conversational agent achieved higher precision (0.67) but substantially lower recall (0.10) than FrameBERT on the English data. The chat agent also required more time to complete a full scan of the text than the BERT-based model: 5–6 minutes versus 2 minutes. FrameBERT (with borderline cases included) achieved the highest F1 score at 0.49; however, it failed to capture some vivid metaphoric adjectives, as in the phrase “vampiric overconsumption”. As for French, although it is a comparatively high-resource language, neither the conversational agent nor FrameBERT was optimized for multilingual application.
In conclusion, the surveyed automatic systems for metaphor identification remain less reliable than human annotation. Further work should address a wider variety of texts in terms of language, topic, and genre, and should test more sophisticated (multilingual) models and prompts.
References
Cohere For AI. (2024, September 24). Command R+. Cohere AI. https://docs.cohere.com/docs/command-r-plus
Golchin, S., & Surdeanu, M. (2024). Time travel in LLMs: Tracing data contamination in large language models (No. arXiv:2308.08493). arXiv. http://arxiv.org/abs/2308.08493
Guterres, A. (2023). Secretary-General's briefing to the General Assembly on priorities for 2023. Geneva: UN, 6.
Landtsheer, C. (2015). Media rhetoric plays the market: The logic and power of metaphors behind the financial crises since 2006. Metaphor and the Social World, 5(2), 205–222.
Leong, C. W., Klebanov, B. B., Hamill, C., Stemle, E., Ubale, R., & Chen, X. (2020). A report on the 2020 VUA and TOEFL metaphor detection shared task. In Proceedings of the Second Workshop on Figurative Language Processing (pp. 18–29).
Li, Y., Wang, S., Lin, C., Guerin, F., & Barrault, L. (2023). FrameBERT: Conceptual metaphor detection with frame embedding learning. arXiv preprint arXiv:2302.04834.
Mio, J. S., Riggio, R. E., Levin, S., & Reese, R. (2005). Presidential leadership and charisma: The effects of metaphor. The Leadership Quarterly, 16(2), 287–294. https://doi.org/10.1016/j.leaqua.2005.01.005
Reijnierse, W. G. (2019). Linguistic metaphor identification in French. In S. Nacey, A. G. Dorst, T. Krennmayr, & W. G. Reijnierse (Eds.), Metaphor identification in multiple languages: MIPVU around the world (pp. 69–90). John Benjamins Publishing.
Schulhoff, S., Ilie, M., Balepur, N., Kahadze, K., Liu, A., Si, C., Li, Y., Gupta, A., Han, H., Schulhoff, S., Dulepet, P. S., Vidyadhara, S., Ki, D., Agrawal, S., Pham, C., Kroiz, G., Li, F., Tao, H., Srivastava, A., ... Resnik, P. (2024). The prompt report: A systematic survey of prompting techniques (No. arXiv:2406.06608). arXiv. http://arxiv.org/abs/2406.06608
Scotto di Carlo, G. (2023). ‘Pushing back against the pushback': WAR and JOURNEY metaphors in UN Secretary-General António Guterres' Commission on the Status of Women speeches. Lingue e Linguaggi, 59, 333–355.
Šoša, I. (2022). Translator training at United Nations headquarters, New York. https://doi.org/10.4324/9781003225249-19
Steen, G. J., Dorst, A. G., Herrmann, J. B., Kaal, A., Krennmayr, T., & Pasma, T. (2010). A method for linguistic metaphor identification: From MIP to MIPVU (Vol. 14). John Benjamins Publishing.
Sun, Y., Kalinin, O. I., & Ignatenko, A. V. (2021). The use of metaphor power indices for the analysis of speech impact in political public speeches. Russian Journal of Linguistics, 25(1), 250–277. https://doi.org/10.22363/2687-0088-2021-25-1-250-277
Van Stee, S. K. (2018). Meta-analysis of the persuasive effects of metaphorical vs. literal messages. Communication Studies, 69(5), 545–566. https://doi.org/10.1080/10510974.2018.1457553