A BERT-BASED WORD ALIGNMENT MODEL FOR VIETNAMESE-JAPANESE SENTENCE PAIRS

Lê Thanh Tùng; Nguyễn Hồng Bửu Long; Hoàng Khuê

doi:10.54607/hcmue.js.20.2.3618(2023)

Abstract

Word alignment plays an important role in many stages of natural language processing. Many studies have addressed a variety of language pairs, but work on Japanese-Vietnamese bilingual sentence pairs remains limited. Most Japanese-Vietnamese word alignments are produced by alignment tools based on statistical methods or on unsupervised learning, and their accuracy is low. In this study, we manually build a Japanese-Vietnamese word alignment corpus, then implement and train an automatic word alignment model for Japanese-Vietnamese bilingual sentence pairs. Our word alignment model outperforms the GIZA++ tool by 20.06 points in accuracy. The result is a Japanese-Vietnamese word alignment model that is state of the art at the time of writing.

Keywords

BERT; Japanese-Vietnamese; corpus; SQuAD; word alignment
