Building Foundations for Natural Language Processing of Historical Turkish: Resources and Models

2501.04828

Şaziye Betül Özateş et al.

🤗 2025-01-10


TL;DR

Research in computational linguistics has predominantly focused on high-resource languages, neglecting historically significant yet under-resourced languages such as historical Turkish. The resulting scarcity of annotated datasets and trained models severely limits the application of advanced natural language processing (NLP) techniques to historical Turkish texts.

This paper addresses this gap by introducing the first named entity recognition (NER) dataset (HisTR), the first Universal Dependencies treebank (OTA-BOUN), and a clean corpus (OTC) of historical Turkish texts, alongside transformer-based models trained on these new resources for NER, dependency parsing, and part-of-speech (POS) tagging. Fine-tuning on the new resources improves results on historical Turkish over models trained only on modern Turkish data, and the reported scores serve as benchmarks for the NLP community. The study also highlights open challenges such as domain adaptation and language variation across time periods. All resources and models are publicly available.

Key Takeaways

Why does it matter?

This paper matters because it addresses the lack of datasets and models for processing historical Turkish. The resources it releases open new avenues for research on low-resource historical language processing and benefit related fields such as digital humanities and computational linguistics.

Visual Insights

🔼 This figure shows a sample of text from the Servet-i Funun journal, a historical Turkish magazine published between 1896 and 1901. The text is presented in two forms: the original Perso-Arabic script and a transliteration using the modern Turkish alphabet. This demonstrates the challenges involved in processing historical Turkish texts due to the use of a different script and significant differences between the historical and modern forms of the language.
Figure 1: Transcription of an excerpt from the original document which is written with the Perso-Arabic script.

Partition	# of Sentences	PERSON Counts	LOCATION Counts
Training set	462	264	584
Development set	200	122	210
Ruznamçe test set	150	265	216
Total	812	651	1,010

🔼 This table presents the partitions of the HisTR (Historical Turkish Named Entity Recognition) dataset: a training set, a development set, and the Ruznamçe test set, which serves as the out-of-domain evaluation set. For each partition, it gives the number of sentences and the counts of PERSON and LOCATION entities. This breakdown lets readers assess the size and composition of the data used to train and evaluate the named entity recognition models.
Table 1: Partitions in the HisTR dataset
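As a minimal sketch of how per-partition entity counts like those in Table 1 can be derived, the snippet below counts entities in tagged sentences. It assumes a BIO tagging scheme with `PERSON` and `LOCATION` labels, which the summary does not confirm, and the toy sentences are hypothetical rather than taken from HisTR.

```python
from collections import Counter


def count_entities(tag_sequences):
    """Count entities per type across BIO-tagged sentences.

    Each B- tag opens a new entity and I- tags continue it,
    so a multi-token name is counted once.
    """
    counts = Counter()
    for tags in tag_sequences:
        for tag in tags:
            if tag.startswith("B-"):
                counts[tag[2:]] += 1
    return counts


# Hypothetical toy data (assumed BIO scheme), not actual HisTR sentences.
toy_sentences = [
    ["B-PERSON", "I-PERSON", "O", "B-LOCATION", "O"],
    ["O", "B-LOCATION", "I-LOCATION", "O"],
]

entity_counts = count_entities(toy_sentences)
print("sentences:", len(toy_sentences))
print("PERSON:", entity_counts["PERSON"])
print("LOCATION:", entity_counts["LOCATION"])
```

Counting only `B-` tags keeps the figure at the entity level rather than the token level, which is the reading of "PERSON Counts" and "LOCATION Counts" assumed here.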

In-depth insights

HisTR NER Dataset

HisTR is the first named entity recognition (NER) dataset for historical Turkish and fills a gap in resources for this under-resourced language. It is manually annotated for PERSON and LOCATION entities, which makes it suitable for training and evaluating NER models designed specifically for historical Turkish. Its focus on the Ottoman era and its inclusion of diverse text types support robust model development. However, at 812 sentences it is small compared with modern NER datasets, which could limit the performance of more complex, data-hungry models. Manual annotation is also labor-intensive and prone to bias. Future work could expand HisTR with more data or apply techniques such as active learning and data augmentation to get more out of it.

OTA-BOUN Treebank

OTA-BOUN is the first Universal Dependencies treebank for historical Turkish. Manually annotated, it supports syntactic analysis and enables closer study of historical linguistic structures. Its inclusion of sentences in both Latin-based and Perso-Arabic scripts adds linguistic diversity. However, with 514 sentences and 8,794 tokens it is small, so more extensive annotation is needed. The annotation process was complicated by inconsistent historical spelling and archaic terminology, which illustrates how hard such texts are to work with. The treebank is already used in this study to train and benchmark parsing and POS tagging models, and expanding it would increase its value as a benchmark for future work on historical Turkish.

Historical Turkish NLP

Historical Turkish NLP presents unique challenges because the language underwent significant linguistic and orthographic change over the centuries. Historical Turkish varies substantially in vocabulary, grammar, and script (Perso-Arabic vs. Latin), which makes it difficult for standard NLP models to process. Existing resources are extremely scarce, and the available data is often noisy and incomplete because historical documents are hard to digitize and transcribe. The research therefore focuses on specialized datasets and models for named entity recognition (NER), dependency parsing, and part-of-speech tagging. It contributes three new resources (HisTR, OTA-BOUN, OTC), establishes baselines for future work, and releases the resources and models openly.

Model Fine-tuning

Rather than training from scratch on limited data, the authors fine-tune pre-trained language models: BERTurk, mBERT, and TURNA for NER, and BERTurk and mBERT within the STEPS parser for dependency parsing and POS tagging. This is a pragmatic choice, since it builds on the knowledge already encoded in those models. The results show both the promise and the limits of the approach. For NER, BERTurk fine-tuned on MilliyetNER and then on HisTR reaches 91.21 F1 on the in-domain development set but only 61.91 F1 on the out-of-domain Ruznamçe test set (Table 6), a gap that underscores the need for more data to improve generalization. For parsing, fine-tuning on the modern TR-BOUN treebank and then on OTA-BOUN gives the best scores among the configurations in Table 7 (81.51 UAS, 73.79 LAS), while fine-tuning on OTA-BOUN alone, with 114 training sentences, is clearly weaker (68.87 UAS, 59.70 LAS). Varying the size and combination of fine-tuning sets shows that large modern-Turkish data and small historical data complement each other in this low-resource setting.

Future Directions

Future research should prioritize expanding the datasets used in this study. Larger, more diverse datasets spanning broader time periods and genres of historical Turkish texts are crucial for improving the accuracy and robustness of NLP models. Further investigation into the impact of different pre-training strategies, including multilingual and domain-specific approaches, is needed to optimize model performance for historical Turkish. The development of novel techniques to address the specific challenges presented by historical texts, such as noisy OCR data and varying linguistic features, is another key area of focus. Finally, there’s a significant opportunity to explore the integration of historical Turkish NLP with other digital humanities tools and resources to facilitate more comprehensive historical research. This interdisciplinary approach could uncover valuable insights into various fields of study, such as history, literature, and linguistics, broadening the reach and impact of this research significantly.

More visual insights

More on tables

Features	The OTA-BOUN Treebank
Num. of Sentences	514
Num. of Tokens	8,794
Avg. Token Count Per Sentence	17.10
Num. of Unique POS Tags	16
Num. of Unique Morphological Features	52
Num. of Unique Dependencies	40

🔼 Table 2 presents a detailed statistical overview of the Ottoman Turkish-Boğaziçi University (OTA-BOUN) historical treebank, a valuable linguistic resource for researchers working with historical Turkish. The table summarizes key features of the treebank, providing insights into the nature and characteristics of the data. Specifically, it provides the total number of sentences, tokens, and the average token count per sentence, giving a sense of the corpus size and sentence length distribution. It also presents the number of unique parts-of-speech (POS) tags and unique morphological features, offering insights into the richness and complexity of the language represented in the treebank. Finally, it includes the number of unique dependency relations, which are crucial for understanding the grammatical structures captured within the treebank.
Table 2: Some statistics of the OTA-BOUN historical Turkish treebank

Relation Type	Count	%	Relation Type	Count	%
acl	348	3.95	dislocated	5	0.06
advcl	197	2.24	fixed	6	0.07
advmod	396	4.49	flat	87	0.99
advmod:emph	87	0.99	goeswith	5	0.06
amod	620	7.04	iobj	26	0.30
appos	2	0.02	mark	27	0.31
aux	39	0.44	nmod	137	1.55
case	257	2.92	nmod:poss	746	8.47
cc	228	2.59	nsubj	507	5.75
cc:preconj	12	0.14	nsubj:pass	22	0.25
ccomp	120	1.36	nummod	57	0.65
compound	76	0.86	obj	557	6.32
compound:lvc	246	2.79	obl	873	9.91
compound:redup	33	0.37	obl:agent	4	0.05
conj	607	6.89	orphan	4	0.05
cop	48	0.54	parataxis	10	0.11
csubj	42	0.48	punct	1207	13.70
dep	14	0.16	root	514	5.83
det	508	5.76	vocative	7	0.08
discourse	82	0.93	xcomp	49	0.56

🔼 This table presents a detailed breakdown of the dependency relations found within the Ottoman Turkish Universal Dependencies (OTA-BOUN) treebank. It lists each dependency relation type, its frequency count, and its percentage relative to the total number of dependency relations in the treebank. This provides valuable insights into the syntactic structures characteristic of historical Turkish, such as the prevalence of particular types of relationships between words in sentences.
Table 3: Counts and percentages of dependency relation types in the OTA-BOUN treebank
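Statistics of this kind (sentence and token counts, relation-type counts and percentages) can be computed directly from a treebank file. The following is a minimal sketch that assumes the treebank is stored in the standard ten-column CoNLL-U format used by Universal Dependencies; the function name and the two toy sentences are hypothetical and are not drawn from OTA-BOUN.

```python
from collections import Counter


def treebank_stats(conllu_text):
    """Return (sentence count, token count, relation Counter) for CoNLL-U text."""
    sentences = 0
    tokens = 0
    relations = Counter()
    in_sentence = False
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line:                 # a blank line ends the current sentence
            in_sentence = False
            continue
        if line.startswith("#"):     # sentence-level comment line
            continue
        cols = line.split("\t")
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
        if not cols[0].isdigit():
            continue
        if not in_sentence:
            sentences += 1
            in_sentence = True
        tokens += 1
        relations[cols[7]] += 1      # column 8 of CoNLL-U is DEPREL
    return sentences, tokens, relations


# Hypothetical toy sentences; columns follow the CoNLL-U order
# ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC.
toy = [
    [("1", "Ali", "Ali", "PROPN", "_", "_", "2", "nsubj", "_", "_"),
     ("2", "geldi", "gel", "VERB", "_", "_", "0", "root", "_", "_"),
     ("3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_")],
    [("1", "Kitabı", "kitap", "NOUN", "_", "_", "2", "obj", "_", "_"),
     ("2", "okudu", "oku", "VERB", "_", "_", "0", "root", "_", "_"),
     ("3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_")],
]
sample = "\n\n".join("\n".join("\t".join(row) for row in sent) for sent in toy)

n_sent, n_tok, rels = treebank_stats(sample)
print(f"sentences={n_sent} tokens={n_tok} avg={n_tok / n_sent:.2f}")
for rel, count in rels.most_common():
    print(f"{rel}\t{count}\t{100 * count / n_tok:.2f}")
```

Each percentage here is the relation's count divided by the number of syntactic tokens, which is an assumption about how the table's percentages were derived.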

	TR-BOUN	IMST-UD	OTA-BOUN
Avg. token count per sentence	12.41	10.01	17.10
conj (%)	5.66	4.96	6.89
compound:lvc (%)	1.0	0.90	2.79
acl (%)	2.78	2.64	3.95

🔼 Table 4 compares the OTA-BOUN historical Turkish treebank with two widely used modern Turkish treebanks (TR-BOUN and IMST-UD) on four metrics: the average number of tokens per sentence and the percentages of conjunct (conj), light verb compound (compound:lvc), and adnominal clause (acl) dependency relations. OTA-BOUN has longer sentences (17.10 tokens on average, against 12.41 and 10.01) and a higher share of all three relation types, which indicates how historical Turkish differs from modern Turkish in sentence structure and grammatical constructions.
Table 4: Comparison of historical Turkish treebank with the two most frequently used modern Turkish treebanks in terms of token and dependency metrics

Expected Text	Extracted Text	Error Analysis
Dilberün her handesi bin can bağışlar e aşuya	Dil-beruñ her òandesi biñ cÀn baàışlar èÀşıúa	Diacritical Encoding Error: Unicode normalization failure in historical Turkish diacritics and characters. The system incorrectly encodes special characters ’ñ’ and ’À’, resulting in ambiguity. Technical cause: Non-standardized Unicode point mapping for Ottoman-specific diacritics.
Bu mutabakatla beraber, keşf edilen eski yazıldığı veçhile Türkçe karşılığı lafzıdır.	Bu mutabakatle beraber, keşf edilen eski ya WU J. ıS i J e Ha Tı Ye Kef Lam Mim Nun te de yazıldığı veçhile Türkçe karşılığı lafzıdır.	Script Conversion Error: Critical failure in Arabic-Latin script conversion pipeline. OCR system’s inability to properly map Arabic script ligatures to Latin characters due to contextual shape variations. Root cause: Inadequate handling of Unicode ranges U+0600-U+06FF.
GÜRİZ yahut GÜRİZGAH:	G Ü R ÎZ : , yâhut G Ü R İZ G Â H :	Word Segmentation Error: Tokenization algorithm failure in word recognition. Improper word boundary detection caused by missing morphological analysis support. Technical impact: Loss of semantic unity in words.
HİSÂB-ı CÜMEL: Ebced hisâbının diğer adıdır	HtSÂB-t C Ü M EL: Ebced hi-sâbının diğer adıdır	Character Substitution Error: Systematic misclassification of Turkish ’İ’ character as ’t’. Error stems from inadequate training data representation of Turkish-specific uppercase dotted ’İ’. Technical cause: Unicode point confusion between U+0130 and U+0074.
İran şâirlerinden: Şevket Ferâhî’nin	İran şâirlerinden: J i j Z j S ’ C j	Mixed Script Error: Complete text fragmentation due to script detection failure. System’s inability to maintain consistent character encoding across different writing systems. Root cause: Inadequate handling of bi-directional text rendering.

🔼 This table presents examples of text extraction errors encountered during the digital conversion of historical Turkish documents. It showcases the types of errors, their causes (e.g., diacritical encoding, script conversion, word segmentation, character substitution, mixed script issues), and analysis of why these errors occurred during the digitization process. Each row details a specific instance with the original (expected) text, the extracted text containing errors, and a breakdown explaining the nature and source of the error, including references to Unicode issues and challenges related to processing historical Turkish script variations.
Table 5: Analysis of sample text extraction errors in digital conversion of historical Turkish documents
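Several rows of Table 5 attribute errors to Unicode handling. The snippet below is a small illustration, using only Python's standard `unicodedata` module, of what Unicode normalization can and cannot repair. It is not the paper's extraction pipeline, and the example strings are chosen here for illustration.

```python
import unicodedata


def code_points(text):
    """List the code points of a string as U+XXXX labels."""
    return [f"U+{ord(ch):04X}" for ch in text]


# 'ñ' written as 'n' plus COMBINING TILDE (U+0303), as an extraction tool may emit it.
decomposed = "bin\u0303"
composed = unicodedata.normalize("NFC", decomposed)
print(code_points(decomposed), "->", code_points(composed))

# NFC merges the base letter and the combining mark into the single code point U+00F1.
print(len(decomposed), len(composed))

# Turkish dotted capital 'İ' is U+0130; 't' is U+0074. They are unrelated characters,
# so normalization cannot turn a misrecognized 't' back into 'İ'.
dotted_i = "\u0130"
print(unicodedata.name(dotted_i))
print(unicodedata.normalize("NFC", "t") == dotted_i)

# Lowercasing 'İ' with the default (non-Turkish) rules yields 'i' plus U+0307,
# a frequent source of mismatches when comparing Turkish strings.
print(code_points(dotted_i.lower()))
```

Normalization only unifies different encodings of the same character. Substitution errors such as 'İ' read as 't' need correction at the recognition stage or with a language-aware post-correction step.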

Model Descriptions
`BERTurk+MilliyetNER`	BERTurk fine-tuned only using MilliyetNER, a large NER dataset for modern Turkish.
`BERTurk+MilliyetNER+HisTR`	BERTurk+MilliyetNER further fine-tuned using HisTR, the small dataset for historical Turkish.
`BERTurk+HisTR`	BERTurk fine-tuned only using HisTR.
`mBERT+WikiANN+HisTR`	mBERT fine-tuned on WikiANN, a large multilingual NER dataset, and further fine-tuned using HisTR.
`mBERT+HisTR`	mBERT fine-tuned only using HisTR.
`TURNA+MilliyetNER+HisTR`	TURNA fine-tuned on MilliyetNER and further fine-tuned using HisTR.

Model Performance
	HisTR Development Set			Ruznamçe Test Set
Name	Prec.	Recall	F1	Prec.	Recall	F1
`BERTurk+MilliyetNER`	75.39	71.99	73.65	53.84	61.95	57.58
`BERTurk+MilliyetNER+HisTR`	90.26	92.17	91.21	59.92	64.03	61.91
`BERTurk+HisTR`	88.63	91.57	90.07	54.49	61.75	57.89
`mBERT+WikiANN+HisTR`	80.73	87.05	83.77	41.17	41.93	41.49
`mBERT+HisTR`	83.95	88.25	86.05	43.19	42.20	42.69
`TURNA+MilliyetNER+HisTR`	77.62	80.26	78.92	57.61	41.58	48.30

🔼 Table 6 presents the performance of NER models built on three pre-trained language models (BERTurk, mBERT, and TURNA) on the HisTR dataset, which targets historical Turkish texts and poses challenges not encountered in modern Turkish NER. Results are given for the in-domain development set, whose texts resemble the training data, and for the out-of-domain Ruznamçe test set, whose texts have different characteristics. Each model was fine-tuned with several combinations of training data. Precision, recall, and F1 are reported for each setting, showing how well each model generalizes across the two sets.
Table 6: The overall performance of BERTurk, mBERT, and TURNA NER models on the in-domain development and out-of-domain test sets of the HisTR dataset when using different combinations of fine-tuning sets
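The F1 column is the harmonic mean of precision and recall, F1 = 2PR / (P + R). As a quick check, the sketch below recomputes F1 for the development-set rows of Table 6 from their reported precision and recall. The tolerance of 0.02 is an assumption made here to absorb rounding in the published two-decimal figures.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) on the HisTR development set, copied from Table 6.
dev_rows = {
    "BERTurk+MilliyetNER": (75.39, 71.99, 73.65),
    "BERTurk+MilliyetNER+HisTR": (90.26, 92.17, 91.21),
    "BERTurk+HisTR": (88.63, 91.57, 90.07),
    "mBERT+WikiANN+HisTR": (80.73, 87.05, 83.77),
    "mBERT+HisTR": (83.95, 88.25, 86.05),
    "TURNA+MilliyetNER+HisTR": (77.62, 80.26, 78.92),
}

TOLERANCE = 0.02  # assumed allowance for rounding in the reported values

for name, (precision, recall, reported) in dev_rows.items():
    recomputed = f1_score(precision, recall)
    status = "ok" if abs(recomputed - reported) <= TOLERANCE else "differs"
    print(f"{name}: recomputed={recomputed:.2f} reported={reported:.2f} {status}")
```

Because F1 is a harmonic mean, it sits closer to the lower of the two inputs, which is why a model with high precision but low recall, such as TURNA on the Ruznamçe test set, still gets a low F1.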

Model Descriptions
STEPS_BERTurk+TR_BOUN	STEPS parser with BERTurk, fine-tuned only using TR_BOUN, a large dependency treebank for modern Turkish.
STEPS_BERTurk+TR_BOUN+OTA_BOUN	STEPS_BERTurk+TR_BOUN further fine-tuned using OTA_BOUN, a small treebank for historical Turkish.
STEPS_BERTurk+OTA_BOUN	STEPS parser with BERTurk, fine-tuned only using OTA_BOUN.
STEPS_mBERT+TR_BOUN	STEPS parser with mBERT, fine-tuned only using TR_BOUN.
STEPS_mBERT+TR_BOUN+OTA_BOUN	STEPS_mBERT+TR_BOUN further fine-tuned using OTA_BOUN.
STEPS_mBERT+OTA_BOUN	STEPS parser with mBERT, fine-tuned only using OTA_BOUN.

Model Performance
		OTA-BOUN Test Set (Historical Turkish)	TR-BOUN Test Set (Modern Turkish)
Name	Tra. Size	UAS	LAS	UPOS F1
STEPS_BERTurk+TR_BOUN	7,803	79.92	71.29	94.76
STEPS_BERTurk+TR_BOUN+OTA_BOUN	7,917	81.51	73.79	94.98
STEPS_BERTurk+OTA_BOUN	114	68.87	59.70	91.56
STEPS_mBERT+TR_BOUN	7,803	72.96	64.32	92.26
STEPS_mBERT+TR_BOUN+OTA_BOUN	7,917	75.86	67.87	93.12
STEPS_mBERT+OTA_BOUN	114	61.43	49.62	88.68

🔼 Table 7 presents the results of dependency parsing and POS tagging experiments using the STEPS parser with BERTurk and mBERT models. It shows the performance of different model configurations (fine-tuned on various combinations of the OTA-BOUN and TR-BOUN treebanks) on both historical (OTA-BOUN) and modern (TR-BOUN) Turkish datasets. The table includes metrics such as unlabeled attachment score (UAS), labeled attachment score (LAS), and Universal Part-of-Speech (UPOS) F1-score, providing a comprehensive evaluation of the models’ ability to accurately identify dependency relations and POS tags in both historical and modern Turkish text.
Table 7: The overall performance of BERTurk- and mBERT-based models on the test sets of the OTA-BOUN and TR-BOUN treebanks, presented for different combinations of fine-tuning sets. UAS and LAS represent unlabeled and labeled attachment scores, respectively, which are used to evaluate the models’ performance in constructing dependency relations. UPOS F1 refers to the F1 score of the sequence tagger models in predicting the universal POS tags of words in the corresponding test sets.
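To make the two attachment scores concrete, here is a minimal sketch of how UAS and LAS are computed for one sentence. It assumes every token, including punctuation, counts toward the score; evaluation scripts differ on that point, and the paper's exact setup is not stated in this summary. The gold and predicted analyses are hypothetical.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) as percentages.

    gold and predicted are equal-length lists of (head_index, relation) pairs,
    one pair per token. UAS counts tokens whose head is correct; LAS counts
    tokens whose head and relation label are both correct.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    total = len(gold)
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    las_hits = sum(g == p for g, p in zip(gold, predicted))
    return 100 * uas_hits / total, 100 * las_hits / total


# Hypothetical four-token sentence: (head index, dependency relation) per token.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
predicted = [(2, "nsubj"), (0, "root"), (2, "obl"), (1, "punct")]

uas, las = attachment_scores(gold, predicted)
print(f"UAS={uas:.2f} LAS={las:.2f}")
```

The third token has the right head but the wrong label, so it counts for UAS only. The fourth has the wrong head, so it counts for neither. LAS can therefore never exceed UAS, which matches every row of Table 7.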
