TL;DR#
Constructing a Generalized Knowledge Graph (GKG), which spans knowledge graphs, event knowledge graphs, and commonsense knowledge graphs, is vital for NLP. Current methods build these graphs separately, which duplicates resources and misses the knowledge they share, while the task-specific differences between graph types make unification difficult. This paper addresses the issue with a unified construction framework, gathering and categorizing data from 15 sub-tasks across the three graph types.
The proposed GKG-LLM is trained with a three-stage curriculum-learning fine-tuning strategy that iteratively injects knowledge from the different graph types: it begins with KG data to build foundations, enhances the model with EKG data, and generalizes with CKG data. Experiments show that GKG-LLM improves construction performance on in-domain, out-of-distribution (OOD), and counter-task data. The framework is parameter-efficient and leverages the knowledge shared across graph types, advancing GKG construction and enabling more unified NLP workflows.
Key Takeaways#
Why does it matter?#
This paper addresses a critical need for a unified approach to GKG construction, offering a resource-efficient and holistic solution. The GKG-LLM framework’s success in handling diverse graph types and datasets, including OOD data, highlights its generalizability and potential for broader application. This will influence future research towards more integrated and adaptable knowledge graph technologies.
Visual Insights#
🔼 Figure 1 illustrates the concept of a Generalized Knowledge Graph (GKG) and its constituent parts: Knowledge Graphs (KGs), Event Knowledge Graphs (EKGs), and Commonsense Knowledge Graphs (CKGs). The left side shows a visual representation of the GKG, highlighting its interconnected nature. The right side provides concrete examples of triples (subject, predicate, object) from each graph type, demonstrating how they build upon one another in terms of complexity and scope. KGs represent facts about entities and their relationships. EKGs extend this by modeling events and their temporal relationships. CKGs further generalize these concepts, focusing on abstract commonsense relationships.
Figure 1: An illustration of several triples and graphs. The left half shows a generalized knowledge graph. The right half includes specific examples of triples from KG, EKG, CKG and demonstrates their progressive relationship.
| Graphs | Tasks | Datasets | GPT-4 | Claude-3 | Gemini-1.5 | LlaMA-2-GKG | LlaMA-3-Instruct | Single-SFT | Integrated-SFT | GKG-LLM |
|---|---|---|---|---|---|---|---|---|---|---|
| KG | SRE | NYT | 64.94 | 66.76 | 68.59 | 78.18 | 55.12 | 74.39 | 79.32 | 80.63 |
| | FRE | FewRel | 26.28 | 27.45 | 30.20 | 89.45 | 22.64 | 78.65 | 86.74 | 90.48 |
| | | TACRED | 18.85 | 20.23 | 22.43 | 86.71 | 12.74 | 70.66 | 84.66 | 88.96 |
| | DRE | DOCRED | 38.84 | 36.28 | 42.63 | 83.18 | 34.63 | 74.53 | 83.61 | 85.71 |
| | JE&RE | FewRel | 6.32 | 5.44 | 7.52 | 42.05 | 3.20 | 26.76 | 30.56 | 34.32 |
| | | NYT | 6.22 | 5.85 | 8.36 | 53.33 | 0.0 | 40.16 | 48.66 | 52.27 |
| EKG | SED | ACE2005 | 17.50 | 8.57 | 22.40 | 32.47 | 0.0 | 22.74 | 34.32 | 80.63 |
| | DED | WIKIEVENTS | 16.54 | 9.14 | 14.87 | 24.87 | 18.62 | 29.59 | 23.84 | 39.86 |
| | DEAE | WIKIEVENTS | 42.58 | 53.41 | 47.69 | 70.46 | 41.76 | 63.38 | 69.30 | 75.22 |
| | | RAMS | 13.84 | 5.70 | 38.49 | 48.33 | 30.74 | 53.43 | 52.09 | 63.62 |
| | ETRE | MATRES | 39.97 | 36.62 | 38.51 | 62.94 | 22.79 | 37.91 | 44.26 | 71.51 |
| | | ESL | 64.24 | 47.65 | 42.18 | 68.96 | 21.67 | 74.06 | 67.63 | 75.33 |
| | | TB-Dense | 43.73 | 36.58 | 42.43 | 52.89 | 36.55 | 49.30 | 51.23 | 53.54 |
| | | Causal-TB | 6.67 | 8.01 | 8.74 | 42.79 | 16.43 | 37.35 | 49.83 | 45.26 |
| | | MAVEN-ERE | 43.80 | 21.73 | 42.10 | 71.55 | 40.29 | 37.35 | 75.44 | 81.95 |
| | | TCR∗ | 15.43 | 18.74 | 25.34 | 24.88 | 24.71 | 20.68 | 22.09 | 26.45 |
| | ECRE | ESL | 28.57 | 19.26 | 55.21 | 75.33 | 26.33 | 62.92 | 78.74 | 84.89 |
| | | MAVEN-ERE | 51.98 | 11.36 | 43.38 | 76.48 | 13.37 | 78.91 | 88.59 | 90.18 |
| | | Causal-TB∗ | 39.67 | 41.23 | 43.44 | 33.94 | 30.02 | 48.41 | 48.80 | 55.79 |
| | ESRE | HiEve | 38.81 | 30.92 | 48.83 | 55.60 | 48.61 | 57.64 | 58.01 | 58.61 |
| | | MAVEN-ERE | 40.09 | 13.12 | 38.09 | 44.37 | 33.49 | 39.11 | 37.30 | 48.49 |
| CKG | NER | CoNLL | 15.94 | 14.46 | 18.27 | 77.50 | 15.60 | 64.74 | 70.53 | 82.30 |
| | AG | CNNDM | 30 | 28 | 22 | 36 | 18 | 35 | 35 | 45 |
| | | XSum | 33 | 26 | 29 | 28 | 9 | 24 | 30 | 38 |
| | LI | SNLI | 51.26 | 47.56 | 60.38 | 69.51 | 44.50 | 87.09 | 89.35 | 89.03 |
| | | MNLI | 81.80 | 39.33 | 48.80 | 58.97 | 53.70 | 86.78 | 84.62 | 86.35 |
| | TC | R8∗ | 72.26 | 36.43 | 66.58 | 65.27 | 58.89 | 28.83 | 58.64 | 69.33 |
| | | R52 | 82.18 | 83.75 | 80.63 | 94.16 | 29.68 | 89.02 | 88.81 | 90.34 |
| Counter | NLG | WebNLG | 78 | 65 | 76 | 83 | 15 | 80 | 80 | 85 |
| Average Performance | | | 38.25 | 29.81 | 39.07 | 59.70 | 26.83 | 52.97 | 60.41 | 67.90 |
🔼 This table compares the performance of different large language models (LLMs) on the sub-tasks used for generalized knowledge graph (GKG) construction. It reports results for GPT-4, Claude-3, Gemini-1.5, LlaMA-2-GKG, and LlaMA-3-Instruct alongside the authors' Single-SFT, Integrated-SFT, and GKG-LLM methods across the sub-task datasets. The best and second-best results for each sub-task are highlighted. The table also marks which datasets are out-of-distribution (OOD) and notes that certain tasks are evaluated with the ROUGE-L metric (as a percentage) instead of the standard F1 score. The GKG-LLM column presents the results of the authors' proposed three-stage fine-tuning method.
Table 1: Performance comparison across various datasets and tasks. The best result for each sub-task is highlighted in bold, while the second-best result is underlined. OOD datasets are marked with ∗. † means the task is evaluated with the ROUGE-L metric (percentage). The results for GPT-4, Claude-3, and Gemini-1.5 are obtained via their respective APIs. LlaMA-2-GKG, LlaMA-3-Instruct, Single-SFT, and Integrated-SFT are implemented by us. The GKG-LLM column represents the final model obtained after three-stage tuning.
In-depth insights#
Unified GKG-LLM#
The ‘Unified GKG-LLM’ concept, as presented, embodies a significant advancement in knowledge representation and utilization. The core idea is to integrate Knowledge Graphs (KG), Event Knowledge Graphs (EKG), and Commonsense Knowledge Graphs (CKG) into a single framework. This unification addresses the limitations of constructing these graphs independently, which leads to redundant resource consumption and fails to exploit their inherent interconnections. By unifying them into a generalized knowledge graph (GKG) and building a Large Language Model (LLM) around it, the framework offers a more holistic approach to knowledge acquisition and application. One key advantage lies in the potential for improved parameter efficiency, as knowledge shared across KG, EKG, and CKG can be jointly learned. Moreover, the GKG facilitates knowledge transfer, enabling insights from one graph type to enhance the construction and reasoning capabilities of the others. The challenge, however, lies in managing the task-specific differences that originally led to the separate development of these graphs.
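To make the unification concrete, here is a minimal sketch of how triples from the three graph types can share one representation; the class name, fields, and example triples are illustrative assumptions rather than the paper's actual schema.

```python
from dataclasses import dataclass

@dataclass
class GKGTriple:
    """A unified triple; `graph_type` records which graph family it comes from."""
    head: str
    relation: str
    tail: str
    graph_type: str  # "KG", "EKG", or "CKG"

# Illustrative instances mirroring the progression shown in Figure 1 (invented examples).
triples = [
    GKGTriple("Albert Einstein", "born_in", "Ulm", "KG"),                     # entity-level fact
    GKGTriple("earthquake strikes", "before", "buildings collapse", "EKG"),   # event-level relation
    GKGTriple("person drops a glass", "causes", "glass breaks", "CKG"),       # commonsense relation
]
```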
Curriculum Tuning#
Curriculum tuning is the key training strategy used to refine the LLM: the model is trained gradually, starting with simpler tasks and then introducing harder ones. Here the curriculum follows the three graph types: the KG stage builds foundational extraction skills, the EKG stage enhances event-specific abilities, and the CKG stage achieves generalization. The composition of the stage datasets is therefore central, and diversity in the instructions attached to each sample further improves performance across tasks. A minimal sketch of the staged procedure follows.
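The sketch below illustrates staged fine-tuning in the paper's K-E-C order, assuming a causal LM (e.g., already wrapped with adapters) and stage datasets `kg_data`, `ekg_data`, and `ckg_plus_counter_data` built from the unified format; the function and variable names are illustrative, not the authors' code.

```python
import torch
from torch.utils.data import DataLoader

def tune_stage(model, dataset, lr=2e-4, epochs=1, batch_size=8):
    """One curriculum stage: keep training the same parameters on new data."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    model.train()
    for _ in range(epochs):
        for batch in loader:
            loss = model(**batch).loss  # causal-LM loss over instruction + output tokens
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model  # parameters are handed to the next stage (G-Micro -> G-Mid -> GKG-LLM)

# Stage order K-E-C: KG empowerment, EKG enhancement, CKG generalization (+ counter task).
# for stage_data in (kg_data, ekg_data, ckg_plus_counter_data):
#     model = tune_stage(model, stage_data)
```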
GKG Data Fusion#
GKG data fusion is an exciting direction, offering ways to combine different knowledge sources. It tackles challenges such as data heterogeneity and semantic inconsistency to create a richer, unified understanding. Fusion methods could involve entity resolution, relation alignment, and reasoning integration. Potential benefits include more accurate knowledge inference and improved downstream task performance. Challenges remain in scaling these techniques to large, noisy datasets and in ensuring the interpretability of fused knowledge graphs.
LoRA+ Tuning#
The paper uses LoRA+ tuning, a refinement of LoRA, for efficient fine-tuning within the GKG framework. LoRA+ assigns separate learning rates (η_A and η_B) to the two low-rank matrices, which accelerates convergence and adapts better to the different GKG sub-tasks. The study varies η_A and η_B and finds that choosing them carefully is crucial: the heatmap in Figure 10 shows that the right combination noticeably improves performance while keeping the number of trainable parameters small.
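A minimal sketch of the LoRA+ idea, i.e., giving the low-rank B matrices a larger learning rate than the A matrices. The base model (gpt2 as a stand-in), rank, target modules, and the η_A/ratio values are illustrative assumptions, not the paper's configuration.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in for the actual base LLM
lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"], task_type="CAUSAL_LM")
model = get_peft_model(base, lora_cfg)

# LoRA+ core: eta_B = ratio * eta_A with ratio > 1 (values below are illustrative).
eta_A, ratio = 2e-4, 16
param_groups = [
    {"params": [p for n, p in model.named_parameters() if "lora_A" in n], "lr": eta_A},
    {"params": [p for n, p in model.named_parameters() if "lora_B" in n], "lr": ratio * eta_A},
]
optimizer = torch.optim.AdamW(param_groups)  # only the adapter matrices are trained
```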
GKG: Future AI#
GKG (Generalized Knowledge Graph) technology has strong potential for future AI. GKG facilitates a more holistic representation of knowledge, integrating diverse sources such as knowledge graphs, event knowledge graphs, and commonsense knowledge graphs. This unified framework enables AI systems to reason across domains, improving decision-making and problem-solving. GKG's ability to capture relationships between entities, events, and concepts, combined with advances in large language models (LLMs), creates opportunities for developing more robust and adaptable AI systems. The development of GKG-LLMs is helping to enhance the knowledge extraction and representation capabilities of foundation models in specialized domains, especially healthcare and other specialized areas.
More visual insights#
More on figures
🔼 Figure 2 shows the distribution of the datasets used for training and evaluating the Generalized Knowledge Graph (GKG) construction task. It visually represents the various sub-tasks involved in constructing three types of graphs: Knowledge Graphs (KGs), Event Knowledge Graphs (EKGs), and Commonsense Knowledge Graphs (CKGs). Each sub-task is associated with one or more datasets, indicating the data sources utilized for that specific task. This figure provides a comprehensive overview of the data utilized for training and testing in the study, highlighting the diverse range of tasks and datasets involved in the GKG construction process.
Figure 2: The illustration of the data distribution for all GKG sub-tasks.
🔼 This figure illustrates the three-stage curriculum learning framework used to train the GKG-LLM model. The top section shows the unified GKG dataset (𝒟G) comprising data from KG, EKG, and CKG sub-tasks, as well as a counter task dataset. The bottom section details the three training stages: 1) KG Empowerment: foundational skills are built using KG datasets; 2) EKG Enhancement: specific capabilities are enhanced using EKG datasets; and 3) CKG Generalization: generalization is achieved using CKG and counter task datasets. Thick arrows indicate the transfer of model parameters between stages, starting from a base model to create the G-Micro, G-Mid, and finally, the GKG-LLM model.
Figure 3: Three-stage curriculum learning tuning framework of GKG-LLM. The upper part represents the GKG dataset 𝒟G, consisting of the unified datasets. The lower part shows the three stages of GKG training: the KG empowerment stage using the KG datasets to build foundational skills, the EKG enhancement stage using the EKG datasets to enhance specific capabilities, and the CKG generalization stage using the CKG datasets and the counter task dataset to achieve generalization of the GKG-LLM capabilities. The thick arrows between the stages represent the delivery of model parameters from the base model to each version of GKG-LLM.
🔼 This figure displays the average performance across all tasks for six different fine-tuning orders: K-E-C (KG, EKG, then CKG), K-C-E (KG, CKG, then EKG), E-K-C (EKG, KG, then CKG), E-C-K (EKG, CKG, then KG), C-K-E (CKG, KG, then EKG), and C-E-K (CKG, EKG, then KG). The results vary across the different sequences, demonstrating the impact of the chosen order on overall model performance. The K-E-C sequence used in the main study achieves the best overall performance, supporting the study's methodology and indicating that a progressive training approach, starting with simpler knowledge graphs and progressing to more complex ones, yields the best results.
Figure 4: Results of different fine-tuning orders. “K-E-C” means the fine-tuning order is KG, EKG and CKG. The following sets of experiments are similar to this one.
🔼 Figure 5 displays the performance comparison between using a single type of graph for fine-tuning and the three-stage fine-tuning approach of GKG-LLM. The graph shows performance across KG, EKG, and CKG sub-tasks. It demonstrates that the three-stage curriculum learning process (KG empowerment, EKG enhancement, CKG generalization) progressively improves the model’s capability across all three graph types, significantly outperforming the single-graph fine-tuning method.
Figure 5: Fine-tuning with a single type of graph and the performance of the different intermediate versions of GKG-LLM.
🔼 This figure displays the impact of varying training dataset sizes on the model’s performance. The x-axis represents the percentage of the complete dataset used for training (10%, 20%, 40%, 60%, 80%, and 100%). The y-axis shows the average F1 score achieved across all tasks. The figure demonstrates the relationship between the amount of training data and the model’s performance across three types of knowledge graph sub-tasks (KG, EKG, CKG), and an overall GKG average. It helps illustrate the point of diminishing returns in terms of performance gain as training data increases.
Figure 6: Results of training with different proportions of complete data.
🔼 Figure 7 presents the average performance of various models on out-of-distribution (OOD) datasets. The OOD datasets used are TCR, Causal-TB, and R8, each representing a distinct and challenging test scenario. The figure displays the average F1 score (or other relevant metric) achieved by each model on these three datasets, allowing for a direct comparison of their robustness and generalization capabilities in handling unseen data.
Figure 7: The average performance on OOD datasets, consisting of the TCR, Causal-TB, and R8 datasets.
🔼 Figure 8 shows an example data entry from the WIKIEVENTS dataset, illustrating the standardized format used throughout the paper. Each data entry consists of five key fields: ‘ID’ uniquely identifies the specific data point; ‘instruction’ provides the task instructions given to the model; ‘few-shot/zero-shot’ indicates whether few-shot learning examples were provided; ‘input’ contains the input data given to the model for processing; and ‘output’ shows the expected output or ground truth. This standardized format enables the unified processing of diverse sub-tasks within the generalized knowledge graph (GKG) construction framework.
Figure 8: An example from the WIKIEVENTS dataset. It consists of five fields: ID, instruction s_i, few-shot fs / zero-shot zs, input x_i, and output y_i.
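For concreteness, a sketch of the five-field format as a Python dictionary; the field contents below are invented for illustration and are not the actual WIKIEVENTS entry shown in Figure 8.

```python
# Illustrative instance of the unified five-field format (not the real Figure 8 entry).
example = {
    "ID": "WIKIEVENTS_DED_00042",           # hypothetical identifier
    "instruction": "Detect all event triggers in the document and give their event types.",
    "few-shot/zero-shot": "zero-shot",      # whether in-context examples are included
    "input": "The company announced the acquisition on Monday ...",
    "output": "Trigger: announced | Type: Statement; Trigger: acquisition | Type: Transaction",
}
```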
🔼 Figure 9 displays a bar chart comparing the performance of three different model training strategies against the performance of the GKG-LLM model. The three strategies represent progressively more complex training approaches. The chart shows the average F1 scores achieved on three different types of knowledge graph sub-tasks (KG, EKG, CKG) for each model. This helps visualize how a three-stage curriculum learning approach enhances the model’s ability to handle various tasks compared to single-stage training methods.
Figure 9: Comparison of Results by different settings and GKG-LLM.
🔼 This heatmap visualizes the performance of the GKG-LLM model under different hyperparameter settings for the LoRA+ fine-tuning technique. The x-axis represents the value of ηB (eta_B), and the y-axis represents the value of ηA (eta_A). Each cell in the heatmap shows the model’s performance (likely F1-score or accuracy) on the GKG construction task under the corresponding ηA and ηB values. The color intensity represents the performance level, with darker colors indicating better performance. This figure is crucial for determining the optimal hyperparameter combination to maximize the model’s effectiveness.
Figure 10: Heatmap of scores for different η_A and η_B values for our training strategy.
More on tables
| Variation | KG | EKG | CKG | Avg. |
|---|---|---|---|---|
| 𝒫 (full prompts) | 72.06 | 63.42 | 71.48 | 67.90 |
| 𝒫si | 68.46 (-3.60) | 59.34 (-4.08) | 69.10 (-2.38) | 64.33 (-3.57) |
| 𝒫zs | 65.17 (-6.89) | 55.09 (-8.33) | 66.05 (-5.43) | 60.06 (-7.84) |
| 𝒫si+zs | 62.44 (-9.62) | 52.26 (-11.16) | 64.66 (-6.82) | 58.15 (-9.75) |
🔼 This table compares the performance of four different prompting strategies on various tasks. The strategies are: using full prompts (𝒫), a single instruction regardless of diversity (𝒫si), zero-shot prompting only (𝒫zs), and combining a single instruction with zero-shot prompting (𝒫si+zs). The results are presented in terms of evaluation metrics (specified elsewhere in the paper) for each prompt strategy, showing how different instruction methods impact model performance.
Table 2: Performance comparison of different prompt strategies on the evaluation metrics. 𝒫 denotes full prompts, 𝒫si refers to a single instruction regardless of diversity, 𝒫zs represents zero-shot only, and 𝒫si+zs combines single instruction with zero-shot prompting.
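A small sketch of how the four prompt variants in Table 2 could be assembled, assuming a pool of instruction paraphrases per sub-task and optional in-context demonstrations; the function name, templates, and arguments are illustrative, not the paper's exact prompts.

```python
import random

def build_prompt(instructions, demos, x, single_instruction=False, zero_shot=False):
    """Assemble a prompt for one sample.

    instructions: list of paraphrased instructions for the sub-task
    demos: list of (input, output) in-context examples
    x: the current input text
    """
    instr = instructions[0] if single_instruction else random.choice(instructions)
    shots = "" if zero_shot else "\n".join(f"Input: {i}\nOutput: {o}" for i, o in demos)
    return f"{instr}\n{shots}\nInput: {x}\nOutput:".replace("\n\n", "\n")

# The four Table 2 variants:
# P       -> build_prompt(instrs, demos, x)                                  (diverse instructions + demos)
# P_si    -> build_prompt(instrs, demos, x, single_instruction=True)
# P_zs    -> build_prompt(instrs, demos, x, zero_shot=True)
# P_si+zs -> build_prompt(instrs, demos, x, single_instruction=True, zero_shot=True)
```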
| Graphs | Tasks | Datasets | # Train | # Test | sampled? | held-out? | Original Source |
|---|---|---|---|---|---|---|---|
| KG | SRE | NYT | 96,229 | 8,110 | | | Paulus [2017] |
| | FRE | FewRel | 56,576 | 11,775 | | | Han et al. [2018] |
| | | TACRED | 18,448 | 3,325 | | | Alt et al. [2020] |
| | DRE | DOCRED | 61,380 | 6,137 | ✓ | | Yao et al. [2019] |
| | JE&RE | FewRel | 28,288 | 11,775 | ✓ | | |
| | | NYT | 48,114 | 8,110 | ✓ | | |
| EKG | SED | ACE2005 | 3,681 | 409 | | | Grishman et al. [2005] |
| | DED | WIKIEVENTS | 3,586 | 365 | | | Li et al. [2021] |
| | DEAE | WIKIEVENTS | 3,586 | 365 | | | |
| | | RAMS | 7,339 | 761 | | | Ebner et al. [2020] |
| | ETRE | MATRES | 12,216 | 1,361 | | | Ning et al. [2019] |
| | | ESL | 7,652 | 852 | | | |
| | | TB-Dense | 9,257 | 2,639 | | | Han et al. [2019] |
| | | Causal-TB | 5,427 | 603 | | | Mirza and Tonelli [2016] |
| | | MAVEN-ERE | 80,000 | 5,000 | ✓ | | Wang et al. [2022] |
| | | TCR | | 3,515 | | ✓ | Han et al. [2019] |
| | ECRE | ESL | 3,196 | 356 | | | |
| | | MAVEN-ERE | 63,980 | 7,330 | ✓ | | |
| | | Causal-TB | | 318 | | ✓ | |
| | ESRE | HiEve | 12,107 | 1,348 | | | Glavaš et al. [2014] |
| | | MAVEN-ERE | 31,365 | 4,244 | | | |
| CKG | NER | CoNLL | 17,293 | 3,454 | | | Sang and De Meulder [2003] |
| | AG | CNNDM | 51,684 | 11,490 | ✓ | | Chen et al. [2021] |
| | | XSum | 50,666 | 11,334 | ✓ | | Hasan et al. [2021] |
| | LI | SNLI | 50,000 | 10,000 | ✓ | | Camburu et al. [2018] |
| | | MNLI | 50,000 | 10,000 | ✓ | | Hu et al. [2020] |
| | TC | R8 | | 7,674 | | ✓ | Yamada and Shindo [2019] |
| | | R52 | 7,816 | 1,284 | ✓ | | Ge and Moh [2017] |
| Counter | NLG | WebNLG | 26,302 | 6,513 | | | Gardent et al. [2017] |
🔼 This table details the datasets used in the paper, categorized by graph type (KG, EKG, CKG) and sub-task. It shows the number of training and testing samples for each dataset, whether the dataset was sampled to balance the data, whether it was held out (not used during training), and the original source of the dataset.
read the caption
Table 3: Detailed illustration of the 15 sub-task types across 29 datasets, categorized within the three types of graphs, along with a counter-task dataset (WebNLG). # Train and # Test give the number of training and testing samples, respectively. Sampled? indicates whether the dataset is sampled from the original to achieve data balancing. Held-out? indicates whether the dataset is held out from the training phase and used only for evaluation. Original Source cites the original paper.