Breaking Barriers: New Study Advances Chinese Lyric Authorship Attribution

Authorship attribution is a fascinating and challenging task in natural language processing. It becomes even more complex for creative works like lyrics, and especially in Chinese, where public datasets are scarce. A recent study by Yuxin Li, Lorraine Xu, and Meng Fan Wang tackles this problem head-on, proposing a novel approach to cross-genre authorship attribution for Chinese lyrics.

The researchers’ first contribution is the creation of a new, balanced dataset of Chinese lyrics. This dataset spans multiple genres, providing a more comprehensive and diverse corpus for training and testing models. The second contribution is the development and fine-tuning of a domain-specific model, which is then compared against zero-shot inference using the DeepSeek LLM.
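To make that comparison concrete, here is a minimal sketch of what the zero-shot baseline could look like: the LLM is given a lyric and a closed list of candidate lyricists and asked to pick one. The endpoint, model name, and prompt wording are illustrative assumptions rather than the paper's exact setup; DeepSeek does expose an OpenAI-compatible API, which is what this sketch uses.

```python
# Hypothetical zero-shot authorship attribution via the DeepSeek API.
# Prompt wording and candidate-list format are assumptions for illustration.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",               # placeholder credential
    base_url="https://api.deepseek.com",  # DeepSeek's OpenAI-compatible endpoint
)

def zero_shot_attribute(lyric: str, candidates: list[str]) -> str:
    """Ask the LLM to choose the most likely author from a closed set."""
    prompt = (
        "You are an expert on Chinese lyrics.\n"
        f"Candidate lyricists: {', '.join(candidates)}\n"
        f"Lyric:\n{lyric}\n"
        "Answer with exactly one name from the candidate list."
    )
    resp = client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # deterministic output makes evaluation repeatable
    )
    return resp.choices[0].message.content.strip()
```

The fine-tuned side of the comparison would be a standard sequence-classification fine-tune over author labels, which is then scored on the same held-out lyrics.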

The study is built around two central hypotheses. The first is that a fine-tuned model will outperform a zero-shot LLM baseline. The second is that performance will be genre-dependent. The experiments strongly confirm the second hypothesis: structured genres like Folklore & Tradition yield significantly higher attribution accuracy than more abstract genres like Love & Romance.
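Detecting a genre effect like this requires scoring predictions genre by genre rather than pooling them. A small sketch of that stratified evaluation, with invented labels purely for illustration:

```python
# Genre-stratified accuracy: group (genre, gold, prediction) triples by
# genre and score each group separately. All data below is made up.
from collections import defaultdict

def per_genre_accuracy(records):
    """records: iterable of (genre, true_author, predicted_author)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for genre, true_author, pred in records:
        totals[genre] += 1
        hits[genre] += int(pred == true_author)
    return {g: hits[g] / totals[g] for g in totals}

records = [
    ("Folklore & Tradition", "A", "A"),
    ("Folklore & Tradition", "B", "B"),
    ("Love & Romance", "A", "B"),
    ("Love & Romance", "B", "B"),
]
print(per_genre_accuracy(records))
# {'Folklore & Tradition': 1.0, 'Love & Romance': 0.5}
```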

The first hypothesis, however, receives only partial support. Fine-tuning does improve robustness and generalization on Test1, which involves real-world data and difficult genres. But on Test2, a smaller, synthetically augmented set, the gains are limited or ambiguous. The researchers attribute this to the design limitations of Test2, such as label imbalance, shallow lexical differences, and narrow genre sampling.
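Label imbalance in particular makes systems hard to tell apart: on a skewed test set, a trivial majority-class predictor already scores high, leaving little headroom between the fine-tuned model and the zero-shot baseline. A quick back-of-the-envelope illustration, with invented counts:

```python
# With a hypothetical 80/15/5 author split, always guessing the majority
# author already reaches 80% accuracy, compressing the visible gap
# between competing systems. Counts are invented for illustration.
from collections import Counter

labels = ["A"] * 80 + ["B"] * 15 + ["C"] * 5
majority_baseline = Counter(labels).most_common(1)[0][1] / len(labels)
print(f"Majority-class accuracy: {majority_baseline:.2f}")  # 0.80
```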

This work is a significant step forward in the field of authorship attribution. It establishes the first benchmark for cross-genre Chinese lyric attribution, highlights the importance of genre-sensitive evaluation, and provides a public dataset and analytical framework for future research. The researchers also offer recommendations for future studies, including enlarging and diversifying test sets, reducing reliance on token-level data augmentation, balancing author representation across genres, and investigating domain-adaptive pretraining.

This study not only advances our understanding of authorship attribution but also underscores the importance of careful dataset design and evaluation in natural language processing tasks.
