
We are pleased to announce the publication of our new paper, Beyond Perplexity: A Multi-Faceted Analysis of a Novel Densely Connected Transformer. This study examines a central question in contemporary language modeling research: if a Transformer decoder is redesigned so that each layer can access the representations produced by all previous layers, can this richer internal connectivity improve performance in a meaningful way?
The idea is appealing for a clear reason. In several areas of deep learning, dense connectivity has been associated with improved feature reuse, shorter information paths, and potentially more favorable optimization behavior. In our paper, we test whether this intuition also holds for decoder-only autoregressive Transformers, the architectural family underlying many modern large language models.
To address this question rigorously, we designed a methodology aimed at isolating the effect of connectivity from other factors that often confound architectural comparisons. We compared a standard baseline Transformer decoder with a densely connected decoder on two well-known benchmarks for language modeling, Penn Treebank and WikiText-2. The comparison was carried out under two controlled fairness regimes. In the first, both model families were evaluated under the same training recipe, with shared optimization settings and learning-rate search. In the second, the comparison was constrained by the same parameter budget, so that the dense model could not exceed the baseline in parameter count. This distinction was important because it allowed us to separate the possible effect of dense historical connectivity from the simpler effect of adding more capacity.
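The same-parameter-budget regime can be made concrete with a rough back-of-the-envelope count. The sketch below is our own simplification, not the paper's accounting: it counts only attention and feed-forward weight matrices (no biases, embeddings, or layer norms), and assumes the dense variant pays for one extra projection per layer that maps the concatenated history back to the model width. It then shrinks the dense model's width until it fits under the baseline's budget.

```python
def layer_params(d_model, d_ff):
    """Rough per-layer count: attention (4 * d^2 for Q, K, V, O)
    plus feed-forward (2 * d * d_ff). Biases and norms ignored."""
    return 4 * d_model ** 2 + 2 * d_model * d_ff

def dense_layer_params(d_model, d_ff, layer_idx):
    """Dense variant adds a projection from the concatenated history
    ((layer_idx + 1) * d_model wide) back down to d_model."""
    return layer_params(d_model, d_ff) + (layer_idx + 1) * d_model ** 2

def total_params(d_model, d_ff, n_layers, dense=False):
    if dense:
        return sum(dense_layer_params(d_model, d_ff, i) for i in range(n_layers))
    return n_layers * layer_params(d_model, d_ff)

# Hypothetical baseline configuration (not the paper's actual sizes).
baseline = total_params(512, 2048, 6)

# Shrink the dense model's width until it fits under the baseline's budget.
d = 512
while total_params(d, 4 * d, 6, dense=True) > baseline:
    d -= 8
print(d, total_params(d, 4 * d, 6, dense=True) <= baseline)  # → 448 True
```

Under these toy numbers the projections cost roughly a quarter of a baseline layer each, which is why a parameter-matched dense model must give up width (or depth) somewhere else, exactly the trade-off the second fairness regime is designed to expose.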
The experimental setup was complemented by a precise implementation choice. Both datasets were processed with word-level tokenization, using official train, validation, and test splits, and all perplexity values were computed consistently within that vocabulary space. The two architectures shared the same general decoder-only scaffold, while differing in their internal organization. The baseline followed the standard residual Transformer design, whereas the dense variant introduced concatenation-based historical connections followed by learned projection, allowing each layer to reuse information from earlier layers more directly.
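The concatenation-and-projection scheme can be sketched in a few lines. This is a minimal reconstruction of the general idea, not the paper's implementation: each "layer" below is a stand-in transformation rather than a full attention-plus-feed-forward block, and the weight shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, seq_len, n_layers = 8, 5, 3

def layer_fn(h, w):
    # Stand-in for a full decoder layer (attention and feed-forward omitted).
    return np.tanh(h @ w)

# Per-layer weights, plus one learned projection per layer that maps the
# concatenated history ((i + 1) * d_model wide) back to d_model.
weights = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_layers)]
projections = [rng.standard_normal(((i + 1) * d_model, d_model)) * 0.1
               for i in range(n_layers)]

x = rng.standard_normal((seq_len, d_model))
history = [x]  # embeddings plus every layer output produced so far
for i, w in enumerate(weights):
    concat = np.concatenate(history, axis=-1)  # (seq_len, (i + 1) * d_model)
    history.append(layer_fn(concat @ projections[i], w))

out = history[-1]
print(out.shape)  # → (5, 8)
```

The key contrast with a residual baseline is visible in the loop: instead of adding each layer's output to a single running stream, every layer reads the full stack of earlier representations, at the cost of projections whose size grows with depth.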
The paper was guided by three main research questions:
Does dense historical connectivity improve test perplexity compared with a standard Transformer decoder under controlled comparison regimes?
Which architectural factors matter most within the explored design space, including model width, feed-forward size, depth, and number of attention heads?
Do dense and baseline models generate texts with different long-range structural signatures, even when standard predictive metrics do not show a clear advantage?
The answer that emerges is both interesting and methodologically instructive. Dense connectivity does not lead to a systematic reduction in perplexity. On WikiText-2, the baseline remains stronger in both fairness regimes. On Penn Treebank, the gains of the dense model are limited and depend on the comparison setting. This matters because it shows that an architectural idea may be theoretically plausible and still fail to deliver a robust practical advantage once tested under controlled conditions.
A particularly relevant result comes from the ablation study. Within the dense family, the most reliable improvements are associated with depth and feed-forward capacity, rather than with dense connectivity alone. This suggests that much of the observed performance variation is better explained by how model capacity is allocated than by the presence of cross-layer concatenation in itself. In other words, the study helps clarify that dense connectivity interacts with more fundamental architectural factors rather than replacing them as the main driver of performance.
Another important aspect of the work lies in the decision to evaluate the models beyond perplexity. Perplexity remains the standard metric for next-token prediction, but it does not capture every relevant aspect of generated language. For this reason, the paper also includes analyses of learning dynamics, attention behavior, targeted probes, and long-form text generation. The probing tasks and attention diagnostics do not reveal a clear linguistic advantage for the dense architecture in the explored setting, although they do highlight behavioral differences between the two model families.
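As a reminder of what the headline number actually measures: perplexity is the exponential of the mean per-token negative log-likelihood that the model assigns to the reference text, or equivalently the inverse geometric mean of the token probabilities. A minimal sketch with made-up probabilities:

```python
import math

def perplexity(token_probs):
    """Exponential of the mean negative log-likelihood of the reference tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Made-up probabilities a model might assign to four reference tokens.
probs = [0.25, 0.5, 0.1, 0.2]
print(round(perplexity(probs), 2))  # → 4.47
```

A perplexity of 4.47 means the model is, on average, as uncertain as if it were choosing uniformly among about 4.5 tokens at each step; two models can match on this average while distributing their uncertainty very differently across a text, which is precisely the gap the additional diagnostics aim to fill.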
One of the most original contributions of the paper is the use of Zipf–RQA for analyzing generated text. This framework combines Zipf-rank encoding with Recurrence Quantification Analysis in order to study long-range structural regularities in long-form outputs. Here, the results become especially interesting. Even when perplexity does not improve, the dense and baseline models show systematic structural differences in the organization of generated text. This suggests that architectural changes may alter the global form of language generation even when they do not produce better scores on standard predictive metrics.
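To make the idea concrete, here is a loose, minimal sketch of the two ingredients as we describe them above, not the paper's actual pipeline: each token is replaced by the rank of its frequency in the text (its Zipf rank), and a recurrence matrix then marks where similar ranks recur across positions, summarized by the recurrence rate. Real RQA would use embedded trajectories and further measures such as determinism.

```python
import numpy as np
from collections import Counter

def zipf_ranks(tokens):
    """Map each token to the rank of its frequency (1 = most frequent)."""
    freq = Counter(tokens)
    rank = {tok: r for r, (tok, _) in enumerate(freq.most_common(), start=1)}
    return np.array([rank[t] for t in tokens], dtype=float)

def recurrence_rate(series, eps=0.5):
    """Fraction of position pairs whose ranks lie within eps of each other."""
    dist = np.abs(series[:, None] - series[None, :])
    return (dist <= eps).mean()

tokens = "the cat sat on the mat and the cat ran".split()
ranks = zipf_ranks(tokens)
print(recurrence_rate(ranks))  # → 0.18
```

Even in this toy example, repeated words ("the", "cat") produce off-diagonal recurrences; on long generated texts, the geometry of such recurrences (diagonal lines, vertical structures) is what distinguishes the structural signatures of the two model families.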
From a broader perspective, this is the main message of the article. Evaluating a language model through a single headline number is rarely sufficient for understanding what an architecture is actually doing. A richer methodology, one that combines predictive performance, internal diagnostics, and structural analysis of generated text, can reveal differences that would otherwise remain invisible.
Overall, this publication offers a controlled and transparent contribution to Transformer research. Rather than presenting densification as a simple improvement, it shows where its limits emerge, which design factors matter most, and why multi-faceted evaluation is necessary for understanding architectural innovation in language models. For our lab, this work reflects a broader research direction devoted to studying neural language models as complex systems, whose behavior deserves to be analyzed from several complementary viewpoints.
Please cite as:
De Santis, E., Martino, A., & Rizzi, A. (2026). Beyond Perplexity: A Multi-Faceted Analysis of a Novel Densely Connected Transformer. Applied Sciences, 16(6), 2721. https://doi.org/10.3390/app16062721
BibTeX:
@article{deSantis2026BeyondPerplexity,
  author  = {De Santis, Enrico and Martino, Alessio and Rizzi, Antonello},
  title   = {Beyond Perplexity: A Multi-Faceted Analysis of a Novel Densely Connected Transformer},
  journal = {Applied Sciences},
  year    = {2026},
  volume  = {16},
  number  = {6},
  pages   = {2721},
  doi     = {10.3390/app16062721},
  url     = {https://doi.org/10.3390/app16062721}
}