Knowledge Trees: Gradient Boosting Decision Trees on Knowledge Neurons as Probing Classifier
Authors:
Sergey A. Saltykov
Abstract:
To understand how well a large language model captures certain semantic or syntactic features, researchers typically apply probing classifiers. However, the accuracy of these classifiers is critical for the correct interpretation of the results. If a probing classifier exhibits low accuracy, this may be due either to the fact that the language model does not capture the property under investigatio…
▽ More
To understand how well a large language model captures certain semantic or syntactic features, researchers typically apply probing classifiers. However, the accuracy of these classifiers is critical for the correct interpretation of the results. If a probing classifier exhibits low accuracy, this may be due either to the fact that the language model does not capture the property under investigation, or to shortcomings in the classifier itself, which is unable to adequately capture the characteristics encoded in the internal representations of the model. Consequently, for more effective diagnosis, it is necessary to use the most accurate classifiers possible for a particular type of task. Logistic regression on the output representation of the transformer neural network layer is most often used to probing the syntactic properties of the language model.
We show that using gradient boosting decision trees at the Knowledge Neuron layer, i.e., at the hidden layer of the feed-forward network of the transformer as a probing classifier for recognizing parts of a sentence is more advantageous than using logistic regression on the output representations of the transformer layer. This approach is also preferable to many other methods. The gain in error rate, depending on the preset, ranges from 9-54%
△ Less
Submitted 17 December, 2023;
originally announced December 2023.
On the utility of feature selection in building two-tier decision trees
Authors:
Sergey A. Saltykov
Abstract:
Nowadays, feature selection is frequently used in machine learning when there is a risk of performance degradation due to overfitting or when computational resources are limited. During the feature selection process, the subset of features that are most relevant and least redundant is chosen. In recent years, it has become clear that, in addition to relevance and redundancy, features' complementar…
▽ More
Nowadays, feature selection is frequently used in machine learning when there is a risk of performance degradation due to overfitting or when computational resources are limited. During the feature selection process, the subset of features that are most relevant and least redundant is chosen. In recent years, it has become clear that, in addition to relevance and redundancy, features' complementarity must be considered. Informally, if the features are weak predictors of the target variable separately and strong predictors when combined, then they are complementary. It is demonstrated in this paper that the synergistic effect of complementary features mutually amplifying each other in the construction of two-tier decision trees can be interfered with by another feature, resulting in a decrease in performance. It is demonstrated using cross-validation on both synthetic and real datasets, regression and classification, that removing or eliminating the interfering feature can improve performance by up to 24 times. It has also been discovered that the lesser the domain is learned, the greater the increase in performance. More formally, it is demonstrated that there is a statistically significant negative rank correlation between performance on the dataset prior to the elimination of the interfering feature and performance growth after the elimination of the interfering feature. It is concluded that this broadens the scope of feature selection methods for cases where data and computational resources are sufficient.
△ Less
Submitted 29 December, 2022;
originally announced December 2022.