In empirical regression analysis in financial accounting, the modeling
approach is essentially the same across studies, namely:
Dependent variable = Intercept + Explanatory variables (group) + Control variables (group) + Random error term
Sample construction consists of collecting data on these variables, with the
sample size expanded as much as possible on the principle that larger is
better. The raw data, that is, the original samples, may also be modified if
the researcher considers it necessary. Then, a series of supplementary empirical
regression analyses are conducted on the basic regression results, such as
robustness tests, endogeneity tests, unit root tests, mechanism tests, and
fixed effects tests, among others, in order to demonstrate that the research
conducted is robust and reliable. The common characteristic of these studies is
that tests are conducted for the sake of testing; all possible tests are
carried out without arguing for the necessity or sufficiency of these tests.
These tests are mainly based on subjective assumptions. Naturally, such
research raises an obvious question: are these studies valid? Could a large
number of invalid studies be regarded as valid? Answering these questions, and
thereby returning to the earlier research hypothesis, is the purpose of this
study. To address these fundamental issues, this paper uses logistic regression
analysis, because this method meets the requirements of the sample collected
for this study. The sample is drawn entirely from the empirical content of 35
research papers published in Accounting Research and totals 1,140 sample
points, which is sufficient for the needs of this study. So that each article
is covered in full, the data obtained from each empirical model in an article
are treated as independent samples. Although the models within an article are
subjectively related in a certain sense, they are in fact relatively
independent and lack a strictly logical, necessary connection.
Why are the data within the same article regarded as independent of each
other?
The main reason is that the correlation among the empirical models in an
article exists only on the surface and rests on subjective speculation. For
example, after a benchmark model test, a multicollinearity test is carried out
if multicollinearity is subjectively suspected; an autocorrelation test is
carried out if autocorrelation is subjectively suspected; and so on. From a
research perspective, such work is ineffective and may even be a deliberate
attempt to lengthen the paper; some researchers may conduct it in the belief
that 'doing more can't hurt.' In rigorous research, however, unnecessary and
excessive testing is a waste of resources. From the perspective of precise and
efficient research, empirical models whose correlation is not justified in
advance by a sound logical relationship, but rests merely on 'subjective
suspicion,' are regarded as having no necessary connection with one another
within an article, even if the article claims that some 'correlations' may
exist between them. The sample collected in this study consists of the
goodness-of-fit of each empirical model. The 1,140 goodness-of-fit values
correspond to 1,140 empirical regression reports, and their random statistical
distribution is shown in Figure 1.
The arrangement of the 1,140 sample points
shown in Figure 1 is random, and the overall distribution appears to be stable.
In order to conduct more precise scientific research on this sample, it is
necessary to establish an appropriate econometric model and adopt suitable
research methods.
Definition 1: For a given positive number $\alpha \in (0,1)$, define the dummy variable $D$ as follows:
\[
D = \begin{cases} 1, & R^2 \ge \alpha, \\ 0, & R^2 < \alpha, \end{cases}
\]
where $\alpha$ is the nominal measurement parameter and $R^2$ is the goodness of fit.
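The dummy variable of Definition 1 can be illustrated with a minimal Python sketch. The function name `nominal_dummy`, the threshold name `alpha` (standing for the nominal measurement parameter), and the example values are assumptions made for illustration only; they are not taken from the sample used in this paper.

```python
def nominal_dummy(r_squared, alpha):
    """Return 1 if the goodness of fit reaches the threshold alpha, else 0."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in the open interval (0, 1)")
    return 1 if r_squared >= alpha else 0


# Hypothetical goodness-of-fit values and threshold, for illustration only.
example_r2 = [0.12, 0.35, 0.50, 0.78]
example_alpha = 0.5
dummies = [nominal_dummy(r2, example_alpha) for r2 in example_r2]
print(dummies)  # [0, 0, 1, 1]
```

A goodness of fit exactly equal to the threshold is coded as 1, in line with the weak inequality in the definition.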
According to econometric theory, for every
empirical regression model, there exists a corresponding goodness-of-fit, which
reflects how well the empirical model fits the sample data. A higher
goodness-of-fit indicates that the corresponding empirical model has stronger
explanatory power for the population; conversely, a lower goodness-of-fit
suggests weaker explanatory power for the population. An empirical model
explains the data on at least two levels. The first level is the explanatory
power of the model as a whole for the population, which is measured by the
goodness-of-fit. The second level is the explanatory power of the independent
variables for the dependent variable or the population. The former is the
foundation and describes the overall framework, while the latter shows up in
the specific function of each variable. In the literature reviewed, numerous
empirical studies focus on significance tests of the latter, while the former
is consciously or unconsciously neglected. The usual practice is to cite the
goodness-of-fit when it is favorable and to set it aside when it is not, which
leaves it in an optional and awkward position. This situation arises mainly
from an insufficient understanding of goodness-of-fit. The relationship between
goodness-of-fit and the independent variables is captured by the proverb
"without the skin, how can the hair attach?" Here, the "skin" is the
goodness-of-fit, which measures the effectiveness of the empirical model, and
the "hair" is the independent variables. The larger the goodness-of-fit, the
stronger the empirical model's explanatory power for the population. In this
paper, the nominal effectiveness of an empirical regression model in
explaining the population is therefore measured throughout by the value of its
goodness-of-fit. To this end, the following testing model is established:
\[
D_i = a + b R_i^2 + \varepsilon_i, \qquad i = 1, 2, \ldots, n, \tag{1}
\]
where $a$ and $b$ are parameters and $\varepsilon_i$ is the random disturbance term.
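Because the dependent variable in Model (1) is binary and the model is studied by logistic regression, one standard reading is that the linear index enters through the logistic link. This reading is an assumption made here, not a form stated in the text. With $D_i$ the dummy variable and $R_i^2$ the goodness of fit of the $i$-th empirical model, it can be written as:

```latex
\[
  P(D_i = 1 \mid R_i^2) = \frac{1}{1 + e^{-(a + b R_i^2)}},
  \qquad
  \ln\frac{P(D_i = 1 \mid R_i^2)}{1 - P(D_i = 1 \mid R_i^2)} = a + b R_i^2 .
\]
```

Under this reading, $a$ and $b$ are the parameters of the log-odds of an empirical model being coded $D_i = 1$.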
Definition 2: Suppose $R^2$ is the goodness of fit of an empirical regression model and $\alpha \in (0,1)$ is a nominal measurement parameter. If $R^2 \ge \alpha$,
the empirical regression model is said to be nominally valid; otherwise, the
empirical regression model is said to be nominally invalid.
A goodness of fit with $R^2 \ge \alpha$ ($R^2 < \alpha$) is
referred to as the nominal validity (nominal invalidity) of the corresponding
empirical regression model.
An empirical regression model whose goodness-of-fit is nominally invalid is
itself nominally invalid. For convenience, both cases, nominal validity and
nominal invalidity, are discussed under the single term nominal validity, and
the distinction between them is expressed as a probability. For example, a
nominal validity of 60% indicates that 60% of the empirical models are
nominally valid, while 40% of the empirical models are nominally invalid.
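The percentage reading of nominal validity can be sketched in Python as the share of models whose goodness of fit reaches the threshold. The function name `nominal_validity_share`, the threshold name `alpha`, and the five example values are hypothetical and chosen only to reproduce the 60% case described above; this is a descriptive share, not an estimate from Model (1).

```python
def nominal_validity_share(r_squared_values, alpha):
    """Return the fraction of models whose goodness of fit is at least alpha."""
    if not r_squared_values:
        raise ValueError("at least one goodness-of-fit value is required")
    valid = sum(1 for r2 in r_squared_values if r2 >= alpha)
    return valid / len(r_squared_values)


# Hypothetical goodness-of-fit values, for illustration only.
example_r2 = [0.10, 0.30, 0.55, 0.60, 0.90]
share = nominal_validity_share(example_r2, 0.5)
print(f"nominally valid: {share:.0%}, nominally invalid: {1 - share:.0%}")
```

With these values, three of the five models reach the threshold, so 60% are nominally valid and 40% are nominally invalid.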
Model (1) is therefore the basic model for measuring the validity of empirical
regression models. It is studied using logistic regression analysis and
expresses the nominal validity of empirical regression analysis in the form of
a probability distribution.
Proposition: For a given sample size, the nominal validity of empirical
regression analysis can be obtained from Model (1).
The proposition follows from the proof of the theorem by Sheng Wang [36].