\documentclass{letter}

\begin{document}

\documentclass{article}

\usepackage[utf8]{inputenc}

\title{Performance of New Two-parameter Estimator for Multinomial logit model}

\date{September 2021}

\usepackage{amsmath}

\begin{document}

\maketitle

\textbf{Abstract}

In multinomial data analysis, linear relationships among the predictors threaten the reliability of inference, and ridge regression has been considered as a remedy to avoid erroneous non-significant results in the Multinomial Logit Model (MLM). The Two-parameter Estimator (TPE), however, produces better results when evaluated under the Mean Squared Error (MSE) criterion in Monte Carlo simulations. The study reports that under low to moderate multicollinearity among the predictors, the estimated MSE of the $\operatorname{NTPE}\left(\hat{k}_{o p t}, \hat{d}_{o p t}\right)$ is smaller than that of the Maximum Likelihood Estimator (MLE). As the number of predictors increases, the respective MSE decreases. Under high multicollinearity, the $\operatorname{TPE}\left(\hat{k}_{\min }, \hat{d}_{\min }\right)$ consistently attains the smallest MSE among the estimators considered.


\textbf{Keywords }

Multinomial logit model, multicollinearity, Mean Square Error, Maximum Likelihood Estimator, New Two-parameter Estimator.

\section{Introduction}

The multiple linear regression model is given as follows:
$$y=X \beta+\varepsilon$$

In the above equation, $\beta$ is a $p \times 1$ vector of unknown regression coefficients, $y$ is an $n \times 1$ vector of observations on the dependent variable, $X$ is an $n \times p$ matrix of independent variables and $\varepsilon$ is an $n \times 1$ vector of random errors, normally distributed with zero mean vector and covariance matrix $\sigma^{2} I_{n}$, where $I_{n}$ is the identity matrix of order $n$. The regression coefficients $\beta$ are estimated by ordinary least squares (OLS), which gives $\hat{\beta}=\left(X^{\prime} X\right)^{-1} X^{\prime} y$ with covariance matrix $\operatorname{Cov}(\hat{\beta})=\sigma^{2}\left(X^{\prime} X\right)^{-1}$. Both $\hat{\beta}$ and $\operatorname{Cov}(\hat{\beta})$ depend on the properties of the $X^{\prime} X$ matrix. When $X^{\prime} X$ is ill-conditioned, the OLS estimator becomes sensitive to a number of errors: the regression parameters may become statistically insignificant or take the wrong sign, which makes valid statistical inference difficult. An ill-conditioned $X^{\prime} X$ matrix means that the explanatory variables are correlated with each other. This problem is known as multicollinearity.

Multicollinearity can be addressed by various methods available in the literature, e.g. parameterizing the model, collecting additional data, the principal component method and the ridge regression method. Ridge regression, proposed by Hoerl and Kennard (1970), is one of the most widely used methods in the presence of multicollinearity. They suggested adding a small value, known as the ridge parameter $k$, to the diagonal elements of the $X^{\prime} X$ matrix, which gives the estimator $$\hat{\beta}_{R}=\left(X^{\prime} X+k I_{p}\right)^{-1} X^{\prime} y, \quad k \geq 0,$$ known as the ridge regression estimator. For a suitable $k$, this estimator has a smaller MSE than the OLS estimator. The constant $k \geq 0$ is known as the biasing (shrinkage) parameter and is estimated from the observed data.
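As an illustrative sketch of why a small $k$ helps, consider the ridge estimator in canonical form. The notation $\Gamma$, $\Lambda$, $\hat{\alpha}$ and the numbers used below are assumptions introduced only for this example. Write $X^{\prime} X=\Gamma \Lambda \Gamma^{\prime}$ with $\Lambda=\operatorname{diag}\left(\lambda_{1}, \ldots, \lambda_{p}\right)$ and $\hat{\alpha}=\Gamma^{\prime} \hat{\beta}_{O L S}$. Then each component of the ridge estimator is a shrunken version of the corresponding OLS component:

$$\Gamma^{\prime} \hat{\beta}_{R}=\left(\Lambda+k I_{p}\right)^{-1} \Lambda \hat{\alpha}, \quad \text { i.e. } \quad \hat{\alpha}_{R, i}=\frac{\lambda_{i}}{\lambda_{i}+k} \hat{\alpha}_{i}$$

The variance of the $i$th component falls from $\sigma^{2} / \lambda_{i}$ to $\sigma^{2} \lambda_{i} /\left(\lambda_{i}+k\right)^{2}$. For a hypothetical near-singular direction with $\lambda_{i}=0.01$ and $k=0.1$:

$$\frac{\sigma^{2}}{\lambda_{i}}=100 \sigma^{2}, \quad \frac{\sigma^{2} \lambda_{i}}{\left(\lambda_{i}+k\right)^{2}}=\frac{0.01}{0.0121} \sigma^{2} \approx 0.83 \sigma^{2}$$

The price of this variance reduction is a bias in the shrunken component, which is why the choice of $k$ matters.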

A new biased estimator, known as the Liu estimator, was proposed by Liu (1993). Its shrinkage parameter is denoted by $d$, where $0<d<1$. The Liu estimator $\hat{\beta}_{L}$ is built on the idea of the ridge estimator $\hat{\beta}_{R}$ and combines the advantages of both the Stein estimator $\hat{\beta}_{S}$ and the ridge estimator $\hat{\beta}_{R}$. A simulation study shows that $\hat{\beta}_{L}$ is effective and that the generalized form of $\hat{\beta}_{L}$ can be the same as $\hat{\beta}_{L}$. Because $\hat{\beta}_{L}$ is a linear function of $d$, the value of $d$ is easy to choose. The Liu (1993) estimator can be written as follows: $$\hat{\beta}_{L}=\left(X^{\prime} X+I\right)^{-1}\left(X^{\prime} X+d I\right) \hat{\beta}_{O L S}$$
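The linearity in $d$ can be made explicit with a short sketch in canonical form. The symbols $\Gamma$, $\Lambda$ and $\hat{\alpha}$ are assumptions introduced for this illustration, with $X^{\prime} X=\Gamma \Lambda \Gamma^{\prime}$, $\Lambda=\operatorname{diag}\left(\lambda_{1}, \ldots, \lambda_{p}\right)$ and $\hat{\alpha}=\Gamma^{\prime} \hat{\beta}_{O L S}$. The $i$th component of $\Gamma^{\prime} \hat{\beta}_{L}$ is then

$$\hat{\alpha}_{L, i}=\frac{\lambda_{i}+d}{\lambda_{i}+1} \hat{\alpha}_{i}=\frac{\lambda_{i}}{\lambda_{i}+1} \hat{\alpha}_{i}+d \, \frac{\hat{\alpha}_{i}}{\lambda_{i}+1}$$

which is a straight line in $d$: at $d=1$ it returns the OLS component $\hat{\alpha}_{i}$, and at $d=0$ it returns the ridge component with $k=1$. By contrast, the ridge factor $\lambda_{i} /\left(\lambda_{i}+k\right)$ is a non-linear function of $k$.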

The Liu estimator and the ridge estimator are commonly applied methods that give effective results in the presence of multicollinearity. This work was extended to the logit model by Schaeffer et al. (1984). Mansson and Shukur (2011) then proposed some new methods for estimating the ridge parameter $k$ for the logit regression model. These methods also have a disadvantage: the estimated parameters are a non-linear function of the ridge parameter $k$, which lies between zero and infinity.

Building on the work of Liu (1993), many researchers proposed new methods for estimating the shrinkage parameter $d$. Yang and Chang (2010) then proposed another estimator for the linear regression model, known as the New Two-parameter estimator, which performed better than the other available estimators. They combined three estimators (OLS, ridge and Liu) into the New Two-parameter estimator, which can be written as follows:

$$\hat{\beta}_{N T P E}=\left(X^{\prime} X+I_{p}\right)^{-1}\left(X^{\prime} X+d I_{p}\right)\left(X^{\prime} X+k I_{p}\right)^{-1} X^{\prime} y$$

\subsection{Special Cases of the New Two-parameter Estimator (NTPE)}

From the definition of $\hat{\beta}(k, d)$, we can see that the NTPE is a general estimator which includes the OLS, RR and Liu estimators as special cases.

\begin{itemize}

\item $$\hat{\beta}(0,1)=\hat{\beta}=\left(X^{\prime} X\right)^{-1}\left(X^{\prime} y\right), \quad \text { The OLS estimator. }$$

\item $$\hat{\beta}(k, 1)=\hat{\beta}(k)=\left(X^{\prime} X+k I\right)^{-1}\left(X^{\prime} y\right), \quad \text { The RR Estimator. }$$

\item $$\hat{\beta}(0, d)=\hat{\beta}(d)=\left(X^{\prime} X+I\right)^{-1}\left(X^{\prime} X+d I\right) \hat{\beta}, \quad \text { The Liu Estimator. }$$

\end{itemize}
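These reductions can be checked by direct substitution into the two-parameter form $\hat{\beta}(k, d)=\left(X^{\prime} X+I\right)^{-1}\left(X^{\prime} X+d I\right)\left(X^{\prime} X+k I\right)^{-1} X^{\prime} y$. The following lines are a verification sketch rather than part of the original derivation:

$$\hat{\beta}(0,1)=\left(X^{\prime} X+I\right)^{-1}\left(X^{\prime} X+I\right)\left(X^{\prime} X\right)^{-1} X^{\prime} y=\left(X^{\prime} X\right)^{-1} X^{\prime} y$$

$$\hat{\beta}(k, 1)=\left(X^{\prime} X+I\right)^{-1}\left(X^{\prime} X+I\right)\left(X^{\prime} X+k I\right)^{-1} X^{\prime} y=\left(X^{\prime} X+k I\right)^{-1} X^{\prime} y$$

$$\hat{\beta}(0, d)=\left(X^{\prime} X+I\right)^{-1}\left(X^{\prime} X+d I\right)\left(X^{\prime} X\right)^{-1} X^{\prime} y=\left(X^{\prime} X+I\right)^{-1}\left(X^{\prime} X+d I\right) \hat{\beta}$$

In the first two lines the factor $\left(X^{\prime} X+I\right)^{-1}\left(X^{\prime} X+I\right)$ cancels because $d=1$; in the third, $k=0$ turns the last factor into the OLS estimator $\hat{\beta}$.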

The objective of this article is to propose a New Two-parameter estimator for the multinomial logit model and to compare its performance with that of the MLE. The article is organized as follows: the statistical methodology is presented in Section 2, a Monte Carlo simulation study is conducted in Section 3, and some concluding remarks are provided in Section 4.

\section{Methodology}

\subsection{The multinomial logit model}

The multinomial logit (MNL) model, developed by Luce (1959), is one of the most popular statistical methods when the dependent variable consists of $m$ different categories with $m>2$. The MNL model specifies
$$\pi_{l j}=\frac{\exp \left(x_{l} B_{j}\right)}{\sum_{r=1}^{m} \exp \left(x_{l} B_{r}\right)}, \quad j=1, \ldots, m,$$
where $B_{j}$ is a $(p+1) \times 1$ vector of coefficients and $x_{l}$ is the $l$th row of $X$, an $n \times(p+1)$ data matrix with $p$ predictors. The most common method for estimating $B_{j}$ is maximum likelihood (ML), in which the following log-likelihood is maximized:
$$l=\sum_{l=1}^{n} \sum_{j=1}^{m} y_{l j} \log \left(\pi_{l j}\right)$$
Setting the first derivative of the log-likelihood equal to zero, the ML estimates are found by solving
$$\frac{\partial l}{\partial B_{j}}=\sum_{l=1}^{n}\left(y_{l j}-\pi_{l j}\right) x_{l}=0$$
Since this equation is nonlinear in $B_{j}$, we use the iterative weighted least squares (IWLS) technique to solve it:
$$\beta_{j}^{(M L)}=\left(X^{\prime} W_{j} X\right)^{-1} X^{\prime} W_{j} z_{j},$$

where $W_{j}=\operatorname{diag}\left(\pi_{l j}\left(1-\pi_{l j}\right)\right)$ and $z_{j}$ is a vector whose $l$th element equals $z_{l j}=\log \left(\pi_{l j}\right)+\frac{\left(y_{l j}-\pi_{l j}\right)}{\pi_{l j}\left(1-\pi_{l j}\right)}$. The asymptotic covariance matrix of the ML estimator equals the inverse of the matrix of second derivatives (most commonly referred to as the inverse of the Hessian matrix):

$$\operatorname{Cov}\left(\beta_{j}^{(M L)}\right)=\left[-E\left(\frac{\partial^{2} l}{\partial \beta_{j} \partial \beta_{j}^{\prime}}\right)\right]^{-1}$$

$$\operatorname{Cov}\left(\beta_{j}^{(M L)}\right)=\left(X^{\prime} W_{j} X\right)^{-1}$$

and the asymptotic MSE equals:

$$E\left(L_{j}^{2(M L)}\right)=E\left(\beta_{j}^{(M L)}-\beta_{j}\right)^{\prime}\left(\beta_{j}^{(M L)}-\beta_{j}\right)$$

$$E\left(L_{j}^{2(M L)}\right)=\operatorname{tr}\left[\left(X^{\prime} W_{j} X\right)^{-1}\right]$$

$$E\left(L_{j}^{2(M L)}\right)=\sum_{i=1}^{p} \frac{1}{\lambda_{j i}}$$

Where $\lambda_{j i}$ is the ith eigenvalue of the $X^{\prime} W_{j} X$ matrix. In the presence of multicollinearity the weighted matrix of cross-products, $X^{\prime} W_{j} X$, is ill-conditioned which leads to instability and high variance of the ML estimator. In that situation it is very hard to interpret the estimated parameters since the vector of estimated coefficients is on average too long.
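A small numerical illustration shows how a single near-zero eigenvalue dominates the asymptotic MSE $\sum_{i=1}^{p} 1 / \lambda_{j i}$ of the ML estimator. The eigenvalues below are hypothetical and chosen only for this example. Suppose $X^{\prime} W_{j} X$ has eigenvalues $10$, $1$ and $0.01$. Then

$$\sum_{i=1}^{3} \frac{1}{\lambda_{j i}}=\frac{1}{10}+\frac{1}{1}+\frac{1}{0.01}=0.1+1+100=101.1$$

so the smallest eigenvalue alone accounts for $100 / 101.1 \approx 99 \%$ of the total.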

\subsection{The Multinomial Ridge Regression Estimator}

The Multinomial Logit Ridge Regression (MLNRR) estimator was first proposed by Mansson et al. (2018) to solve the problem of the inflated variance of the maximum likelihood estimator. Since $\beta_{M L}$ is obtained by the iterative weighted least squares (IWLS) technique, it approximately minimizes the weighted sum of squared errors (WSSE), so $\beta_{M L}$ can be seen as the optimal estimator in the WSSE sense. Following Schaeffer (1986), the proposed estimator is:

$$\hat{\beta}_{R R}=\left(X^{\prime} W_{j} X+k I\right)^{-1} X^{\prime} W_{j} X \beta_{M L}$$

where $W_{j}=\operatorname{diag}\left(\pi_{j}\left(1-\pi_{j}\right)\right)$ and $z$ is a vector whose $l$th element equals $z_{l}=\log \left(\pi_{j}\right)+\frac{\left(y_{j}-\pi_{j}\right)}{\pi_{j}\left(1-\pi_{j}\right)}$. The above estimator minimizes the increase in the WSSE. Its asymptotic mean square error, as given by Mansson and Shukur (2011), is $$\operatorname{EMSE}\left(\hat{\beta}_{R R}\right)=E\left(\hat{\beta}_{R R}-\beta\right)^{\prime}\left(\hat{\beta}_{R R}-\beta\right)$$

$$\operatorname{EMSE}\left(\hat{\beta}_{R R}\right)=\sum_{i=1}^{p} \frac{\lambda_{j i}}{\left(\lambda_{j i}+k\right)^{2}}+k^{2} \sum_{i=1}^{p} \frac{\alpha_{j i}^{2}}{\left(\lambda_{j i}+k\right)^{2}}$$

$$E\left(\operatorname{EMSE}\left(\hat{\beta}_{R R}\right)\right)=\gamma_{1}(k)+\gamma_{2}(k)$$

where $\alpha_{j i}^{2}$ is the square of the $i$th element of $\gamma_{j} \beta_{j}$ and $\gamma_{j}$ is the matrix of eigenvectors such that $X^{\prime} W_{j} X=\gamma_{j}^{\prime} \Lambda_{j} \gamma_{j}$, with $\Lambda_{j}=\operatorname{diag}\left(\lambda_{j i}\right)$. Hoerl and Kennard (1970) also showed that there exists a $k$ such that $E\left(L_{R R}^{2}\right)<E\left(L_{M L}^{2}\right)$. To see this, note that $\gamma_{1}(k)$ (the variance term) and $\gamma_{2}(k)$ (the squared-bias term) are monotonically decreasing and increasing functions of $k$, respectively. We then take the first derivative of the expression given above.

$$\frac{\partial E\left(\operatorname{EMSE}\left(\hat{\beta}_{R R}\right)\right)}{\partial k}=\frac{\partial \gamma_{1}(k)}{\partial k}+\frac{\partial \gamma_{2}(k)}{\partial k}$$

$$\frac{\partial E\left(E M S E\left(\hat{\beta}_{R R}\right)\right)}{\partial k}=-2 \sum_{i=1}^{p} \frac{\lambda_{j i}}{\left(\lambda_{j i}+k\right)^{3}}+2 k \sum_{i=1}^{p} \frac{\lambda_{j i} \alpha_{j i}^{2}}{\left(\lambda_{j i}+k\right)^{3}}$$

From this expression, the first derivative of $\gamma_{1}(k)$ is always negative and the first derivative of $\gamma_{2}(k)$ is always positive for $k>0$. At $k=0$ the derivative of $\gamma_{2}(k)$ equals zero, so it is enough to show that there always exists a $k>0$ for which $\frac{\partial E\left(L_{R R}^{2}\right)}{\partial k}<0$ in order to conclude that $E\left(L_{R R}^{2}\right)<E\left(L_{M L}^{2}\right)$. Each term of the derivative is negative when the ridge parameter $k$ satisfies

$$k<\frac{1}{\alpha_{j i}^{2}}$$

Moreover, setting the $i$th term of the derivative equal to zero gives the optimal value of the ridge parameter $k$:

$$k=\frac{1}{\alpha_{j i}^{2}}$$
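To make the trade-off concrete, consider a single component of the asymptotic MSE, $f(k)=\frac{\lambda}{(\lambda+k)^{2}}+\frac{k^{2} \alpha^{2}}{(\lambda+k)^{2}}$, with hypothetical values $\lambda=0.5$ and $\alpha^{2}=4$ chosen only for illustration. The candidate optimum is $k=1 / \alpha^{2}=0.25$, and

$$f(0)=\frac{0.5}{0.25}=2, \quad f(0.25)=\frac{0.5+0.0625 \times 4}{0.75^{2}}=\frac{0.75}{0.5625} \approx 1.33, \quad f(1)=\frac{0.5+4}{1.5^{2}}=\frac{4.5}{2.25}=2$$

In this sketch the MSE falls from $2$ at $k=0$ (the ML value $1 / \lambda$) to about $1.33$ at $k=1 / \alpha^{2}$, and rises back to $2$ once $k$ is increased to $1$, which illustrates why $k$ must be kept small.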

\subsection{The Proposed NTPE for multinomial logit model}

As a remedy to the problem of the inflated variance of the ML estimator, we propose a New Two-Parameter Estimator (NTPE) for the multinomial logit model. Since $\hat{\beta}_{M L}$ is found using the weighted least squares (WLS) algorithm, it approximately minimizes the weighted sum of squared errors (WSSE). Hence $\hat{\beta}_{M L}$ can be seen as the optimal estimator in the WSSE sense. Accordingly, we have the following estimator:

$$\hat{\beta}_{N T P}=\left(X^{\prime} W X+I_{p}\right)^{-1}\left(X^{\prime} W y+(d-k) \hat{\beta}_{M L R}\right)$$

$$\hat{\beta}_{N T P}=\left(X^{\prime} W X+I_{p}\right)^{-1}\left(X^{\prime} W X+d I_{p}\right) \hat{\beta}_{M L R}$$

$$\hat{\beta}_{N T P}=\left(X^{\prime} W X+I_{p}\right)^{-1}\left(X^{\prime} W X+d I_{p}\right)\left(X^{\prime} W X+k I_{p}\right)^{-1} X^{\prime} W y$$

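where $W=\operatorname{diag}\left(\pi_{j}\left(1-\pi_{j}\right)\right)$, $z$ is a vector whose $l$th element equals $z_{l}=\log \left(\pi_{j}\right)+\frac{\left(y_{j}-\pi_{j}\right)}{\pi_{j}\left(1-\pi_{j}\right)}$, and $\hat{\beta}_{M L R}=\left(X^{\prime} W X+k I_{p}\right)^{-1} X^{\prime} W y$ is the multinomial logit ridge estimator, as required for the three expressions to agree. This is the proposed New Two-parameter estimator.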
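The agreement of the three expressions for $\hat{\beta}_{N T P}$ can be verified in one step. This is a sketch that takes $\hat{\beta}_{M L R}=\left(X^{\prime} W X+k I_{p}\right)^{-1} X^{\prime} W y$, which is what the third expression implies. Splitting $X^{\prime} W X+d I_{p}$ as $\left(X^{\prime} W X+k I_{p}\right)+(d-k) I_{p}$ gives

$$\left(X^{\prime} W X+d I_{p}\right)\left(X^{\prime} W X+k I_{p}\right)^{-1} X^{\prime} W y=X^{\prime} W y+(d-k)\left(X^{\prime} W X+k I_{p}\right)^{-1} X^{\prime} W y=X^{\prime} W y+(d-k) \hat{\beta}_{M L R}$$

Premultiplying both sides by $\left(X^{\prime} W X+I_{p}\right)^{-1}$ turns the left-hand side into the third expression and the right-hand side into the first; replacing $\left(X^{\prime} W X+k I_{p}\right)^{-1} X^{\prime} W y$ by $\hat{\beta}_{M L R}$ on the left gives the second.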

\subsubsection{Selection of the optimized Biasing Parameters}

The two biasing parameters are obtained by finding the optimal values of $k$ and $d$. For a given $d$ (with $d=1$ giving $\hat{\beta}(k, 1)=\hat{\beta}(k)$), the optimal value of $\hat{k}$ can be written as

$$\hat{k}_{o p t}=\frac{\hat{\sigma}^{2}\left(\lambda_{i}+d\right)-(1-d) \lambda_{i} \alpha_{i}^{2}}{\left(\lambda_{i}+1\right) \alpha_{i}^{2}}$$

For $d=1$ the above value reduces to the estimate of $k$ given by Hoerl and Kennard (1970). If $k=0$, $\hat{\beta}(k, d)$ becomes the Liu estimator, and the optimal value of $\hat{d}$ can be written as

$$\hat{d}_{o p t}=\frac{\sum_{i=1}^{p}\left(\hat{\alpha}_{i}^{2}-\hat{\sigma}^{2}\right) /\left(\lambda_{i}+1\right)^{2}}{\sum_{i=1}^{p}\left(\hat{\sigma}^{2}+\lambda_{i} \hat{\alpha}_{i}^{2}\right) /\left(\lambda_{i}+1\right)^{2} \lambda_{i}}$$
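Two quick checks of these formulas may be helpful; the numerical values are hypothetical and serve only as an illustration. First, substituting $d=1$ into the expression for $\hat{k}_{o p t}$ removes the term in $(1-d)$ and cancels the factor $\left(\lambda_{i}+1\right)$:

$$\left.\hat{k}_{o p t}\right|_{d=1}=\frac{\hat{\sigma}^{2}\left(\lambda_{i}+1\right)}{\left(\lambda_{i}+1\right) \alpha_{i}^{2}}=\frac{\hat{\sigma}^{2}}{\alpha_{i}^{2}}$$

which is the familiar ridge form. Second, for a single component ($p=1$) with $\lambda_{1}=2$, $\hat{\alpha}_{1}^{2}=1$ and $\hat{\sigma}^{2}=0.5$:

$$\hat{d}_{o p t}=\frac{(1-0.5) / 3^{2}}{(0.5+2 \times 1) /\left(3^{2} \times 2\right)}=\frac{0.5 / 9}{2.5 / 18}=0.4$$

so the resulting value of $d$ lies inside the interval $(0,1)$ in this example.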

\subsubsection{Selection of the Biasing Parameter k via articles}

The biasing parameter $k$ of the NTPE is taken from the article ``Performance of some ridge regression estimators for the multinomial logit model''. Its authors proposed 16 estimators of $k$ and concluded that $\hat{k}_{13}$ performed best among them.

$$\hat{k}_{13}=\prod_{i=1}^{p}\left(\frac{1}{q_{j i}}\right)^{\frac{1}{p}}$$

$$\text { where } q_{j i}=\frac{\lambda_{j \max }}{(n-p)+\lambda_{j \max } \hat{\alpha}_{j i}^{2}}$$

The other biasing parameter $d$ of the NTPE is taken from the article ``On Liu estimators for the logit regression model''. Its authors proposed five estimators of the Liu biasing parameter for the logit regression model and concluded that D5 performed best among them.

$$D 5=\max \left[0, \min \left(\frac{\hat{\alpha}_{j}^{2}-\hat{\varphi}}{\frac{1}{\hat{\lambda}_{j}}+\hat{\alpha}_{j}^{2}}\right)\right]$$

\subsubsection{Selection of the Biasing Parameters by Self-selection}

From our study of the literature, we observed that the criteria for selecting the biasing parameters $k$ and $d$ differ, although both parameters are taken here to have the same range, 0 to 1. After many calculations, we found that the biasing parameter $k$ should be close to 0, whereas the other biasing parameter $d$ should be close to 1.

In this study we used the following values of the biasing parameters:

$$k=0.02, \quad d=0.80$$

With these values we obtained better results.

\subsubsection{Judging the performance of the estimators}

To examine whether the NTPE is better than the MLE, we compute the MSE using the following equation, where $\hat{\beta}_{i j}$ is the estimate of $\beta_{j}$ in the $i$th of $R$ replications:

$$M S E=\frac{\sum_{i=1}^{R} \sum_{j=2}^{m}\left(\hat{\beta}_{i j}-\beta_{j}\right)^{\prime}\left(\hat{\beta}_{i j}-\beta_{j}\right)}{R}$$

\subsection{Performance of the proposed estimator under differently distributed error terms: normal, t and exponential}

In most studies the error term is assumed to be normally distributed with mean 0 and constant variance. In this study we also use error terms drawn from other distributions, because we want to see the trend and the changes in the empirical results.

\begin{itemize}

\item Standard normal distribution: $\epsilon \sim N(0,1)$

\item $t$-distribution: $\epsilon \sim t_{n-1}$

\item Exponential distribution: $\epsilon \sim \operatorname{Exp}(\lambda)$

\end{itemize}

\subsection{Performance Criterion for the Estimator}

The estimated MSE and the standard deviation of the Liu parameters are used as performance criteria to assess the proposed and existing Liu parameters in the presence of multicollinearity. The estimated MSE is defined as

$$M S E=\frac{\sum_{i=1}^{R}\left(\hat{\beta}_{i}-\beta\right)^{\prime}\left(\hat{\beta}_{i}-\beta\right)}{R}$$

where $R$ is the total number of replications, set to 3000, and $\hat{\beta}_{i}$ is the estimated value of $\beta$ in the $i$th replication obtained from the Liu estimator and the ML method. The standard deviation of $d$ is also calculated to check which estimation methods of the Liu parameter give a stable MSE.
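As a toy illustration of the estimated MSE (the numbers are hypothetical and far smaller than the simulation itself), take $R=2$ replications with true vector $\beta=(1,0)^{\prime}$ and estimates $\hat{\beta}_{1}=(1.1,-0.1)^{\prime}$ and $\hat{\beta}_{2}=(0.8,0.2)^{\prime}$. The squared estimation errors are

$$\left(\hat{\beta}_{1}-\beta\right)^{\prime}\left(\hat{\beta}_{1}-\beta\right)=0.1^{2}+0.1^{2}=0.02, \quad\left(\hat{\beta}_{2}-\beta\right)^{\prime}\left(\hat{\beta}_{2}-\beta\right)=0.2^{2}+0.2^{2}=0.08$$

so that

$$M S E=\frac{0.02+0.08}{2}=0.05$$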

\section{The Monte Carlo Simulation}

\subsection{The design of the experiment}

The response variable of the multinomial logit model is generated using pseudo-random numbers from the multinomial regression model, where $$\pi_{l j}=\frac{\exp \left(x_{l} \beta_{j}\right)}{\sum_{r=1}^{m} \exp \left(x_{l} \beta_{r}\right)}, \quad j=1, \ldots, m$$

The parameter values in the above equation are chosen so that $\beta^{\prime} \beta=1$, and the first category is taken as the base category. The first factor varied in the experimental design is the degree of correlation among the explanatory variables. Six levels of correlation $\rho$ are considered: 0.75, 0.80, 0.85, 0.90, 0.95 and 0.99. These levels enter through the parameter $\theta$ in the formula below, which is used to generate data with different degrees of correlation.

$$x_{i j}=\left(1-\theta^{2}\right)^{1 / 2} z_{i j}+\theta z_{i, p+1}, \quad i=1, \ldots, n, \quad j=1, \ldots, p$$

where the $z_{i j}$ are standard normally distributed pseudo-random numbers. Two further factors are varied, the sample size and the number of independent variables, both of which affect the mean square error and the performance of the estimators. This idea is taken from previous studies, notably Muniz and Kibria (2009) and Mansson and Shukur (2011). To obtain valid results from the multinomial regression, Mansson and Shukur (2011) increased the sample size with the number of independent variables. If the sample size is adjusted with a small correction to the degrees of freedom, the results are more meaningful (Muniz, Kibria and Shukur, 2012).
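The degree of correlation implied by the generating formula can be worked out directly. The sketch below assumes that the pseudo-random numbers $z_{i j}$ are mutually independent with mean 0 and variance 1. For two different predictors $j \neq k$ in the same observation $i$:

$$\operatorname{Var}\left(x_{i j}\right)=\left(1-\theta^{2}\right)+\theta^{2}=1, \quad \operatorname{Cov}\left(x_{i j}, x_{i k}\right)=\theta^{2} \operatorname{Var}\left(z_{i, p+1}\right)=\theta^{2}$$

$$\operatorname{Corr}\left(x_{i j}, x_{i k}\right)=\theta^{2}$$

Under this assumption, a setting such as $\theta=0.95$ gives a pairwise correlation of $0.95^{2}=0.9025$ between any two explanatory variables, since the shared term $\theta z_{i, p+1}$ is the only source of dependence.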

\begin{table}[h]

\begin{tabular}{lclllll}

\hline

\multicolumn{7}{l}{Table 3.1: Combination of the sample Sizes}     \\ \hline

Number of explanatory variables & \multicolumn{6}{c}{Sample Sizes} \\ \hline

& 50  & 75 & 100 & 150 & 200 & 250 \\ \hline

4                               & *   & *  & *   & *   &     &     \\ \hline

6                               &     & *  & *   & *   & *   &     \\ \hline

8                               &     &    & *   & *   & *   & *   \\ \hline

\end{tabular}

\end{table}

\subsection{A real application: rental price data set}

Theoretical and Monte Carlo simulation evidence alone is not enough to judge the performance of the proposed estimators. We therefore consider an empirical application to the rental price data set. This data set was taken from Späeth and consists of 67 observations, where the response variable $y$ is the rental price per acre of the given variety of grass and there are three explanatory variables: the average cost of rent per acre of arable land in dollars ($x_{1}$), the number of milk cows per square mile ($x_{2}$) and the difference between pasturage and arable land ($x_{3}$). The main objective with this data set is to investigate the rent structure with respect to a particular variety of grass.

Before statistical modelling, it is appropriate to test the probability distribution of the response variable. On the basis of the Cramér-von Mises test, we find that the rental price data set fits the multinomial distribution well, with test statistic (p-value) 0.0969 (0.1239). Therefore, we apply the MNL model, instead of the linear regression model (LRM), to investigate the rent structure with respect to a particular variety of grass. The condition index (CI = 1387.65) indicates that the data set is highly multicollinear. The estimate of $\phi$ is $\hat{\phi}=0.07700553$. In this application, the reciprocal link function is used.

The estimated coefficients, standard errors and estimated MSE of the ML estimator and the NTPE with the best parameters are presented. The NTPE with optimized parameters gains efficiency over the ML estimator in terms of the estimated MSE. The results show that the average cost of rent per acre of arable land and the number of milk cows per square mile have a negative impact, while the difference between pasturage and arable land has a positive impact, on the rental price per acre of the given variety of grass. That is, higher values of rent per acre of arable land and of milk cows per square mile indicate a low rent structure with respect to a particular variety of grass, while the positive impact of the difference between pasturage and arable land indicates a high rent structure. The standard errors of $x_{1}$, $x_{2}$ and $x_{3}$ are larger under the common ML estimator. This agrees with the theoretical and simulation results, in which the NTPE with the optimized parameters performs better than the ML estimator.

\end{document}

\end{document}