Maximum Likelihood, Exponential Family for Beta Prior With Binomial Distribution

Bayes Rule

We know the Bayes rule. How does it relate to machine learning? Bayesian inference is based on using probability to represent all forms of uncertainty.

[Uncertainty]

  • Aleatory variability is the natural (intrinsic) randomness in a process; it is supposed to be irreducible and inherent to the process involved.
    • Heteroscedastic: No one can be sure the measurements done by your colleagues are perfect..damn noise...(heteroscedastic means a different uncertainty for every input)
    • Homoscedastic: model variance? You assume identical observation noise for every input point x. Instead of having a variance that depends on the input x, we must determine a so-called model precision τ and multiply its inverse by the identity matrix I, such that all outputs y have the same variance and no covariance among them exists. This model precision τ is the inverse of the observation noise variance.
  • Epistemic uncertainty is the scientific uncertainty in the model of the process; it is supposedly reducible with better knowledge, since it is not inherent in the real-world process under consideration (it is due to lack of knowledge and limited data..This can be reduced in time, if more data are collected and new models are developed).

1> Inference & Prediction

  • Inference for θ aims to understand the model.
  • Prediction for Data aims to use the model you discovered.
  • Frequentists' probability refers to past events..Do the experiment and that's it.
  • Bayesians' probability refers to future events..Do update !

As Bayesians, we start with a belief, called a prior. Then we obtain some data and use it to update our belief. The result is called a posterior. Should we obtain even more data, the old posterior becomes a new prior and the cycle repeats. It's very honest. We cannot 100% rely on the experiment result. There is always a discrepancy and there is no guarantee that the relative frequency of an event will match the true underlying probability of the event. That's why we are approximating the probability by the long-run relative frequency in Bayesian. It's like calibrating your frequentist's subjective belief.

P( θ | Data ) = P( Data | θ ) * P( θ ) / P( Data )
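To make the prior -> posterior -> new prior cycle concrete, here is a minimal sketch for the Beta prior with Binomial data named in the title. The function names and the data counts are made up for illustration; the only fact relied on is that a Beta(a, b) prior with s successes and f failures gives a Beta(a + s, b + f) posterior.

```python
# Sketch: sequential Bayesian updating with a conjugate Beta prior on a
# success probability theta. Names and data are illustrative assumptions.

def update_beta(a, b, successes, failures):
    """Return the Beta posterior parameters after observing binomial data."""
    return a + successes, b + failures

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# Start from a flat prior Beta(1, 1).
prior = (1.0, 1.0)

# First batch of made-up data: 7 successes, 3 failures.
posterior_1 = update_beta(*prior, successes=7, failures=3)

# Second batch: the old posterior is used as the new prior.
posterior_2 = update_beta(*posterior_1, successes=2, failures=8)

# Updating in two steps matches updating once with all the data.
posterior_all = update_beta(*prior, successes=9, failures=11)

print(posterior_1, beta_mean(*posterior_1))
print(posterior_2, beta_mean(*posterior_2))
print(posterior_all)
```

Because the update only adds counts, the order in which the batches arrive does not change the final posterior.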

[a] Prior

  • P( θ ) is a prior, our belief of what the model parameters might be.
    • Prior is a weight or regularizer.
    • The final inference should converge to the probable θ as long as it's not zero in the prior.
    • Two aspects of your prior selection:
      • Subjective: Informative Prior ... your belief-based Prior

        • conjugate prior
          • a class of distributions that has the same parametric form as the likelihood; the choice is usually driven by mathematical convenience given the likelihood.
      • Objective: Non-Informative (vague) Prior when there is no information about the problem at hand.

        • Flat prior

          • Uniform, Normal with huge variance, etc. The use of a flat prior typically yields results which are not too different from conventional statistical analysis.
        • Improper prior

          • An improper prior does not integrate to 1 over its parameter space. For instance, in some cases Jeffreys priors are improper, but the posterior distribution is proper.
          • Jeffreys prior is proportional to the square root of the Fisher Information, which is the negative expected value of the second derivative of the log-likelihood function with respect to the parameter. Although it is a non-informative, often improper prior, the Fisher Information quantifies how much the available data tell us about the parameter. That is, the higher the value of the Fisher Information, the more concave is the log-likelihood, thus evidencing that the data helps to estimate the quantity of interest.
            • *Jeffreys argues that any "non-informative prior" should be invariant to the parameterization (transformation) that we are using. If we create a prior that is proportional to Sqrt(FisherInf), then the prior is invariant to the parameterization used.
        • Non-conjugate prior

          • When the posterior distribution does not appear as a distribution that we can simulate or integrate.
          • It makes the posterior have an open form (no closed form), but Metropolis-Hastings, an MCMC method, solves the problem.
    • Why was a particular prior chosen?
      • The reality is that many of these prior distributions are making assumptions about the type of data we have.
      • About a dozen distributions are used again and again; the others are special cases of these or can be created through a clever combination of two or three of these simpler distributions. A prior is employed because the assumptions of the prior match what we know about the parameter generation process. *Actually, there are multiple effective priors for a particular problem. A particular prior is chosen as some combination of analytic tractability + computational efficiency, which makes other recognizable distributions when combined with popular likelihood functions.
      • Exemplary Prior Distributions
        • Uniform Prior

          • Beta(1,1) = Unif(0,1)
          • Whether you use this one in its continuous case or its discrete case, it is used for the same thing:
            • You have a set of events that are equally likely.
              • ex) see the "binomial likelihood" case. Unif(0,1) says θ can be any value (ranging from 0 to 1) for any X.
          • Note, the uniform distribution from −∞ to ∞ is not a probability distribution.
            • Need to give lower and upper bounds for our values.
            • Not used as often as you'd think, since it's rare we want hard boundaries on our values.
        • Gaussian Prior

          • Taking a center and spread as arguments, it states that 68% of your data is within 1SD of the center, and 95% is within 2SD.
            • No need to check our value boundaries.
          • It comes up a lot because if you have multiple independent signals that come from any distribution with finite variance (with enough signals), their average converges to the normal distribution. hist(np.array([np.mean(your_distribution) for i in range(your_samples)])).
        • Beta Prior [0,1]

        • Gamma Prior [0,∞]

          • It comes up all over the place. The intuition for the gamma is that it is the prior on positive real numbers.
            • Now there are many ways to get a distribution over positive numbers.
              • take the absolute value of a normal distribution and get what's called a Half-Normal distribution.
              • take exp(Y) or Y^2 of a normal Y...Log-Normal, and χ-square.
          • So why use the gamma prior?
            • If you use a Log-Normal, you are implicitly saying that you expect the log of your variable to be symmetric.
            • If you use a χ-square, you are implicitly saying that your variable is the sum of k squared factors, where each factor came from the normal(0, 1) distribution.
            • Some people suggest using gamma because it is conjugate with lots of distributions, so it makes performing a computation easier...but it would be better to have your priors actually encode what you believe.
              • When gamma is used as the prior on the precision of something like a normal (with known mean), the posterior of this distribution is also a gamma.
            • The gamma distribution is the main way to encode something to be a positive number. Actually many distributions can be built from gamma.
              • Its parameters shape (k) and scale (θ) roughly let you tune gamma like the normal distribution. kθ specifies the mean, and kθ^2 specifies the variance.
              • Taking the reciprocal of a variable from the gamma gives you a value from the Inv-gamma distribution.
              • If we normalize this positive number (divide one gamma draw by the sum of two), we get the Beta distribution.
                import numpy.random as r

                def beta(a, b):
                    # Beta(a, b) sampler built from two gamma draws with a common scale.
                    def samples(s):
                        x = r.gamma(a, 1, s)  # s draws from Gamma(shape=a, scale=1)
                        y = r.gamma(b, 1, s)  # s draws from Gamma(shape=b, scale=1)
                        return x / (x + y)
                    return samples
              • If we want a prior on "categorical", which takes as an argument a list of numbers that sum to 1, we can use a gamma to generate k numbers and then normalize. This is precisely the definition of the Dirichlet distribution.
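The gamma-then-normalize construction of the Dirichlet described in the last bullet can be sketched with the standard library alone. The function name, the concentration values and the seed below are illustrative assumptions.

```python
import random

def dirichlet_sample(alphas, rng):
    """Draw one Dirichlet(alphas) sample by normalizing independent gamma draws."""
    # One Gamma(shape=alpha, scale=1) draw per category.
    draws = [rng.gammavariate(alpha, 1.0) for alpha in alphas]
    total = sum(draws)
    # Normalizing makes the k positive numbers sum to 1.
    return [d / total for d in draws]

rng = random.Random(0)  # fixed seed, only so the sketch is repeatable
alphas = [2.0, 3.0, 5.0]  # made-up concentration parameters
probs = dirichlet_sample(alphas, rng)
print(probs, sum(probs))

# Averaging many samples should approach alphas / sum(alphas).
many = [dirichlet_sample(alphas, rng) for _ in range(5000)]
first_component_mean = sum(s[0] for s in many) / len(many)
print(first_component_mean)
```

Each sample is a valid argument for a categorical distribution, which is why the Dirichlet is the usual prior there.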
        • Heavy-tailed Prior

          • The major advantage of using a heavy-tailed distribution is that it's more robust towards outliers (we cannot be too optimistic about how close a value stays near the mean..)..let's start to care about outliers..
          • The t-distribution can be interpreted as the distribution over a sub-sampled population from the normal distribution sample. Since here our sample size is so small, atypical values can occur more often than they do in the general population. As our sub-population grows, the t-distribution becomes the normal distribution.
            • The t-distribution can also be generalized to not be centered at 0.
            • The parameter ν lets you state how large you believe this subpopulation to be.
          • The Laplace distribution is an interesting modification to the normal distribution (replacing the squared L2-norm in the exponent with the L1-norm in the formula). A Laplace centered on 0 can be used to put a strong sparsity prior on a variable while leaving a heavy tail for it if the value has strong support for another value.

[b] Likelihood: MLE (Parameter Point Estimation)

  • P( Data | θ ) is called the likelihood of the data given the model parameters. The goal is to maximize the likelihood function L(x,x,x,x..|θ) to choose the best θ.

  • The formula for likelihood is model-specific.
  • People often use likelihood for evaluation of models: a model that gives higher likelihood to real data is better.
  • If one also takes the prior into account, then it's maximum a posteriori estimation (MAP): P(Data|θ) x P(θ). What it means is that the likelihood is now weighted with some weight coming from the prior. MLE and MAP are the same if the prior is uniform.
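As a small illustration of the last point, here is a sketch comparing the MLE and the MAP estimate of θ for binomial data under a Beta(a, b) prior. The function names and the counts are assumptions made up for the example; the MAP formula is the mode of the Beta(k + a, n − k + b) posterior.

```python
def mle_binomial(k, n):
    """Maximum-likelihood estimate of theta for k successes in n trials."""
    return k / n

def map_binomial(k, n, a, b):
    """Posterior mode of theta under a Beta(a, b) prior.

    Valid when k + a > 1 and n - k + b > 1, so the mode is interior.
    """
    return (k + a - 1) / (n + a + b - 2)

k, n = 7, 10  # made-up data

theta_mle = mle_binomial(k, n)
theta_map_flat = map_binomial(k, n, 1, 1)    # uniform prior Beta(1, 1)
theta_map_strong = map_binomial(k, n, 5, 5)  # prior centred on 0.5

print(theta_mle, theta_map_flat, theta_map_strong)
```

With the flat prior the two estimates coincide, while the informative prior pulls the MAP estimate toward its own centre.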

[c] Posterior: MAP (Parameter Point Estimation)

  • P( θ | Data ), a posterior, is what we're after. It's a parametrized distribution over model parameters obtained from prior beliefs and data. The goal is to maximize the posterior probability L(x,x,x,x..|θ)*P(θ), that is, likelihood value x prior distribution, to choose the best θ.

  • we assume the model - Joint: P(θ, Data) which is P(Data|θ) x P(θ)
  • MAP can, unlike MLE, avoid overfitting. With a zero-mean Gaussian prior, MAP gives you the L2 Regularization term.
  • But we still prefer to obtain the Full Distribution rather than merely a point estimate. We want to address the uncertainty.
  • MLE and MAP are similar, as they compute a single estimate, instead of a full distribution.

c-1) Bayesian Inference (Parameter Full Distribution Estimation)

  • "Inference" refers to how you learn the parameters of your model. Unlike MLE and MAP, Bayesian inference means that it fully calculates the posterior probability distribution, hence the output is not a single value but a pdf or pmf.
  • It's complex since we now have to deal with the Evidence (with the integral computation). But if we are allowed to use the conjugation method, Bayesian inference is easy. However, that's not always the case in real-world applications. We then need to use MCMC or other algorithms as a substitute for the direct integral computation.
  • There are three main flavours:
    • 0. Conjugation method
      • Find a conjugate prior (very clear) based on the given likelihood, then compute the posterior using math!
      • It simply implies the integral of the joint has a closed form!
    • 1. MCMC: a gold standard, but slow. Do we still need a prior? Yes! Even a fake prior, because we still need the joint!
      • It implies the integral of the joint has an open form (no closed form)!
      • Obtain a posterior by sampling from the "Envelope".
    • 2. Variational inference: faster but less accurate. Its drawback is that it's model-specific..(use when likelihood & prior are clear)
      • It implies the integral of the joint has an open form (no closed form)!
      • Obtain a posterior by "appropriating another distribution".
  • If you have a truly infinite computational budget, MCMC should give a more accurate solution than Variational Inference, which trades some accuracy for speed. With a finite budget (say one year of computation), Variational Inference can be more accurate for very large models, but if the budget is large enough MCMC should give a better solution for any model of reasonable size.
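The MCMC flavour can be sketched with a random-walk Metropolis sampler, a simple special case of Metropolis-Hastings. This is only an illustration under assumptions: the Beta-Binomial target, the step size, the burn-in length and all names are made up, and the conjugate answer is used purely as a check. The point is that the sampler only ever evaluates the un-normalized joint, never the Evidence.

```python
import math
import random

def log_joint(theta, k, n, a, b):
    """Un-normalized log posterior: binomial likelihood x Beta(a, b) prior, constants dropped."""
    if not 0.0 < theta < 1.0:
        return float("-inf")
    return (k + a - 1) * math.log(theta) + (n - k + b - 1) * math.log(1.0 - theta)

def metropolis(k, n, a, b, steps, step_size, rng):
    """Random-walk Metropolis: only ratios of the joint are needed."""
    theta = 0.5
    current = log_joint(theta, k, n, a, b)
    samples = []
    for _ in range(steps):
        proposal = theta + rng.gauss(0.0, step_size)
        candidate = log_joint(proposal, k, n, a, b)
        # Accept with probability min(1, joint(proposal) / joint(theta)).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        samples.append(theta)
    return samples

rng = random.Random(1)  # fixed seed so the sketch is repeatable
draws = metropolis(k=7, n=10, a=2, b=2, steps=20000, step_size=0.2, rng=rng)
kept = draws[2000:]  # discard an assumed burn-in period
mcmc_mean = sum(kept) / len(kept)
exact_mean = (7 + 2) / (10 + 2 + 2)  # mean of the conjugate Beta(9, 5) posterior
print(mcmc_mean, exact_mean)
```

For a non-conjugate prior only `log_joint` would change; the sampling loop stays the same.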

c-2) Variational Inference

Variational inference seeks to approximate the true posterior with an approximate variational distribution, which we can calculate more easily. The difference between the EM-algorithm and Variational-Inference is the kind of results they provide; EM is just a point while VI is a distribution. Nevertheless, they also have similarities. EM and VI can both be interpreted as minimizing some sort of distance between the true value and our estimate.

  • For EM: which is the Maximum-Likelihood
    • ...assign fake param -> develop soft fake villages -> calculate weighted param from the village -> obtain the MLE value of villages by developing new soft fake villages again based on the weighted param -> Repeat from the start until the MLE value gets to the maximum...Now, finally you obtain the best param.
  • For VI: which is the Kullback-Leibler divergence.

The term variational comes from the field of variational calculus. Variational calculus is just calculus over functionals instead of functions. A functional is just a function of a function (it takes a function as input and outputs a value). For example, the KL-divergence is a functional. The variational inference algorithms are simply optimizing functionals, which is how they got the name "variational Bayes".

Setup

  • We have a perfect likelihood and prior. But we don't have the Evidence. So the un-normalized posterior (joint) is always the starting point.
  • The main idea behind variational methods is to pick a fake? posterior q(z) as a family of distributions over the latent variables with its own variational parameters. Go with the exponential family in general? This is your FINGERS!!!!
  • Then, find the setting of the parameters that makes q(z) closest to the posterior of interest. Use q(z) with the fitted parameters as a proxy for the posterior to predict future data or to investigate the posterior distribution of the hidden variables (Typically, the true posterior is not in the variational family).
  • Typically, in the true posterior distribution, the latent variables are not independent given the data, but if we restrict our family of variational distributions to a distribution that factorizes over each variable in Z (this is called a mean field approximation), our problem becomes a lot easier.
  • We can easily pick each variational parameter (V_i) when measured by Kullback Leibler (KL) divergence because we compare this Q(Z) with our un-normalized posterior that we already have (the KL divergence formula has a sum of terms involving V, which we can minimize...Then the estimation procedure turns into an optimization problem). Once we arrive at the best V*, we can use Q(Z|V*) as our best guess at the posterior.

KL-Divergence helps find the Q(z) that minimizes the distance b/w Q(z) and P*(z)

  • Step_01: Select the family distribution Q called a "variational family": a pool of Q
  • Step_02: Try to approximate the full posterior P*(z) with some variational distribution Q(z) by searching for the best matching distribution, minimizing the "KL-divergence" value.
    • minimizing KL-divergence value (E[log Q over P*]) between Q(z) and P*(z)
  • Kullback Leibler-Divergence measures the difference (distance) b/w two distributions, so we minimize this value between your variational distribution selection and the un-normalized posterior (the minimizer does not differ from the one for the normalized real posterior...coz the evidence only adds a constant...in the end.)
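The two steps can be sketched on a one-dimensional toy problem. Everything here is an assumption for illustration: the variational family is Beta(a, b) with integer parameters, the target is the un-normalized posterior of a binomial likelihood with a flat prior, the integral is a crude Riemann sum, and a brute-force search stands in for a real optimizer. With a single latent variable there is nothing to factorize, so this does not show the mean-field case.

```python
import math

def log_beta_pdf(theta, a, b):
    """Log density of Beta(a, b) at theta."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * math.log(theta) + (b - 1) * math.log(1.0 - theta)

def log_unnormalized_posterior(theta, k, n):
    """Log of likelihood x flat prior for k successes in n trials; no evidence term."""
    return k * math.log(theta) + (n - k) * math.log(1.0 - theta)

def kl_to_unnormalized(a, b, k, n, grid):
    """Riemann-sum estimate of E_q[log q - log p*], i.e. KL(q || posterior) up to a constant."""
    width = 1.0 / len(grid)
    total = 0.0
    for theta in grid:
        log_q = log_beta_pdf(theta, a, b)
        log_p = log_unnormalized_posterior(theta, k, n)
        total += math.exp(log_q) * (log_q - log_p) * width
    return total

grid = [(i + 0.5) / 1000 for i in range(1000)]  # midpoints of (0, 1)
k, n = 7, 10  # made-up data

# Step_01: the variational family is the pool of Beta(a, b) with a, b in 1..12.
candidates = [(a, b) for a in range(1, 13) for b in range(1, 13)]

# Step_02: pick the member with the smallest KL value against the un-normalized posterior.
best = min(candidates, key=lambda ab: kl_to_unnormalized(ab[0], ab[1], k, n, grid))
print(best)
```

Because this toy target is conjugate, the exact posterior Beta(k + 1, n − k + 1) lies inside the family and the search should land on it; in realistic problems the true posterior is usually outside the family and the best Q is only an approximation.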

If you additionally require that the variational distribution factors completely over your parameters, then this is called the variational mean-field approximation.

  • Step_01: Select the family distribution Q called a "variational family" as the product of Q(z1), Q(z2),...where z is the latent variable.
  • Step_02: Try to approximate the full posterior P*(z) with some variational distribution Q(z) by searching for the best matching distribution, minimizing the "KL-divergence" value.
    • minimizing KL-divergence value (E[log Q over P*]) between Q(z) and P*(z)

[d] Data value Prediction

Evidence is discussed in the process of inference (not in the prediction...?) Bayesian methods are appealing for prediction problems thanks to their ability to naturally incorporate both sample variability and parameter uncertainty into a predictive distribution. Let's say we train on data points X and Y. We want to predict the new Y at the end. In Bayesian Prediction, the predicted value is a weighted average of the output of our model over all possible values of the parameters, weighted by their posterior probability.
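The "weighted average over all possible parameter values" can be sketched by Monte Carlo for a coin-flip model. The Beta posterior, the number of draws and the names are illustrative assumptions; the closed-form value is included only as a check.

```python
import random

def predictive_prob_success(a_post, b_post, n_draws, rng):
    """Monte Carlo estimate of P(y_new = 1 | data), averaging the model over posterior draws."""
    total = 0.0
    for _ in range(n_draws):
        theta = rng.betavariate(a_post, b_post)  # one plausible parameter value
        total += theta  # the model's prediction P(y_new = 1 | theta) is theta itself
    return total / n_draws

rng = random.Random(2)  # fixed seed so the sketch is repeatable
a_post, b_post = 8.0, 4.0  # e.g. flat prior + 7 successes and 3 failures (made-up data)
estimate = predictive_prob_success(a_post, b_post, 20000, rng)
exact = a_post / (a_post + b_post)  # posterior mean, the exact predictive probability here
print(estimate, exact)
```

Drawing θ from the posterior is what supplies the weights: likely parameter values are sampled more often and so contribute more to the average.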

Real-time data?

An alternative perspective on the prediction method is Bayesian Prediction with Copulas. Handling data arriving in real time requires a flexible non-parametric model, and the Monte Carlo methods necessary to evaluate the predictive distribution in such cases can be too expensive to rerun each time new data arrives. With respect to this, the Bayesian Prediction with Copulas approach facilitates the prediction without computing a posterior.

  • Concept 01> Recursive nature of the updates in predictive distribution

    However, in cases where it is not possible to work directly with the posterior, this natural Bayesian updating formula is out of reach.

  • Concept 02> Let's work directly with the posterior: DPMixture of Gaussian? Beta? some "kernel"?

    In our context of estimating the predictive distribution in real time, it is not possible to look at the entire dataset all at once, thus we seek the flexibility of a non-parametric model, largely to avoid potential model misspecification. That is, it is necessary to start with a sufficiently flexible model that can adapt to the shape of the distribution as the data arrive. In these non-parametric cases, θ is not a finite-dimensional parameter, but it is an infinite-dimensional index - formula - of the distribution clusters (Gaussian, Beta, whatever...) that explain the dataset. The most common strategy, in the present context of modelling densities, is the so-called Dirichlet process mixture model. The problem is that given the posterior formula based on the full data, when new data formula arrives, the MCMC must be rerun on the full data to get the posterior formula or the predictive density formula. This can be prohibitively slow, thereby motivating a fast recursive approximation.

  • Concept 03> Gaussian Copula Density

    To circumvent the aforementioned computational difficulties in Bayesian updating in the predictive models, we turn to a new strategy: A Recursive Approximation with Copulas. A Copula as a mathematical object captures the joint behavior of two different Random Variables, each of which follows a different distribution, and returns a single bivariate distribution formula. Sklar's theorem implies that there exists a symmetric copula density formula such that formula. That is, for each Bayesian model, there exists a unique sequence {formula} of copula densities. This representation reveals that it is possible to directly and recursively update the predictive distribution without the assistance of MCMC. It has the advantage of directly estimating the predictive density and does not require numerical integration to compute normalising constants.

    For a Dirichlet process mixture model, with Gaussian kernel - N(x|u, 1) - and DP sample prior - formula where formula.

    The Gaussian Copula Density is formula. In detail, we consider the following recursive sequence of predictive densities: formula. But it's too complicated... On the CDF (distribution function) scale, the algorithm is a bit more transparent, that is, formula. The take-away message is that there exists a recursive update of the predictive density formula in the Dirichlet process mixture model formulation, characterised by a copula density. (???really???)

  • Add them up> "Recursive Algorithm"

    The choice of ρ is entirely up to the discretion of the researcher, with values closer to 1 corresponding to less smoothing (ρ=0.90 is a reasonable choice?). For the weights, a choice like formula=(i+1)^-r for r ∈ (0.5, 1]...as i grows, α decreases (r=1 as a default option?). In choosing the initial guess of formula, try to capture the support of the given dataset distribution? Since this predictive function is not certain (if there is little or no data to use as a guide), we go with some kernel density? such as t-distribution?? But we totally ignore the DP prior or kernel likelihood????????


2> Modeling

In a parametric method, we define a model that depends on some parameter "theta" and then we find optimal values for "theta" by taking the MLE or MAP. And as data becomes more and more complex, we need to add more and more parameters ourselves (think about LM's coefficients, linear? polynomial?), so for any given model the number of parameters is fixed.

  • Fixed number of parameters => so the complexity is limited.
  • Fast Inference coz you simply feed in the weights and the prediction is just a multiplication by the weights.
  • But training is complicated and takes time.

In a Non-parametric method, the number of parameters depends on the dataset size. That is, as the number of data points increases, the decision boundary becomes more and more complex.

  • Not Fixed number of parameters => so the complexity is arbitrary.
  • Slow Inference coz you have to process all the data points to make a prediction.
  • But training is simple coz it in most cases just remembers all points.

[Parametric]

  • A. Bayesian Network as PGM
    • Bayesian Network is "Directed" and "Acyclic". It cannot have interdependent variables.

In settings where data is scarce and precious and hard to obtain, it is hard to conduct a large-scale controlled experiment, thus we cannot spare any effort to make the best use of available input. With small data, it is important to **quantify uncertainty** and that's precisely what the Bayesian approach is good at. In Bayesian Modeling, there are two main flavours:

  • B. Statistical Modeling:
    • Multilevel/Hierarchical Modeling(Regression?)
  • C. Probabilistic Machine Learning approach: using data for a computer to learn automatically from it. It outputs probabilistic predictions...that's why probabilistic.. also these probabilities are only statements of belief from a classifier.
    • 1) Generative modeling: One can sample or generate examples from it. Compared with classifiers (discriminative models that model P(y|x) to discriminate between classes based on x), a generative model is concerned with the joint distribution P(y,x). It's more difficult to estimate that distribution, but it allows sampling and of course one can get P(y|x) from P(y,x).
      • LDA: You start with a matrix where rows are documents, columns are words and each element is a count of a given word in a given document. LDA "factorizes" this matrix of size n x d into two matrices, documents/topics (n x k) and topics/words (k x d). You can't multiply those two matrices to get the original, but since the appropriate rows/columns sum to 1, you can "generate" a document.

[Non-Parametric]

  • A. Bayesian non-parametric Modeling: the number of parameters in a model can grow as more data become available. This is similar to SVM, for example, where the algorithm chooses support vectors from the training points. Nonparametrics include the Hierarchical Dirichlet Process version of LDA (where the number of topics chooses itself automatically), and Gaussian Processes.
    • 1) Gaussian Processes: It is somewhat similar to SVM - both use kernels and have similar scalability (which has been vastly improved throughout the years by using approximations).

      • A natural formulation for GP is regression, with classification as an afterthought. For SVM, it's the other way around.
      • While most "normal" methods provide only point estimates, "Bayesian" counterparts like GP also output uncertainty estimates, while SVMs do not.
      • Even a sophisticated method like GP usually operates on an assumption of homoscedasticity, that is, "uniform noise" levels. In reality, noise might differ across input space (be heteroscedastic).
      • GP outputs a mean curve and CI(cov) curves.
    • 2) Dirichlet Process: The infinite-dimensional generalization of the Dirichlet distribution is the Dirichlet process. In short, the Dirichlet Process is a generalization of Dirichlet distributions where a sample from a DP is itself a (discrete) distribution whose probabilities over any finite partition follow a Dirichlet distribution. Interestingly, the generalization allows the Dirichlet Process to have an infinite number of components (or clusters), which means that there is no limit on the number of Hyper-parameters. Using DP, we sample the proportion of each element in a vector or multinomial random variable from an undefined dimension that can go to infinity.

> Example 01. Linear Model..some different ways to address Coefficients and error!

  • a) Frequentist LM

    • typically goes through the process of 1.checking the residuals against a set of assumptions, 2.adjusting/selecting features, 3.rerunning the model, 4.checking the assumptions again....
      • Frequentist diagnosis is based on the fitted model using the MLE of the model parameters.
        • "likelihood": f(x|β)
        • "likelihood function": L(x,x,x,x|β) by fitting a distribution to the given data points then...multiplying them, then differentiating to get the best β. But the result is just a point estimate (also subject to the overfitting issue)...it cannot address Uncertainty !
        • subject to overfitting!
  • b) Bayesian Hierarchical LM

    • It allows a useful mechanism to deal with insufficient data, or poorly distributed data. If we have fewer data points, the posterior distribution will be more spread out. As the amount of data points increases, the likelihood washes out the prior.
    • It puts a prior on the coefficients and on the noise so that in the absence of data, the priors can take over !
    • Once fitting it to our data, we can ask:
      • What is the estimated linear relationship
        • what is the confidence on that relation, and the full posterior distribution on that relation?
      • What is the estimated noise and the full posterior distribution on that noise?
      • What is the estimated gradient and the full posterior distribution on that gradient?
  • Posterior Computation by Bayesian Inference: How to avoid calculating the Evidence?

    • A> When we want to get the model parameter, the Evidence is always a problem. There is a way to avoid calculating the **Evidence**. In MAP, we don't need the "Evidence". But the problem is that we cannot use its result as a prior for the next step since the output is a single point estimate.
      • Below is MAP for LM parameter vector w.
      • The result says it's the traditional MLE value + L2 regularization term (because of the prior) that fixes overfitting.
      • But it still does not have any representation of Uncertainty!
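A minimal sketch of the MAP result for a one-coefficient linear model without intercept. It assumes Gaussian noise with known variance and a zero-mean Gaussian prior on w; under those assumptions, maximizing the posterior equals minimizing squared error plus an L2 penalty of (noise variance / prior variance) times w squared. The function name, data and variances are made up.

```python
def fit_slope(xs, ys, l2_penalty=0.0):
    """Slope w of y = w * x minimizing sum((y - w*x)^2) + l2_penalty * w^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + l2_penalty)

xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 1.9, 4.2, 5.8]  # made-up data, roughly y = 2x

noise_var = 1.0  # assumed known observation noise variance
prior_var = 0.5  # assumed variance of the zero-mean Gaussian prior on w

w_mle = fit_slope(xs, ys)  # no prior: ordinary least squares
w_map = fit_slope(xs, ys, l2_penalty=noise_var / prior_var)  # ridge-style estimate
print(w_mle, w_map)
```

The MAP slope is shrunk toward the prior mean of zero, and like the MLE it is still a single number with no uncertainty attached.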
    • B> There is another way to avoid computing the **Evidence** - Use a Conjugate prior. We can, but do not need to, compute the Evidence.

      • A Conjugate Prior, as a member of a certain family of distributions, is conjugate to a likelihood if the resulting posterior is also a member of the same family.

        • Discrete Likelihood
          • Beta prior is conjugate to Bernoulli likelihood. (so Bernoulli model? then choose Beta)
          • Beta prior is conjugate to Binomial likelihood. (so Binomial model? then choose Beta)
          • Dirichlet prior is conjugate to Multinomial likelihood. (so Multinomial model? then choose Dirichlet)
          • Gamma prior is conjugate to Poisson likelihood. (so Poisson model? then choose Gamma)
          • Beta prior is conjugate to Geometric likelihood. (so Geometric model? then choose Beta)
        • Continuous Likelihood
          • Gaussian prior is conjugate to Gaussian likelihood + known SD.
          • Inverse Gamma prior is conjugate to Gaussian likelihood + known μ.
          • Pareto prior is conjugate to Uniform likelihood.
          • Gamma prior is conjugate to Pareto likelihood.
          • Gamma prior is conjugate to Exponential likelihood.
        • If the likelihood is a member of the exponential family, the presence of a conjugate prior is always guaranteed.
      • Gaussian Prior for Gaussian likelihood + known SD

      • Now we can take advantage of having access to the full posterior distribution of the model parameter (Coefficient): we can either obtain a point estimator from this distribution (e.g. posterior mean, posterior median, ...) or conduct the same analysis using this estimate...now we can say Uncertainty.
      • Check the goodness of fit of the estimated model based on the predictive residuals. It is possible to conduct the same type of diagnostic analysis as for the Frequentist's LM.
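For the "Gaussian prior for Gaussian likelihood + known SD" case, the posterior is available in closed form without touching the Evidence. The sketch below uses the simplest version, a prior on a Gaussian mean rather than on regression coefficients, with made-up numbers and a hypothetical function name.

```python
def gaussian_posterior(prior_mean, prior_sd, data, noise_sd):
    """Posterior mean and SD of a Gaussian mean, given a Gaussian prior and known noise SD."""
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = len(data) / noise_sd ** 2
    post_precision = prior_precision + data_precision  # precisions add up
    post_mean = (prior_precision * prior_mean + sum(data) / noise_sd ** 2) / post_precision
    return post_mean, post_precision ** -0.5

data = [2.1, 1.8, 2.4, 2.0]  # made-up observations

few = gaussian_posterior(0.0, 1.0, data[:1], noise_sd=1.0)  # one observation
more = gaussian_posterior(0.0, 1.0, data, noise_sd=1.0)     # all four observations
print(few, more)
```

With more observations the posterior mean moves from the prior mean toward the sample mean and the posterior SD shrinks, which is the "likelihood washes out the prior" behaviour. The full distribution is returned, so uncertainty is available alongside any point estimate.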
    • C> To approximate the posterior, we use the technique of drawing random samples from a posterior distribution as one application of Monte Carlo methods.

        1. Specify a prior π(β).
        2. Create a model mapping the training inputs to the training outputs.
        3. Have an MCMC algorithm draw samples from the posterior distributions for the parameters.

3> Model Comparison

  • What do you want the model to do well at?
  • How do both regularizing priors and information criteria help you improve and estimate the "out-of-sample" (yet-to-be-observed) deviance of a model?
    • deviance: approximation of relative distance from "perfect accuracy".

(A) Information Theory

What is "information"? How much have we learned? It refers to the reduction in uncertainty when we learn an outcome.

1) Entropy and Uncertainty

How to measure uncertainty? There is only one function: Information Entropy.

              p <- c(0.3, 0.7)
              H <- -sum( p*log(p) )

It gives...0.61: it's quite uncertain... High Entropy..Chaos..high disorder..very big uncertainty!

              p <- c(0.01, 0.99)
              H <- -sum( p*log(p) )

It gives...0.06: That's nice! it's quite certain... Low Entropy..low disorder..very small uncertainty!

ii) Entropy and Accurateness

How to use Data Entropy to say how far your model is from the target model? The central lies in: Kullback-Leibler Divergence.

  • Suppose at that place is a true distribution (with p1, p2,..), simply we only have a slightly different distribution (with q1, q2,..) to describe the truthful distribution. How much additional uncertainty we might introduce equally a event?
  • The boosted uncertainty introduced from using the distribution in our mitt can exist expressed equally: E[ log(p)-log(q) ], but there is a catch! Yous demand to utilise "cross entropy"..because you are trying to find p, using q.

Since predictive models specify probabilities of events(obv), We tin use KL Divergence to compare the accurateness of models.
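
As a minimal sketch of the quantities above, the following Python snippet (standard library only) computes entropy, cross entropy, and KL divergence for discrete distributions. The candidate distributions `q` and `r` and the function names are assumptions chosen for illustration; `p` reuses the 0.3 / 0.7 example from the entropy section.

```python
import math

def entropy(p):
    # H(p) = -sum p_i * log(p_i)
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    # H(p, q) = -sum p_i * log(q_i): events occur with p but are predicted with q.
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    # D_KL(p || q) = sum p_i * (log(p_i) - log(q_i)) = H(p, q) - H(p)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)

p = [0.3, 0.7]    # "true" distribution
q = [0.25, 0.75]  # a candidate close to p (assumed)
r = [0.01, 0.99]  # a candidate far from p (assumed)

print("KL(p || q) =", kl_divergence(p, q))
print("KL(p || r) =", kl_divergence(p, r))
print("KL(r || p) =", kl_divergence(r, p))  # not symmetric
```

The candidate closer to `p` gets the smaller divergence, and swapping the arguments changes the value, which is why the direction (true distribution first) matters.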

3) Divergence Estimation

So how to estimate the divergence? There is "no way" to access the target p directly. Luckily, we only compare the divergences of different candidates (r vs q) using 'deviance' (a model fit measure). When we subtract the divergence of r from that of q, the E[log(p)] term cancels, so all we need to know is E[log(r)] and E[log(q)], which are exactly like what you've been using in MLE. Summing the log probabilities that r or q assigns to the observations x gives an approximation of E[log(r)] or E[log(q)], and we don't have to know the real p inside the expectation terms. So we can compare E[log(r)] vs E[log(q)] to get an estimate of the relative distance of each model from the target. (Having said that, the absolute magnitude of E[log(r)] or E[log(q)] tells us nothing on its own: we do not know whether each is a good model or a bad model. But the difference E[log(r)] - E[log(q)] informs us about which one is closer to the target p.)

SUM( log(pdf) ) (the total log probability score) is the gold standard way to compare the predictive accuracy of different models. It is an estimate of the cross entropy E[log(pdf)] without multiplying by the probability term. To compute this, we need the full posterior distribution, because in order to get log(pdf) for each obv we need to find log( E[probability of obv] ), where the E[.] is taken over the full posterior distribution of θ. This is called the total "log pointwise predictive density", a.k.a. "total log probability score".

Deviance is simply the "log pointwise predictive density" multiplied by -2. The model with a "smaller deviance value" is expected to show higher accuracy.
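
A minimal Python sketch (standard library only) of the log pointwise predictive density and the deviance. Everything concrete here is an assumption for illustration: the data, the Normal(mu, 1) model with known SD, and the stand-in "posterior samples", which are drawn directly from the flat-prior posterior Normal(mean(y), 1/sqrt(n)) instead of coming from an MCMC run.

```python
import math
import random

def normal_logpdf(y, mu, sd=1.0):
    return -0.5 * math.log(2 * math.pi * sd**2) - 0.5 * ((y - mu) / sd) ** 2

def lppd(data, mu_samples):
    # For each obv: log( E[ p(y_i | theta) ] ), averaging over posterior samples,
    # then sum over observations.
    total = 0.0
    for y in data:
        log_ps = [normal_logpdf(y, mu) for mu in mu_samples]
        m = max(log_ps)  # log-sum-exp trick for numerical stability
        total += m + math.log(sum(math.exp(lp - m) for lp in log_ps) / len(log_ps))
    return total

random.seed(0)
data = [4.8, 5.6, 5.1, 4.4, 5.9, 5.2, 4.7, 5.3]  # made-up observations
n = len(data)
# Stand-in posterior samples for mu (flat prior, known SD = 1).
mu_samples = [random.gauss(sum(data) / n, 1 / math.sqrt(n)) for _ in range(2000)]
# A deliberately poor "posterior", shifted away from the data, for comparison.
bad_samples = [mu + 2.0 for mu in mu_samples]

lppd_good = lppd(data, mu_samples)
deviance_good = -2 * lppd_good
deviance_bad = -2 * lppd(data, bad_samples)
print("deviance (reasonable posterior):", round(deviance_good, 2))
print("deviance (shifted posterior):   ", round(deviance_bad, 2))
```

Only the comparison between the two deviances is meaningful; the absolute values depend on the number of observations.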

However, just like R^2, it's a measure of retrodictive accuracy rather than predictive accuracy. It always improves as the model gets more complex, so judging models by it alone is absurd!

  • In-sample (with training set)
  • Out-of-sample (with testing set)

Then...what predictive criteria are available?

4) Information Criteria

  • AIC (Akaike IC) is the most restrictive coz...
      1. prior should be flat
      1. posterior should follow Gaussian
      1. sample size should be much greater than the No. of parameters
  • DIC (Deviance IC)
      1. Ok (informative priors are allowed).
      1. posterior should follow Gaussian
      1. sample size should be much greater than the No. of parameters
  • WAIC (Widely Applicable IC)
    • No assumption about the shape of the posterior....OK?

WAIC is simply the "log pointwise predictive density" minus a "penalty proportional to the variance" in the predictions, multiplied by -2 to put it on the deviance scale: WAIC = -2 * ( lppd - penalty ).

The penalty term is the sum, over obv, of the variance in log probability (likelihood) of each obv across the posterior. Each obv has its own penalty score that measures overfitting risk (we are assessing overfitting risk at the level of each obv). FYI, this penalty term is the analogue of the number of parameters in AIC.
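
A minimal Python sketch (standard library only) of WAIC with its pointwise penalty. The data (with one deliberately extreme value), the Normal(mu, 1) model with known SD, and the stand-in posterior samples are all assumptions for illustration; in practice the samples would come from an MCMC run.

```python
import math
import random

def normal_logpdf(y, mu, sd=1.0):
    return -0.5 * math.log(2 * math.pi * sd**2) - 0.5 * ((y - mu) / sd) ** 2

def waic(data, mu_samples):
    lppd = 0.0
    pointwise_penalty = []
    s = len(mu_samples)
    for y in data:
        log_ps = [normal_logpdf(y, mu) for mu in mu_samples]
        # lppd term: log of the posterior-averaged probability of this obv.
        m = max(log_ps)
        lppd += m + math.log(sum(math.exp(lp - m) for lp in log_ps) / s)
        # Penalty term: variance of the log probability across posterior samples.
        mean_lp = sum(log_ps) / s
        pointwise_penalty.append(sum((lp - mean_lp) ** 2 for lp in log_ps) / (s - 1))
    return -2 * (lppd - sum(pointwise_penalty)), lppd, pointwise_penalty

random.seed(0)
data = [4.8, 5.6, 5.1, 4.4, 5.9, 5.2, 4.7, 8.0]  # made-up; the last value is extreme
n = len(data)
# Stand-in posterior samples for mu (flat prior, known SD = 1).
mu_samples = [random.gauss(sum(data) / n, 1 / math.sqrt(n)) for _ in range(2000)]

waic_value, lppd_value, penalties = waic(data, mu_samples)
worst = penalties.index(max(penalties))
print("WAIC:", round(waic_value, 2), " penalty:", round(sum(penalties), 2))
print("obv with the largest penalty:", data[worst])
```

The per-observation penalties show where the overfitting risk sits: the observation the posterior is least sure how to predict carries the largest share.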

Q. Which observed data point contributes to overfitting the most?

Q. WAIC computed on train/test gives different values because... the sample size scales the deviance. It is the distance b/w models that is useful, not the absolute value of the deviance.

5) Comparison

When there are several plausible (and hopefully un-confounded) models for the same set of observations, how should we compare the accuracy of these models? Following the fit to the sample is no good, because fit will always favor more complex models. Information divergence is the right measure of model accuracy, but even it will just lead us to choose more and more complex and wrong models. We need to somehow evaluate models out-of-sample. How can we do that? A meta-model of forecasting tells us two important things.

  • First, flat priors produce bad predictions. Regularizing priors (priors which are skeptical of extreme parameter values) reduce fit to sample but tend to improve predictive accuracy.
  • Second, we can get a useful guess of predictive accuracy with the criteria: cross-validation (CV), Pareto-Smoothed Importance Sampling CV (PSIS-CV), and WAIC.
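
The first point can be illustrated with a small simulation in Python (standard library only). This is a sketch under strong assumptions, all chosen for illustration: a Normal(mu, 1) model with known SD, a true mean of 0.3, tiny training sets of 5 observations, a regularizing prior Normal(0, 1), and a plug-in deviance evaluated at the posterior mean instead of the full lppd. The out-of-sample advantage relies on the true mean not being extreme relative to the prior.

```python
import math
import random

def deviance(data, mu, sd=1.0):
    # Plug-in deviance: -2 * sum of log Normal(y | mu, sd) at a point estimate of mu.
    return sum(math.log(2 * math.pi * sd**2) + ((y - mu) / sd) ** 2 for y in data)

def posterior_mean(data, prior_sd, prior_mean=0.0, sd=1.0):
    # Conjugate Gaussian update with known sd; prior_sd = inf is the flat prior.
    prior_precision = 1.0 / prior_sd**2
    data_precision = len(data) / sd**2
    sample_mean = sum(data) / len(data)
    return (prior_precision * prior_mean + data_precision * sample_mean) / (
        prior_precision + data_precision
    )

rng = random.Random(42)
true_mu, n_train, n_test, n_reps = 0.3, 5, 50, 500  # assumed settings
totals = {"flat": [0.0, 0.0], "regularizing": [0.0, 0.0]}  # [in-sample, out-of-sample]
for _ in range(n_reps):
    train = [rng.gauss(true_mu, 1.0) for _ in range(n_train)]
    test = [rng.gauss(true_mu, 1.0) for _ in range(n_test)]
    for name, prior_sd in (("flat", math.inf), ("regularizing", 1.0)):
        mu_hat = posterior_mean(train, prior_sd)
        totals[name][0] += deviance(train, mu_hat) / n_reps
        totals[name][1] += deviance(test, mu_hat) / n_reps

for name, (d_in, d_out) in totals.items():
    print(f"{name:>12}: in-sample {d_in:.2f}, out-of-sample {d_out:.2f}")
```

Averaged over the simulated training sets, the flat prior fits the sample at least as well (the sample mean minimizes in-sample deviance by construction), while the regularizing prior gives the lower out-of-sample deviance in this setup.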

Regularizing priors and CV / PSIS-CV / WAIC are complementary. Regularization reduces overfitting, and predictive criteria measure overfitting... T.B.D

(B) Non Parametric Prior

(C) Empirical Bayes Prior


Source: https://github.com/mainkoon81/ooo-Bayesian-Introduction
