# using the statsmodels formula API (imported as smf):
# pass a Patsy model formula plus a data dictionary/DataFrame
model = smf.ols("y ~ 1 + x1 + x2 + x1:x2", df_data)
result = model.fit()
result.params
/home/bjpcjp/anaconda3/lib/python3.6/site-packages/statsmodels/stats/stattools.py:72: ValueWarning: omni_normtest is not valid with less than 8 observations; 5 samples were given.
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 6.860e+27
Date:                Sun, 19 May 2019   Prob (F-statistic):           1.46e-28
Time:                        09:19:27   Log-Likelihood:                 151.41
No. Observations:                   5   AIC:                            -296.8
Df Residuals:                       2   BIC:                            -298.0
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.5556   6.16e-14  -9.02e+12      0.000      -0.556      -0.556
x1             1.8889    2.3e-13   8.22e+12      0.000       1.889       1.889
x2            -0.8889   7.82e-14  -1.14e+13      0.000      -0.889      -0.889
x1:x2      -7.772e-16   7.22e-15     -0.108      0.924   -3.18e-14    3.03e-14
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   0.071
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.399
Skew:                           0.401   Prob(JB):                        0.819
Kurtosis:                       1.871   Cond. No.                     6.86e+17
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.31e-31. This might indicate that there are strong multicollinearity problems or that the design matrix is singular.
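The fit above can be reproduced end to end with a minimal, self-contained sketch. The sample data below is illustrative (the original `df_data` is not shown), with `y` built from known coefficients so the recovered parameters are predictable:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# illustrative data: y depends linearly on x1 and x2, with no real interaction
rng = np.random.default_rng(0)
df_data = pd.DataFrame({"x1": rng.normal(size=50), "x2": rng.normal(size=50)})
df_data["y"] = (-0.5 + 2.0 * df_data["x1"] - 1.0 * df_data["x2"]
                + rng.normal(scale=0.01, size=50))

# "y ~ 1 + x1 + x2 + x1:x2": intercept, two main effects, and their interaction
model = smf.ols("y ~ 1 + x1 + x2 + x1:x2", df_data)
result = model.fit()
print(result.params)
```

Because the noise is tiny, the fitted coefficients land very close to the true values (-0.5, 2.0, -1.0) and the interaction coefficient is near zero, mirroring the negligible `x1:x2` estimate in the summary above.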
from collections import defaultdict
data = defaultdict(lambda: np.array([1, 2, 3]))
# Intercept & a correspond to a constant plus a linear dependence on a.
patsy.dmatrices("y ~ a", data=data)[1].design_info.term_names
['Intercept', 'a']
# now add a second independent variable, "b"
patsy.dmatrices("y ~ 1 + a + b", data=data)[1].design_info.term_names
['Intercept', 'a', 'b']
# the intercept can be removed with -1
patsy.dmatrices("y ~ -1 + a + b", data=data)[1].design_info.term_names
['a', 'b']
# automatic expansion: a * b == a + b + a:b
patsy.dmatrices("y ~ a * b", data=data)[1].design_info.term_names
['Intercept', 'a', 'b', 'a:b']
# higher-order expansions
patsy.dmatrices("y ~ a * b * c", data=data)[1].design_info.term_names
# Categorical data encoding is binary (treatment indicator) by default.
# The encoding can be changed/extended, e.g. encoding categories with
# orthogonal polynomials instead of treatment indicators, using C(a, Poly):
patsy.dmatrices("y ~ -1 + C(a, Poly)", data=data, return_type="dataframe")[1]
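The default treatment coding can be seen directly in the design matrix's column names. A minimal sketch with illustrative data (the three-level variable `a` is made up for this example):

```python
import patsy

# illustrative data with a three-level categorical variable
data = {"y": [1.0, 2.0, 3.0], "a": ["low", "mid", "high"]}

# C(a) marks "a" as categorical; treatment (dummy) coding is the default,
# producing one indicator column per non-reference level plus the intercept
dm = patsy.dmatrices("y ~ C(a)", data=data)[1]
print(dm.design_info.column_names)
```

With three levels, one level is absorbed into the intercept as the reference, leaving two indicator columns.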
# QQ plot: compares sample quantiles with theoretical quantiles;
# the points should fall near a straight line if the sampled values are normally distributed
fig, ax = plt.subplots(figsize=(8, 4))
smg.qqplot(result.resid, ax=ax)
fig.tight_layout()
#fig.savefig("ch14-qqplot-model-1.pdf")
The significant deviation from the line above means the residuals are unlikely to be a sample from a normal distribution, so the model needs refinement.
One fix is to add the missing interaction term to the Patsy formula and repeat the fit.
Discrete regression, logistic regression
Discrete (e.g. binary or count-valued) response variables require different techniques, because ordinary linear regression assumes a normally distributed continuous response variable.
statsmodels discrete regression support: the Logit (logistic regression), Probit (uses the CDF of the normal distribution to map the linear predictor to [0, 1]), MNLogit (multinomial logistic regression), and Poisson (Poisson regression) classes.
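Before the Poisson example below, a minimal Logit sketch with synthetic binary data (the variable names and true coefficients here are illustrative, not from the original dataset):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# illustrative binary-outcome data: P(y=1) increases with x
rng = np.random.default_rng(1)
x = rng.normal(size=200)
p = 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * x)))      # true logistic model
y = (rng.uniform(size=200) < p).astype(int)
df_bin = pd.DataFrame({"x": x, "y": y})

# Logit fits the log-odds of y as a linear function of x, via maximum likelihood
result = smf.logit("y ~ x", data=df_bin).fit(disp=0)
print(result.params)
```

The fitted slope comes out positive and near the true value 2.0, confirming that larger `x` raises the probability of `y = 1`.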
# attempt to fit a Poisson process
# the Patsy formula "discoveries ~ 1" models the discoveries variable with only an intercept coefficient
model = smf.poisson("discoveries ~ 1", data=df)
result=model.fit()
result.summary()
Optimization terminated successfully.
Current function value: 2.168457
Iterations 1
                          Poisson Regression Results
==============================================================================
Dep. Variable:            discoveries   No. Observations:                  100
Model:                        Poisson   Df Residuals:                       99
Method:                           MLE   Df Model:                            0
Date:                Sun, 19 May 2019   Pseudo R-squ.:                   0.000
Time:                        10:19:48   Log-Likelihood:                -216.85
converged:                       True   LL-Null:                       -216.85
                                        LLR p-value:                       nan
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.1314      0.057     19.920      0.000       1.020       1.243
==============================================================================
# the lambda parameter of the Poisson distribution is recovered via the exponential function;
# use it to compare the histogram of observed counts against the theoretical results
lmbda = np.exp(result.params)
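The exp(intercept) step has a simple interpretation: for an intercept-only Poisson model, the maximum-likelihood rate estimate equals the sample mean. A self-contained sketch with synthetic count data (the rate 3.0 is illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# illustrative count data drawn from a Poisson distribution with rate 3
rng = np.random.default_rng(3)
df_counts = pd.DataFrame({"discoveries": rng.poisson(3.0, size=100)})

# intercept-only model: the fitted intercept is log(lambda)
result = smf.poisson("discoveries ~ 1", data=df_counts).fit(disp=0)
lmbda = np.exp(result.params["Intercept"])

# lambda-hat coincides with the sample mean of the counts
print(lmbda, df_counts["discoveries"].mean())
```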
# reconstructed setup (not shown in the original output): the fitted Poisson
# distribution, its confidence-interval bounds, and the observed-count histogram
X = stats.poisson(lmbda)
X_ci_l = stats.poisson(np.exp(result.conf_int().values)[0, 0])
X_ci_u = stats.poisson(np.exp(result.conf_int().values)[0, 1])
v, k = np.histogram(df.discoveries, bins=12, range=(0, 12), density=True)

fig, ax = plt.subplots(1, 1, figsize=(12, 4))
ax.bar(k[:-1], v,
       color="steelblue", align='center',
       label='Discoveries per year')
ax.bar(k - 0.125, X_ci_l.pmf(k),
       color="red", alpha=0.5, align='center', width=0.25,
       label='Poisson fit (CI, lower)')
ax.bar(k, X.pmf(k),
       color="green", align='center', width=0.5,
       label='Poisson fit')
ax.bar(k + 0.125, X_ci_u.pmf(k),
       color="red", alpha=0.5, align='center', width=0.25,
       label='Poisson fit (CI, upper)')
ax.legend()
fig.tight_layout()
#fig.savefig("ch14-discoveries-per-year.pdf")

# conclusion: this dataset is NOT well described by a Poisson process
Time series
Time series analysis is not the same as ordinary regression: successive time series samples cannot be regarded as independent random samples.
An example model type for time series is the autoregressive (AR) model, in which a future value depends on the p preceding values. AR is a special case of the ARMA (autoregressive moving-average) model.