Machine Learning 2021. 3. 10. 15:06

scikit-learn Machine Learning

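Residual: in a statistical model that describes the relationship between a dependent variable and independent variables, the difference between the value of the dependent variable estimated by the model and the value actually observed. This difference is also interpreted as the error, and it carries the uncertainty that the model does not explain.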

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
# scipy: optimization, interpolation, calculus (integration and differentiation), FFT; covers much of what MATLAB offers

x= np.array([0.0,1.0,2.0,3.0,4.0,5.0])
y= np.array([0.0,0.8,0.9,0.1,-0.8,-1.0])
z = np.polyfit(x,y,3) # find the polynomial coefficients; 3 is the degree (a cubic)
print(z)
p = np.poly1d(z) # build a callable polynomial from the coefficients
print("equation",p)
print(p(0.5))
print(p(3.5))
print(p(10))
print(p(3.0))

[ 0.08703704 -0.81349206  1.69312169 -0.03968254]
equation          3          2
0.08704 x - 0.8135 x + 1.693 x - 0.03968
0.6143849206349201
-0.347321428571432
22.579365079365022
0.06825396825396512
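
The residual defined at the top of this post can be computed directly for the cubic fit above. The following is a minimal sketch, not part of the original notebook; the names `fitted`, `residuals` and `sse` are introduced here for illustration only.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.0, 0.8, 0.9, 0.1, -0.8, -1.0])

p = np.poly1d(np.polyfit(x, y, 3))   # cubic least-squares fit
fitted = p(x)                        # values estimated by the model at the observed x
residuals = y - fitted               # observed minus estimated
sse = np.sum(residuals ** 2)         # sum of squared residuals, the quantity polyfit minimizes

print(residuals)
print(sse)
```

Because the polynomial has a constant term, the residuals of a least-squares fit sum to (numerically) zero, which is a quick sanity check on the fit.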

Gradient descent: 2-D example graph

The curve illustrates a local minimum

p30= np.poly1d(np.polyfit(x,y,3)) # turn the fitted coefficients into a polynomial
print(p30(4)) # value returned by the polynomial at x = 4
xp = np.linspace(-2,6,100) # 100 evenly spaced points from -2 to 6
_ = plt.plot(x,y,'.',xp,p(xp),'-',xp,p30(xp),'--')
plt.ylim(-2,2)
plt.show()

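Gradient descent: 3-D example graph

Visualization of the global minimum

The point that looked lowest in the 2-D view turns out, in the 3-D view, to be only a local minimum, so the search moves on toward the global minimum for a better optimum.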

import matplotlib as mpl
from mpl_toolkits.mplot3d import Axes3D
def fm(p):
    x,y = p
    return (np.sin(x) + 0.05 * x ** 2
           + np.sin(y) + 0.05 * y **2)

x= np.linspace(-10,10,50)
y= np.linspace(-10,10,50)

X,Y = np.meshgrid(x,y)
Z =fm((X,Y))

fig = plt.figure(figsize=(9,6))
ax = fig.add_subplot(projection='3d') # fig.gca(projection='3d') no longer works in recent Matplotlib
surf = ax.plot_surface(X,Y,Z,rstride=2,cstride=2,cmap=mpl.cm.coolwarm, linewidth=0.5,antialiased=True)
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('f(x,y)')
fig.colorbar(surf, shrink=0.5,aspect=5)
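# Momentum keeps some velocity in the direction already travelled, so the descent can roll through small bumps
# GD: gradient descent => used to train an ANN (artificial neural network)
# Local-minimum problem -> momentum
# Training-time problem: a learning-rate issue; take large steps at first, then progressively smaller ones
# Adam optimizer (combines momentum with per-parameter adaptive step sizes)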
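
The local-minimum problem mentioned in the comments can be reproduced with plain gradient descent on a one-dimensional slice of the surface, f(x) = sin(x) + 0.05 x^2. This is a sketch under assumptions: the helper `gradient_descent`, the learning rate and the two starting points are chosen here for illustration and do not come from the original notebook.

```python
import numpy as np

def f(x):
    # One-dimensional slice of the surface plotted above
    return np.sin(x) + 0.05 * x ** 2

def grad_f(x):
    # Analytic derivative of f
    return np.cos(x) + 0.1 * x

def gradient_descent(start, learning_rate=0.1, n_steps=500):
    # Plain gradient descent: repeatedly step against the gradient
    x = start
    for _ in range(n_steps):
        x = x - learning_rate * grad_f(x)
    return x

x_from_right = gradient_descent(6.0)   # ends in the nearby local minimum
x_from_zero = gradient_descent(0.0)    # ends in the global minimum

print(x_from_right, f(x_from_right))
print(x_from_zero, f(x_from_zero))
```

The two runs differ only in their starting point, yet they stop in different valleys. Momentum, learning-rate schedules and Adam are ways of making the result less dependent on where the descent starts, although none of them guarantees the global minimum.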

Finding the minimum with scipy (brute-force grid search, then downhill simplex)

import scipy.optimize as spo
def fo(p):
    x,y = p
    z = np.sin(x) + 0.05 * x ** 2 + np.sin(y) + 0.05 * y **2
    if output:  # print each evaluated point while the global flag is set
        print('%8.4f %8.4f %8.4f' % (x,y,z))
    return z
output = True
opt1 = spo.brute(fo, ((-10,10.1,5),(-10,10.1,5)),finish=None)
print(opt1)
output = False
opt1 = spo.brute(fo, ((-10,10,0.1),(-10,10.1,5)),finish=None)
print(opt1)
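# brute() evaluates fo on a grid and returns the best grid point. An ANN also searches for an optimum, but through gradient-based updates built on matrix products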

-10.0000 -10.0000  11.0880
-10.0000  -5.0000   7.7529
-10.0000   0.0000   5.5440
-10.0000   5.0000   5.8351
-10.0000  10.0000  10.0000
-5.0000 -10.0000   7.7529
-5.0000  -5.0000   4.4178
-5.0000   0.0000   2.2089
-5.0000   5.0000   2.5000
-5.0000  10.0000   6.6649
  0.0000 -10.0000   5.5440
  0.0000  -5.0000   2.2089
  0.0000   0.0000   0.0000
  0.0000   5.0000   0.2911
  0.0000  10.0000   4.4560
  5.0000 -10.0000   5.8351
  5.0000  -5.0000   2.5000
  5.0000   0.0000   0.2911
  5.0000   5.0000   0.5822
  5.0000  10.0000   4.7471
10.0000 -10.0000  10.0000
10.0000  -5.0000   6.6649
10.0000   0.0000   4.4560
10.0000   5.0000   4.7471
10.0000  10.0000   8.9120
[0. 0.]
[-1.4  0. ]

output = True
opt2 = spo.fmin(fo, opt1,xtol=0.001, ftol=0.001, maxiter=15,maxfun=20)
opt2
fm(opt2)
output = False
spo.fmin(fo, (2.0,2.0), maxiter=250) # downhill simplex (Nelder-Mead) method; from (2, 2) it settles in a local minimum

-1.4000   0.0000  -0.8874
-1.4700   0.0000  -0.8869
-1.4000   0.0003  -0.8872
-1.3300   0.0003  -0.8825
-1.4350   0.0001  -0.8878
-1.4350  -0.0002  -0.8880
-1.4525  -0.0004  -0.8879
-1.4700  -0.0001  -0.8870
-1.4175  -0.0000  -0.8878
-1.4175  -0.0003  -0.8881
-1.4088  -0.0005  -0.8881
-1.4263  -0.0006  -0.8885
-1.4306  -0.0009  -0.8888
-1.4044  -0.0012  -0.8887
-1.4263  -0.0016  -0.8895
-1.4350  -0.0022  -0.8900
-1.4613  -0.0019  -0.8892
-1.4656  -0.0032  -0.8903
-1.4831  -0.0044  -0.8905
-1.4569  -0.0046  -0.8920
-1.4547  -0.0060  -0.8934
Warning: Maximum number of function evaluations has been exceeded.
Optimization terminated successfully.
         Current function value: 0.015826
         Iterations: 46
         Function evaluations: 86
array([4.2710728 , 4.27106945])
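
The runs above suggest a two-stage pattern: a coarse global scan with `brute`, then a local refinement with `fmin` started from the best grid point. The sketch below assumes a grid step of 0.5 and tighter tolerances than the notebook used; the names `coarse`, `refined` and `local_only` are illustrative.

```python
import numpy as np
import scipy.optimize as spo

def f(p):
    # Same objective as fo above, without the printing
    x, y = p
    return np.sin(x) + 0.05 * x ** 2 + np.sin(y) + 0.05 * y ** 2

# Stage 1: evaluate f on a coarse grid and keep the best grid point
coarse = spo.brute(f, ((-10, 10.1, 0.5), (-10, 10.1, 0.5)), finish=None)

# Stage 2: refine locally with the downhill simplex method
refined = spo.fmin(f, coarse, xtol=1e-4, ftol=1e-4, disp=False)

# For comparison: a purely local search from an arbitrary start
local_only = spo.fmin(f, (2.0, 2.0), disp=False)

print(coarse, refined, f(refined))
print(local_only, f(local_only))
```

The local search from (2, 2) stops near (4.27, 4.27), as in the output above, while the two-stage search reaches a clearly lower function value.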

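Linear regression: prediction using a slope and an intercept

Assumptions (from statistics)
- Linearity: for nonlinear relationships, scikit-learn combines polynomial features with linear regression (LR) to fit a nonlinear curve
  - Nonlinear (high-degree) fits: risk of overfitting
- Normality: the errors follow a normal distribution.
- Independence: watch for multicollinearity between variables (a cause of overfitting) and for autocorrelation (time series)
  - scikit-learn: Lasso (absolute-value penalty), Ridge (squared penalty), ElasticNet (both)
  - Regularization fits the regression on the premise that some error is already present, trading a little bias for smaller, more stable coefficients
- Homoscedasticity: if the error variance is not constant (heteroscedasticity), use a model that accounts for it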
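
The effect of the three penalties in the list above can be seen by comparing coefficient sizes on the same data. This is a sketch on synthetic data; the dataset parameters and the `alpha` values are arbitrary assumptions picked for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

# Synthetic data: 10 features, only 3 of which actually drive y
X, y = make_regression(n_samples=60, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)

models = {
    "OLS": LinearRegression(),                           # no penalty
    "Ridge": Ridge(alpha=10.0),                          # squared (L2) penalty
    "Lasso": Lasso(alpha=1.0),                           # absolute-value (L1) penalty
    "ElasticNet": ElasticNet(alpha=1.0, l1_ratio=0.5),   # mix of both
}
coefs = {name: m.fit(X, y).coef_ for name, m in models.items()}

for name, c in coefs.items():
    print(f"{name:10s} L1 norm = {np.abs(c).sum():8.2f}  zero coefficients = {np.sum(c == 0)}")
```

Ridge shrinks all coefficients toward zero, while Lasso can set some of them exactly to zero, which is why it is also used for variable selection.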

from sklearn.datasets import make_regression
import statsmodels.api as sm
bias = 100
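# bias: the intercept
# n_features: the number of attributes (columns)
# coef: the slope(s)
X0, y, w = make_regression(n_samples=200, n_features=1, bias=bias, noise =10, coef=True,random_state=1)
print(X0[:5,:])
# statsmodels does not add an intercept on its own,
# so add_constant prepends a column of ones to X
X = sm.add_constant(X0) # account for the constant term (the intercept)
print(X[:5,:])
y= y.reshape(len(y),1)
print(w) # true slope used to generate the data: 86.44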

[[ 0.23249456]
[-0.03869551]
[ 0.5505375 ]
[ 0.50318481]
[ 2.18697965]]
[[ 1.          0.23249456]
[ 1.         -0.03869551]
[ 1.          0.5505375 ]
[ 1.          0.50318481]
[ 1.          2.18697965]]
86.44794300545998

import numpy as np
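# @ is the matrix-multiplication operator (the same operation ANNs, artificial neural networks, are built on)
w = np.linalg.inv(X.T @ X) @ X.T @ y # normal equation: w = (X^T X)^-1 X^T y, using the inverse matrix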
w

array([[99.79150869],
[86.96171201]])
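
The normal equation above gives the same answer as a dedicated least-squares solver, which avoids forming an explicit inverse. The sketch below rebuilds the data with the same `make_regression` arguments and uses `np.column_stack` instead of `sm.add_constant`; that substitution is an assumption made only to keep the example free of statsmodels.

```python
import numpy as np
from sklearn.datasets import make_regression

X0, y, true_coef = make_regression(n_samples=200, n_features=1, bias=100,
                                   noise=10, coef=True, random_state=1)
X = np.column_stack([np.ones(len(X0)), X0])   # column of ones = intercept term

# Normal equation: w = (X^T X)^-1 X^T y
w_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# Least-squares solver: same solution, numerically safer when X^T X is ill-conditioned
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_normal)   # [intercept, slope]
print(w_lstsq)
```

The first entry estimates the intercept (the `bias` of 100) and the second the slope, both close to but not exactly the values used to generate the data because of the added noise.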

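OLS linear regression

OLS finds the rule (the coefficients) of the equation and returns a full table of statistics.

Nothing in that table is wasted: it summarizes, as statistics, everything you want to know before relying on a linear regression.

It reports t-values and p-values together, so a single call shows how strongly each x (independent variable) is related to y (the dependent variable), along with diagnostics such as autocorrelation of the residuals and multicollinearity among the x's.

It is worth running before applying linear regression to any numeric dataset.

Here it was used, ahead of the machine learning itself, only to check how the x's influence y.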

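# Prob (F-statistic): the model is significant when this p-value is small (conventionally below 0.05)
# Durbin-Watson:                   1.690 / the closer to 2, the less autocorrelation in the residuals (toward 0: positive, toward 4: negative)
# Jarque-Bera (JB):                1.059 / the closer to 0, the closer the residuals are to normal
# Skew:                           0.121 / 0 is symmetric; the larger the magnitude, the more asymmetric
# Prob(JB):                        0.589 / p-value of the JB test; above 0.05, normality is not rejected
# Kurtosis:                       3.262 / a normal distribution has 3
# Cond. No.                         1.16 / a large value signals multicollinearity (statsmodels then prints a warning)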
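
Several of these summary statistics are simple functions of the residuals, so they can be reproduced by hand. The sketch below uses NumPy only, on synthetic data that is assumed here for illustration (it is not the dataset from the notebook).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 100 + 86 * x + rng.normal(scale=10, size=200)   # assumed synthetic data

X = np.column_stack([np.ones_like(x), x])           # intercept column + x
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta                                # residuals of the OLS fit

ss_res = np.sum(resid ** 2)                         # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)                # total sum of squares
r_squared = 1 - ss_res / ss_tot                     # explanatory power

durbin_watson = np.sum(np.diff(resid) ** 2) / ss_res  # about 2 when residuals are uncorrelated

z = (resid - resid.mean()) / resid.std()
skew = np.mean(z ** 3)                              # 0 for a symmetric distribution
kurtosis = np.mean(z ** 4)                          # 3 for a normal distribution

print(r_squared, durbin_watson, skew, kurtosis)
```

With one predictor and an intercept, R-squared equals the squared correlation between x and y, which gives an independent check of the calculation.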

import statsmodels.api as sm
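# ordinary least squares: a regression model
# find the rule (the coefficients) of the equation
# y = a*x + b
model = sm.OLS(y,X) # use ordinary least squares from statsmodels
                    # linear regression
result= model.fit()
print(result.summary())

result.params
# Dep. Variable:                      y  (the dependent variable)
# R-squared (explanatory power) = regression sum of squares / total sum of squares
# F value: a ratio of variances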


sklearn

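from sklearn.linear_model import LinearRegression
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
from sklearn.datasets import load_boston
import pandas as pd

boston=load_boston()

dfX = pd.DataFrame(boston.data, columns = boston.feature_names)
dfy = pd.DataFrame(boston.target, columns = ["MEDV"])
dfy.head() # target: MEDV, the median home value, which the model predicts

print(dfX.head())

# check the column names of the features
boston.feature_names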

array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')

There are 13 variables (features).
Independent variables (data) -> dependent variable (target)
fit: estimates the coefficients (parameters).

model_boston = LinearRegression().fit(boston.data, boston.target)

# the fitted result can be inspected through attributes
print(model_boston.coef_) # coefficients
model_boston.intercept_ # intercept

[-1.08011358e-01  4.64204584e-02  2.05586264e-02  2.68673382e+00
-1.77666112e+01  3.80986521e+00  6.92224640e-04 -1.47556685e+00
  3.06049479e-01 -1.23345939e-02 -9.52747232e-01  9.31168327e-03
-5.24758378e-01]
36.459488385089855

predictions = model_boston.predict(boston.data)

predictions

array([30.00384338, 25.02556238, 30.56759672, 28.60703649, 27.94352423,
       25.25628446, 23.00180827, 19.53598843, 11.52363685, 18.92026211,
       18.99949651, 21.58679568, 20.90652153, 19.55290281, 19.28348205,
       19.29748321, 20.52750979, 16.91140135, 16.17801106, 18.40613603,
       12.52385753, 17.67103669, 15.83288129, 13.80628535, 15.67833832,
       13.38668561, 15.46397655, 14.70847428, 19.54737285, 20.8764282 ,
       11.45511759, 18.05923295,  8.81105736, 14.28275814, 13.70675891,
       23.81463526, 22.34193708, 23.10891142, 22.91502612, 31.35762569,
       34.21510225, 28.02056414, 25.20386628, 24.60979273, 22.94149176,
       22.09669817, 20.42320032, 18.03655088,  9.10655377, 17.20607751,
       21.28152535, 23.97222285, 27.6558508 , 24.04901809, 15.3618477 ,
       31.15264947, 24.85686978, 33.10919806, 21.77537987, 21.08493555,
       17.8725804 , 18.51110208, 23.98742856, 22.55408869, 23.37308644,
       30.36148358, 25.53056512, 21.11338564, 17.42153786, 20.78483633,
       25.20148859, 21.7426577 , 24.55744957, 24.04295712, 25.50499716,
       23.9669302 , 22.94545403, 23.35699818, 21.26198266, 22.42817373,
       28.40576968, 26.99486086, 26.03576297, 25.05873482, 24.78456674,
       27.79049195, 22.16853423, 25.89276415, 30.67461827, 30.83110623,
       27.1190194 , 27.41266734, 28.94122762, 29.08105546, 27.03977365,
       28.62459949, 24.72744978, 35.78159518, 35.11454587, 32.25102801,
       24.58022019, 25.59413475, 19.79013684, 20.31167129, 21.43482591,
       18.53994008, 17.18755992, 20.75049026, 22.64829115, 19.7720367 ,
       20.64965864, 26.52586744, 20.77323638, 20.71548315, 25.17208881,
       20.43025591, 23.37724626, 23.69043261, 20.33578364, 20.79180873,
       21.91632071, 22.47107777, 20.55738556, 16.36661977, 20.56099819,
       22.48178446, 14.61706633, 15.17876684, 18.93868592, 14.05573285,
       20.03527399, 19.41013402, 20.06191566, 15.75807673, 13.25645238,
       17.26277735, 15.87841883, 19.36163954, 13.81483897, 16.44881475,
       13.57141932,  3.98885508, 14.59495478, 12.1488148 ,  8.72822362,
       12.03585343, 15.82082058,  8.5149902 ,  9.71844139, 14.80451374,
       20.83858153, 18.30101169, 20.12282558, 17.28601894, 22.36600228,
       20.10375923, 13.62125891, 33.25982697, 29.03017268, 25.56752769,
       32.70827666, 36.77467015, 40.55765844, 41.84728168, 24.78867379,
       25.37889238, 37.20347455, 23.08748747, 26.40273955, 26.65382114,
       22.5551466 , 24.29082812, 22.97657219, 29.07194308, 26.5219434 ,
       30.72209056, 25.61669307, 29.13740979, 31.43571968, 32.92231568,
       34.72440464, 27.76552111, 33.88787321, 30.99238036, 22.71820008,
       24.7664781 , 35.88497226, 33.42476722, 32.41199147, 34.51509949,
       30.76109485, 30.28934141, 32.91918714, 32.11260771, 31.55871004,
       40.84555721, 36.12770079, 32.6692081 , 34.70469116, 30.09345162,
       30.64393906, 29.28719501, 37.07148392, 42.03193124, 43.18949844,
       22.69034796, 23.68284712, 17.85447214, 23.49428992, 17.00587718,
       22.39251096, 17.06042754, 22.73892921, 25.21942554, 11.11916737,
       24.51049148, 26.60334775, 28.35518713, 24.91525464, 29.68652768,
       33.18419746, 23.77456656, 32.14051958, 29.7458199 , 38.37102453,
       39.81461867, 37.58605755, 32.3995325 , 35.45665242, 31.23411512,
       24.48449227, 33.28837292, 38.0481048 , 37.16328631, 31.71383523,
       25.26705571, 30.10010745, 32.71987156, 28.42717057, 28.42940678,
       27.29375938, 23.74262478, 24.12007891, 27.40208414, 16.3285756 ,
       13.39891261, 20.01638775, 19.86184428, 21.2883131 , 24.0798915 ,
       24.20633547, 25.04215821, 24.91964007, 29.94563374, 23.97228316,
       21.69580887, 37.51109239, 43.30239043, 36.48361421, 34.98988594,
       34.81211508, 37.16631331, 40.98928501, 34.44634089, 35.83397547,
       28.245743  , 31.22673593, 40.8395575 , 39.31792393, 25.70817905,
       22.30295533, 27.20340972, 28.51169472, 35.47676598, 36.10639164,
       33.79668274, 35.61085858, 34.83993382, 30.35192656, 35.30980701,
       38.79756966, 34.33123186, 40.33963075, 44.67308339, 31.59689086,
       27.3565923 , 20.10174154, 27.04206674, 27.2136458 , 26.91395839,
       33.43563311, 34.40349633, 31.8333982 , 25.81783237, 24.42982348,
       28.45764337, 27.36266999, 19.53928758, 29.11309844, 31.91054611,
       30.77159449, 28.94275871, 28.88191022, 32.79887232, 33.20905456,
       30.76831792, 35.56226857, 32.70905124, 28.64244237, 23.58965827,
       18.54266897, 26.87889843, 23.28133979, 25.54580246, 25.48120057,
       20.53909901, 17.61572573, 18.37581686, 24.29070277, 21.32529039,
       24.88682244, 24.86937282, 22.86952447, 19.45123791, 25.11783401,
       24.66786913, 23.68076177, 19.34089616, 21.17418105, 24.25249073,
       21.59260894, 19.98446605, 23.33888   , 22.14060692, 21.55509929,
       20.61872907, 20.16097176, 19.28490387, 22.1667232 , 21.24965774,
       21.42939305, 30.32788796, 22.04734975, 27.70647912, 28.54794117,
       16.54501121, 14.78359641, 25.27380082, 27.54205117, 22.14837562,
       20.45944095, 20.54605423, 16.88063827, 25.40253506, 14.32486632,
       16.59488462, 19.63704691, 22.71806607, 22.20218887, 19.20548057,
       22.66616105, 18.93192618, 18.22846804, 20.23150811, 37.4944739 ,
       14.28190734, 15.54286248, 10.83162324, 23.80072902, 32.6440736 ,
       34.60684042, 24.94331333, 25.9998091 ,  6.126325  ,  0.77779806,
       25.30713064, 17.74061065, 20.23274414, 15.83331301, 16.83512587,
       14.36994825, 18.47682833, 13.4276828 , 13.06177512,  3.27918116,
        8.06022171,  6.12842196,  5.6186481 ,  6.4519857 , 14.20764735,
       17.21225183, 17.29887265,  9.89116643, 20.22124193, 17.94181175,
       20.30445783, 19.29559075, 16.33632779,  6.55162319, 10.89016778,
       11.88145871, 17.81174507, 18.26126587, 12.97948781,  7.37816361,
        8.21115861,  8.06626193, 19.98294786, 13.70756369, 19.85268454,
       15.22308298, 16.96071981,  1.71851807, 11.80578387, -4.28131071,
        9.58376737, 13.36660811,  6.89562363,  6.14779852, 14.60661794,
       19.6000267 , 18.12427476, 18.52177132, 13.1752861 , 14.62617624,
        9.92374976, 16.34590647, 14.07519426, 14.25756243, 13.04234787,
       18.15955693, 18.69554354, 21.527283  , 17.03141861, 15.96090435,
       13.36141611, 14.52079384,  8.81976005,  4.86751102, 13.06591313,
       12.70609699, 17.29558059, 18.740485  , 18.05901029, 11.51474683,
       11.97400359, 17.68344618, 18.12695239, 17.5183465 , 17.22742507,
       16.52271631, 19.41291095, 18.58215236, 22.48944791, 15.28000133,
       15.82089335, 12.68725581, 12.8763379 , 17.18668531, 18.51247609,
       19.04860533, 20.17208927, 19.7740732 , 22.42940768, 20.31911854,
       17.88616253, 14.37478523, 16.94776851, 16.98405762, 18.58838397,
       20.16719441, 22.97718032, 22.45580726, 25.57824627, 16.39147632,
       16.1114628 , 20.534816  , 11.54272738, 19.20496304, 21.86276391,
       23.46878866, 27.09887315, 28.56994302, 21.08398783, 19.45516196,
       22.22225914, 19.65591961, 21.32536104, 11.85583717,  8.22386687,
        3.66399672, 13.75908538, 15.93118545, 20.62662054, 20.61249414,
       16.88541964, 14.01320787, 19.10854144, 21.29805174, 18.45498841,
       20.46870847, 23.53334055, 22.37571892, 27.6274261 , 26.12796681,
       22.34421229])

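Residual: the difference between the value of the dependent variable estimated by the model and the value actually observed.
It can be read as the error, the part of the data that the model does not explain.

predictions - boston.target # residuals (the usual convention is observed - predicted, i.e. the negative of this)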
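
Since `load_boston` is no longer shipped with recent scikit-learn releases, the same residual calculation can be tried on a dataset that still is. The sketch below swaps in `load_diabetes`; that choice of dataset is an assumption made here, not something the original post uses.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)        # 442 samples, 10 features
model = LinearRegression().fit(X, y)

predictions = model.predict(X)
residuals = y - predictions                  # observed minus predicted

print(residuals[:5])
print("mean residual:", residuals.mean())    # close to 0 for OLS with an intercept
print("R^2:", model.score(X, y))
```

The residuals are computed on the training data, so they describe how well the model fits, not how well it generalizes.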

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

x = np.array([0.0,1.0,2.0,3.0,4.0,5.0])
y = np.array([0.0,0.8,0.9,0.1,-0.8,-1.0])
z = np.polyfit(x,y,3)
z

array([ 0.08703704, -0.81349206, 1.69312169, -0.03968254])

X = [[0.44,0.68], [0.99,0.23]]
y = [109.85,155.72]
X_test = [[0.48,0.18]]

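Nonlinear regression = transform the inputs into polynomial terms + LR (linear regression)

poly = PolynomialFeatures(degree=2) # degree 2 -> a parabola-shaped fit
# scikit-learn estimators: fit() estimates the parameters (slope, intercept), predict() makes predictions
X_ = poly.fit_transform(X) # transformer: fit + transform
X_test_ = poly.transform(X_test) # reuse the fitted transformer on the test data; do not refit

lg = LinearRegression() # linear regression
lg.fit(X_, y)
lg.coef_ # coefficients of the polynomial terms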

array([ 0. , 19.4606578 , -15.92235638, 27.82874066,
-2.52988551, -14.48934431])

lg.predict(X_test_) # 비선형 예측

array([126.38247985])
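
The two steps, polynomial expansion and linear regression, can be chained so that training and test data always go through the same fitted transformer. The sketch below uses `make_pipeline` on noise-free quadratic data that is assumed here for illustration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Assumed toy data lying exactly on y = 1 + 2x + 3x^2
x = np.arange(6, dtype=float).reshape(-1, 1)
y = 1 + 2 * x.ravel() + 3 * x.ravel() ** 2

# The pipeline expands x into [1, x, x^2] and then fits a linear model on those columns
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)

pred = model.predict(np.array([[6.0]]))
print(pred)   # the curve evaluated at x = 6
```

With six points and three polynomial terms the fit is well determined. The two-sample example above has fewer samples than polynomial terms, so many coefficient sets fit its data equally well.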

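Logistic regression: the logistic function => output between 0 and 1

Logistic regression always outputs a value between 0 and 1

Regression output -> mapped through the logistic function (a probability between 0 and 1)
- Threshold of 0.5 (binary): below 0.5 is false, 0.5 or above is true
- Extension: with more than two classes, use softmax
Input of continuous independent variables => a discrete decision
Regression: continuous values in -> continuous values out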
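
The mapping described above can be written out in a few lines. This is a sketch with made-up scores; `sigmoid` and `softmax` are helper names defined here, not scikit-learn functions.

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps any real score into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Multi-class generalization: positive outputs that sum to 1
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([-2.0, 0.0, 3.0])        # assumed raw regression outputs
probs = sigmoid(scores)                    # probabilities between 0 and 1
labels = (probs >= 0.5).astype(int)        # threshold at 0.5 -> binary decision

class_probs = softmax(np.array([1.0, 2.0, 0.5]))   # three-class example

print(probs, labels)
print(class_probs, class_probs.argmax())
```

A score of 0 maps to a probability of exactly 0.5, which is why the decision boundary of logistic regression is the set of points where the linear score is zero.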

from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
import mglearn
X, y = mglearn.datasets.make_forge()
fig, axes = plt.subplots(1,2,figsize=(10,3))
# axes: the drawing canvases

# Support vector classifier: a classifier built from support vectors
# SVC (classification): finds the separating boundary with the largest margin
# zip pairs the two sequences element by element
for model, ax in zip([LinearSVC(),LogisticRegression()], axes): # one canvas for the SVC, one for logistic regression
    clf = model.fit(X,y) # estimate the coefficients
    mglearn.plots.plot_2d_separator(clf, X, fill=True, eps = 0.5,
                                   ax=ax, alpha = .7) # alpha: transparency
    mglearn.discrete_scatter(X[:,0], X[:,1],y, ax =ax)
    ax.set_title("{}".format(clf.__class__.__name__))
    ax.set_xlabel("feature 0")
    ax.set_ylabel("feature 1")
axes[0].legend()

# select: model selection, variable selection, feature extraction
from sklearn.datasets import load_breast_cancer # breast cancer data (2-D array)
from sklearn.model_selection import train_test_split
# by default the split is train = 0.75 / test = 0.25
cancer = load_breast_cancer() # data, target (the dependent variable)
X_train, X_test, y_train, y_test = train_test_split(cancer.data,cancer.target, stratify=cancer.target,random_state=42) # seed value
# stratify: keep the class ratio the same in both splits
logreg= LogisticRegression().fit(X_train,y_train)
# score = accuracy (the fraction classified correctly)
print("Training set score: {:.3f}".format(logreg.score(X_train,y_train)))

# compare with the training score to check for overfitting
print("Test set score: {:.3f}".format(logreg.score(X_test,y_test)))

Training set score: 0.955
Test set score: 0.958
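
What `stratify` does is easiest to see on an imbalanced label vector. The sketch below uses assumed toy labels (80 zeros and 20 ones) and checks that both splits keep the 20% share of the minority class.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumed toy data: 100 samples, 20% of them in class 1
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)

# Default test_size is 0.25; stratify=y preserves the class ratio in both parts
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

print(len(y_train), y_train.mean())   # share of class 1 in the training part
print(len(y_test), y_test.mean())     # share of class 1 in the test part
```

Without `stratify`, a small test set can end up with too few or even no samples of the rare class, which makes the test score unreliable.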

How the accuracy score is calculated: see the post below.

2021.03.10 - [Statistics for Data Science] - Statistics for Data Science, Day 10 (test statistics), datacook.tistory.com

cancer.feature_names
data = pd.DataFrame(cancer.data)
print(data.head())

      0      1       2       3        4        5       6        7       8   \
0  17.99  10.38  122.80  1001.0  0.11840  0.27760  0.3001  0.14710  0.2419
1  20.57  17.77  132.90  1326.0  0.08474  0.07864  0.0869  0.07017  0.1812
2  19.69  21.25  130.00  1203.0  0.10960  0.15990  0.1974  0.12790  0.2069
3  11.42  20.38   77.58   386.1  0.14250  0.28390  0.2414  0.10520  0.2597
4  20.29  14.34  135.10  1297.0  0.10030  0.13280  0.1980  0.10430  0.1809

        9   ...     20     21      22      23      24      25      26      27  \
0  0.07871  ...  25.38  17.33  184.60  2019.0  0.1622  0.6656  0.7119  0.2654
1  0.05667  ...  24.99  23.41  158.80  1956.0  0.1238  0.1866  0.2416  0.1860
2  0.05999  ...  23.57  25.53  152.50  1709.0  0.1444  0.4245  0.4504  0.2430
3  0.09744  ...  14.91  26.50   98.87   567.7  0.2098  0.8663  0.6869  0.2575
4  0.05883  ...  22.54  16.67  152.20  1575.0  0.1374  0.2050  0.4000  0.1625

       28       29
0  0.4601  0.11890
1  0.2750  0.08902
2  0.3613  0.08758
3  0.6638  0.17300
4  0.2364  0.07678

[5 rows x 30 columns]

import numpy as np
from sklearn.model_selection import GroupKFold
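# model_selection: tools for splitting the data a model is trained and tested on
# X row -> y
# 1,2  -> 1
# 3,4  -> 2
# 5,6  -> 3
# 7,8  -> 4
X = np.array([[1,2],[3,4],[5,6],[7,8]]) # 4x2: two variables
y = np.array([1,2,3,4]) # dependent variable
groups = np.array([0,0,2,2]) # group label of each row
group_kfold = GroupKFold(n_splits=2) # 2 folds; rows of the same group never end up on both sides of a split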
group_kfold.get_n_splits(X,y,groups) 
print(group_kfold)
for train_index, test_index in group_kfold.split(X,y,groups):
    print("TRAIN:",train_index,"TEST:",test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print(X_train, X_test, y_train, y_test)

GroupKFold(n_splits=2)
TRAIN: [0 1] TEST: [2 3]
[[1 2]
[3 4]] [[5 6]
[7 8]] [1 2] [3 4]
TRAIN: [2 3] TEST: [0 1]
[[5 6]
[7 8]] [[1 2]
[3 4]] [3 4] [1 2]

Feature_selection

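from sklearn.datasets import make_friedman1 # synthetic dataset generator for practice
from sklearn.feature_selection import RFE # recursive feature elimination
# removes features recursively
from sklearn.svm import SVR
# 10 feature columns
# 50 observations
X, y = make_friedman1(n_samples=50,n_features=10,random_state=0)
estimator = SVR(kernel="linear")  # support vector regression with a linear kernel
# SVR: regression (prediction); SVC: classification
selector = RFE(estimator, n_features_to_select=5, step=1) # keep only 5 variables (must be passed by keyword in recent scikit-learn)
# step=1: remove one feature per iteration
# why remove variables: to cut noise and drop features that have no influence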
selector = selector.fit(X,y)
selector.support_
selector.ranking_

array([1, 1, 1, 1, 1, 6, 4, 3, 2, 5])

# Text mining -> vectorizing
sample = ['problem of evil',
         'evil queen',
         'horizon problem']

# feature extraction
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
X = vec.fit_transform(sample)
print(type(X))
pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out()) # get_feature_names() was removed in scikit-learn 1.2

from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer() # term frequency + inverse document frequency
# TF: how often a term occurs in a document; IDF: total documents / documents containing the term (log-scaled)
# terms that appear in few documents get extra weight
X = vec.fit_transform(sample)
pd.DataFrame(X.toarray(),columns=vec.get_feature_names_out())
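
The TF-IDF weights can be rebuilt by hand from the raw counts, which makes the "extra weight for rare terms" concrete. The sketch below assumes the default `TfidfVectorizer` settings (smoothed IDF followed by L2 normalization of each row); with other settings the formula differs.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

sample = ['problem of evil', 'evil queen', 'horizon problem']

counts = CountVectorizer().fit_transform(sample).toarray()   # term counts per document
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)                          # documents containing each term

# Smoothed inverse document frequency: rarer terms get a larger factor
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)   # L2-normalize each row

reference = TfidfVectorizer().fit_transform(sample).toarray()
print(np.allclose(manual, reference))
```

In this sample "evil" and "problem" each occur in two of the three documents, so they receive a smaller IDF factor than "horizon", "of" and "queen", which occur in only one.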

Classification prediction

import numpy as np
from sklearn.metrics import accuracy_score
y_pred = [0,2,1,3]
y_true = [0,1,2,3]
accuracy_score(y_true, y_pred, normalize=True) # accuracy 0.5; normalize=True returns the fraction correct

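# SVC: one of the most widely used classical machine-learning classifiers
# SVC kernels: rbf (radial basis function), poly, sigmoid
# cross validation with cv=5: the data is split into 5 folds
from sklearn import svm, datasets
from sklearn.model_selection import cross_val_score
iris = datasets.load_iris()
X, y = iris.data, iris.target
# SVC separates the data in a higher-dimensional space; the kernel is the function that implicitly maps the data there (for example from 2-D to 3-D)
# gamma controls the shape of the kernel
clf = svm.SVC(gamma='scale',random_state=0) # the name that refers to the model
cross_val_score(clf,X,y,scoring = 'recall_macro',cv=5)
# recall = sensitivity
# train on 4 folds and test on the remaining 1, then rotate which fold is the test fold
# so there are 5 results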

array([0.96666667, 0.96666667, 0.96666667, 0.93333333, 1. ])
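
The rotation of folds described in the comments can be written as an explicit loop. The sketch below assumes that `cross_val_score` with an integer `cv` and a classifier uses unshuffled `StratifiedKFold`, and compares the loop against it.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.base import clone
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(gamma='scale', random_state=0)

manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    fold_model = clone(clf).fit(X[train_idx], y[train_idx])   # train on 4 folds
    y_pred = fold_model.predict(X[test_idx])                  # test on the held-out fold
    manual_scores.append(recall_score(y[test_idx], y_pred, average='macro'))
manual_scores = np.array(manual_scores)

reference = cross_val_score(clf, X, y, scoring='recall_macro', cv=5)
print(manual_scores)
print(reference)
```

Each sample is used for testing exactly once, so the five scores together cover the whole dataset.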

# metrics
from sklearn.metrics import confusion_matrix
y_true = [2,0,2,2,0,1]
y_pred = [0,0,2,2,0,2]
confusion_matrix(y_true, y_pred) # confusion matrix: rows are true classes, columns are predicted classes

array([[2, 0, 0],
[0, 0, 1],
[1, 0, 2]], dtype=int64)
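
The matrix above can be built by counting (true, predicted) pairs, which shows how to read it. This is a NumPy-only sketch; the variable name `cm` is introduced here for illustration.

```python
import numpy as np

y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]

n_classes = 3
cm = np.zeros((n_classes, n_classes), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1          # row = true class, column = predicted class

print(cm)
print("correct per class:", np.diag(cm))   # the diagonal holds the correct predictions
```

Reading row 2: of the three samples whose true class is 2, one was predicted as class 0 and two were predicted correctly.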

import numpy as np
import sklearn.metrics as metrics
y = np.array([1,1,1,1,0,0]) # actual values
p = np.array([1,1,0,0,0,0]) # predicted values
accuracy = np.mean(np.equal(y,p)) # accuracy
print(accuracy)
right = np.sum(y * p == 1) # true positives: actual and predicted are both 1
print(right)
precision = right / np.sum(p) # precision = TP / (TP + FP)
print(precision)
recall = right / np.sum(y) # recall (sensitivity) = TP / (TP + FN)
f1 = 2* precision*recall/(precision+recall)
print(y)

0.6666666666666666
2
1.0
[1 1 1 1 0 0]

# the same metrics as ready-made functions that take the actual and predicted values
print('accuracy', metrics.accuracy_score(y,p)) # accuracy
print("precision", metrics.precision_score(y,p)) # precision
print('recall', metrics.recall_score(y,p)) # recall (sensitivity)
print('f1',metrics.f1_score(y,p))
print(metrics.classification_report(y,p))
print(metrics.confusion_matrix(y,p))

accuracy 0.6666666666666666
precision 1.0
recall 0.5
f1 0.6666666666666666
              precision    recall  f1-score   support

           0       0.50      1.00      0.67         2
           1       1.00      0.50      0.67         4

    accuracy                           0.67         6
   macro avg       0.75      0.75      0.67         6
weighted avg       0.83      0.67      0.67         6

[[2 0]
[2 2]]

# Exercise
# classification
# with LogisticRegression
import sklearn.metrics as metrics
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X,y = make_classification(n_samples=28, n_features=2,
                         n_informative = 2, n_redundant =0,
                         random_state=0)

# 1) Fit a LogisticRegression
# 2) Predict on the training data
# 3) Evaluate the result

model = LogisticRegression() # create the model
model.fit(X,y) # fit: estimates the coefficients
# model1 = LogisticRegression().fit(X,y)
y_hat = model.predict(X) # then predict, and that is all
# precision
print("precision", metrics.precision_score(y,y_hat))
print("recall", metrics.recall_score(y,y_hat))
print("F1 score", metrics.f1_score(y,y_hat))
print(metrics.classification_report(y,y_hat))
print(metrics.confusion_matrix(y,y_hat))

precision 0.8666666666666667
recall 1.0
F1 score 0.9285714285714286
              precision    recall  f1-score   support

           0       1.00      0.87      0.93        15
           1       0.87      1.00      0.93        13

    accuracy                           0.93        28
   macro avg       0.93      0.93      0.93        28
weighted avg       0.94      0.93      0.93        28

[[13  2]
[ 0 13]]

y_hat.shape

(28,)

# 5-fold cross-validation
from sklearn.model_selection import cross_val_score
clf = svm.SVC(gamma='scale',random_state=0) # the name that refers to the model
Z1 = cross_val_score(clf,X,y,scoring = 'recall_macro',cv=5)
Z2 = cross_val_score(clf,X2,y2,scoring = 'recall_macro',cv=5) # X2, y2: a second dataset that this post does not show
print(Z1)
print(Z2)

[0.75 1. 0.5 1. 1. ]
[0.75 0.75 1. 1. 1. ]
<class 'numpy.ndarray'>

print(metrics.classification_report(y,y_hat))

              precision    recall  f1-score   support

           0       0.78      0.88      0.82         8
           1       0.86      0.75      0.80         8

    accuracy                           0.81        16
   macro avg       0.82      0.81      0.81        16
weighted avg       0.82      0.81      0.81        16

- model selection: how to split the data a model needs (train_test_split, GroupKFold, KFold)
- feature selection: RFE (recursive feature elimination); features with small variance are treated as unimportant
- feature extraction: text mining with CountVectorizer and TF-IDF (term frequency, inverse document frequency)

Evaluation: classification_report for classification results, accuracy_score for prediction accuracy

# function that samples the data and splits it into train and test sets
# select: model selection, variable selection, feature extraction
from sklearn.model_selection import train_test_split
# by default the split is train = 0.75 / test = 0.25
X_train, X_test, y_train, y_test = train_test_split(X,y, stratify = y,random_state=42) # seed value
# stratify: keep the class ratio the same in both splits
logreg= LogisticRegression().fit(X_train,y_train)
# score = accuracy (the fraction classified correctly)
print("Training set score: {:.3f}".format(logreg.score(X_train,y_train)))

# a training score well above the test score points to overfitting
print("Test set score: {:.3f}".format(logreg.score(X_test,y_test)))

Training set score: 0.905
Test set score: 0.857

