
Reference
- <파이썬 한권으로 끝내기> (Python in One Book), 데싸라면▪빨간색 물고기▪자투리코드, 시대고시기획 시대교육

Statistical analysis with the SciPy package

https://docs.scipy.org/doc/scipy/reference/stats.html

Statistical functions (scipy.stats) — SciPy v1.10.1 Manual


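Cross-tabulation analysis ($\chi^2$ test)

The goal is to describe how an outcome variable is distributed across the categories of the data or, when there are two or more categorical variables, to test whether those variables are associated.

Explanatory variable: categorical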

DataSet

https://www.kaggle.com/datasets/akshaysehgal/titanic-data-for-data-preprocessing

Titanic data for Data Preprocessing

Structured as popular Seaborn API


1. Goodness-of-fit test

Tests whether the frequency distribution of the data across categories follows the theoretically expected distribution.
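
For reference, the statistic behind this test is usually written as below. The symbols are notation introduced here rather than taken from the source: $O_i$ is the observed frequency of category $i$, $E_i$ is the frequency expected under the null hypothesis, and $k$ is the number of categories.

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}, \qquad \text{df} = k - 1$$

A large value of $\chi^2$ means the observed counts are far from the expected ones, which leads to a small p-value.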

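Using the titanic data, let's build a frequency table for the sex variable and run a goodness-of-fit test on the hypotheses below (significance level: 0.05).

Null hypothesis: among the Titanic survivors, 50% are male and 50% are female.

Alternative hypothesis: the proportions of males and females among the Titanic survivors cannot be said to be 50% each.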

In [ ]:

import pandas as pd
df = pd.read_csv("./data/titanic.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   survived     891 non-null    int64  
 1   pclass       891 non-null    int64  
 2   sex          891 non-null    object 
 3   age          714 non-null    float64
 4   sibsp        891 non-null    int64  
 5   parch        891 non-null    int64  
 6   fare         891 non-null    float64
 7   embarked     889 non-null    object 
 8   class        891 non-null    object 
 9   adult_male   891 non-null    bool   
 10  embark_town  889 non-null    object 
dtypes: bool(1), float64(2), int64(4), object(4)
memory usage: 70.6+ KB

In [ ]:

# Keep only the passengers who survived
df_t = df[df['survived'] == 1]
# Observed frequency of each sex among the survivors
table = df_t['sex'].value_counts()
table

Out[ ]:

sex   
female    233
male      109
Name: count, dtype: int64

In [ ]:

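from scipy.stats import chisquare
# Under the null hypothesis, each sex is expected to account for half of the survivors
expected = (table['female'] + table['male']) / 2
chi = chisquare(table, f_exp=[expected, expected])
print('<Goodness-of-fit test>\n', chi)

<Goodness-of-fit test>
 Power_divergenceResult(statistic=44.95906432748538, pvalue=2.011967257447723e-11)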

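The p-value is smaller than the significance level (0.05), so we reject the null hypothesis. In other words, the ratio of men to women among the Titanic survivors cannot be said to be 50:50.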
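
As a sanity check, the statistic reported by chisquare can be reproduced by hand. The sketch below is not part of the original notebook: it hard-codes the survivor counts printed above (233 female, 109 male) instead of reading the CSV, and the variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2, chisquare

# Survivor counts taken from the value_counts output above (female, male).
observed = np.array([233, 109])

# Under the null hypothesis (50:50), each category expects half of the total.
expected = np.full(observed.shape, observed.sum() / observed.size)

# Chi-square statistic: sum over categories of (O - E)^2 / E.
statistic = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom for a goodness-of-fit test: number of categories minus 1.
dof = observed.size - 1

# p-value: upper-tail probability of the chi-square distribution.
p_value = chi2.sf(statistic, dof)

# Cross-check against scipy's own implementation.
result = chisquare(observed, f_exp=expected)
print(statistic, p_value)
print(result.statistic, result.pvalue)
```

When the expected frequencies are all equal, as here, chisquare(observed) without f_exp gives the same result, because equal expected frequencies are its default.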

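2. Test of independence

When a population is categorized by two variables A and B, this tests whether the two variables are independent of each other.

Using the titanic data, let's test whether passenger class (class) and survival (survived) are independent.

Null hypothesis: the class variable and the survived variable are independent.

Alternative hypothesis: the class variable and the survived variable are not independent.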

In [ ]:

df = pd.read_csv("data/titanic.csv")
# crosstab: build the contingency table
table = pd.crosstab(df['class'], df['survived'])
table

Out[ ]:

survived    0    1
class             
First      80  136
Second     97   87
Third     372  119

In [ ]:

from scipy.stats import chi2_contingency
chi2_contingency(table)

Out[ ]:

Chi2ContingencyResult(statistic=102.88898875696056, pvalue=4.549251711298793e-23, dof=2, expected_freq=array([[133.09090909,  82.90909091],
       [113.37373737,  70.62626263],
       [302.53535354, 188.46464646]]))

The p-value is smaller than the significance level (0.05), so we reject the null hypothesis.

In other words, passenger class and survival are not independent.
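
The expected_freq array in the output above follows from the independence assumption: each expected count is (row total × column total) / grand total. The sketch below recomputes it by hand from the crosstab counts shown earlier. The counts are hard-coded rather than read from the CSV, and the variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Counts from the crosstab output above: rows are First, Second, Third;
# columns are survived = 0 and survived = 1.
observed = np.array([[80, 136],
                     [97, 87],
                     [372, 119]])

# Expected counts under independence: row total * column total / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Chi-square statistic and degrees of freedom (rows - 1) * (columns - 1).
statistic = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = chi2.sf(statistic, dof)

# Cross-check against scipy; tuple unpacking works across SciPy versions.
sp_statistic, sp_p_value, sp_dof, sp_expected = chi2_contingency(observed)
print(expected)
print(statistic, p_value, dof)
```

The hand-computed statistic matches here because the table is 3×2. For a 2×2 table (one degree of freedom), chi2_contingency applies Yates' continuity correction by default, so a manual calculation like this one would only agree with correction=False.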

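3. Test of homogeneity

When a population is divided into R subpopulations by some variable, this tests whether the samples drawn from the R subpopulations share the same distribution over C categories.

The calculation and test procedure are the same as for the test of independence.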
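
To illustrate that the calculation is the same, the sketch below reuses the class-by-survived counts from section 2 and reads them as a homogeneity question: do the three passenger classes share the same survival distribution? Treating the classes as separately sampled subpopulations is an assumption made only for this illustration, since the original post gives no homogeneity example.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts from the crosstab output in section 2: rows are First, Second, Third;
# columns are survived = 0 and survived = 1.
observed = np.array([[80, 136],
                     [97, 87],
                     [372, 119]])

# Distribution over the survived categories within each class (rows sum to 1).
proportions = observed / observed.sum(axis=1, keepdims=True)
for label, row in zip(["First", "Second", "Third"], proportions):
    print(f"{label}: not survived={row[0]:.3f}, survived={row[1]:.3f}")

# Same function and same arithmetic as the test of independence.
statistic, p_value, dof, expected = chi2_contingency(observed)
print(statistic, p_value, dof)
```

Only the hypotheses change: the null hypothesis becomes "the survival distribution is the same in every class", and with this p-value it would be rejected at the 0.05 level.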

4. Summary

Build a contingency table: crosstab
Goodness-of-fit test: chisquare
Test of independence: chi2_contingency
Test of homogeneity: chi2_contingency
