[Python Data Handling] Pandas Practice Tutorial - 03_Grouping

2023. 6. 13. 19:59


Reference
- DataManim (https://www.datamanim.com/dataset/99_pandas/pandasMain.html#)
- <파이썬 한권으로 끝내기>, 데싸라면▪빨간색 물고기▪자투리코드, 시대고시기획 시대교육

DataSet

New York Airbnb: https://www.kaggle.com/ptoscano230382/air-bnb-ny-2019

DataUrl = 'https://raw.githubusercontent.com/Datamanim/pandas/main/AB_NYC_2019.csv'

Question

✔ Load the data and print the first 5 rows

In [ ]:

import pandas as pd
DataUrl = 'https://raw.githubusercontent.com/Datamanim/pandas/main/AB_NYC_2019.csv'
df = pd.read_csv(DataUrl)
df.head(5)

Out[ ]:

	id	name	host_id	host_name	neighbourhood_group	neighbourhood	latitude	longitude	room_type	price	minimum_nights	number_of_reviews	last_review	reviews_per_month	calculated_host_listings_count	availability_365
0	2539	Clean & quiet apt home by the park	2787	John	Brooklyn	Kensington	40.64749	-73.97237	Private room	149	1	9	2018-10-19	0.21	6	365
1	2595	Skylit Midtown Castle	2845	Jennifer	Manhattan	Midtown	40.75362	-73.98377	Entire home/apt	225	1	45	2019-05-21	0.38	2	355
2	3647	THE VILLAGE OF HARLEM....NEW YORK !	4632	Elisabeth	Manhattan	Harlem	40.80902	-73.94190	Private room	150	3	0	NaN	NaN	1	365
3	3831	Cozy Entire Floor of Brownstone	4869	LisaRoxanne	Brooklyn	Clinton Hill	40.68514	-73.95976	Entire home/apt	89	1	270	2019-07-05	4.64	1	194
4	5022	Entire Apt: Spacious Studio/Loft by central park	7192	Laura	Manhattan	East Harlem	40.79851	-73.94399	Entire home/apt	80	10	9	2018-11-19	0.10	1	0

✔ Compute the frequency of each host_name, sort by host_name, and print the first 5 entries

In [ ]:

Ans = df.groupby('host_name').size()
Ans.head(5)

Out[ ]:

host_name
'Cil                        1
(Ari) HENRY LEE             1
(Email hidden by Airbnb)    6
(Mary) Haiy                 1
-TheQueensCornerLot         1
dtype: int64

+ .groupby('host_name'): groups the DataFrame rows by the value of the "host_name" column

+ .size(): counts the number of rows in each group

In [ ]:

Ans = df.host_name.value_counts().sort_index()
Ans.head(5)

Out[ ]:

host_name
'Cil                        1
(Ari) HENRY LEE             1
(Email hidden by Airbnb)    6
(Mary) Haiy                 1
-TheQueensCornerLot         1
Name: count, dtype: int64

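+ df.host_name.value_counts() : counts how many times each unique value appears in the "host_name" column

+ .sort_index() : sorts the result by its index in ascending order, so the counts are listed in the sorted order of the unique "host_name" values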
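The two approaches above can be compared on a small example. This is a minimal sketch on a made-up `toy` frame (not the Airbnb data); it assumes the default settings, under which both `groupby` and `value_counts` ignore missing names.

```python
import pandas as pd

# Hypothetical toy data standing in for the host_name column
toy = pd.DataFrame({"host_name": ["John", "Alex", "John", None, "Alex", "John"]})

# Approach 1: group the rows and count the rows per group
by_size = toy.groupby("host_name").size()

# Approach 2: count unique values, then order by the index (the names)
by_counts = toy["host_name"].value_counts().sort_index()

# Same labels and counts; only the Series name can differ between pandas versions
same = by_size.to_dict() == by_counts.to_dict()
print(by_size)
print(same)
```

The missing name is dropped in both results, which is why the counts agree.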

✔ Compute the frequency of each host_name and build a DataFrame sorted by frequency in descending order. Name the frequency column counts

In [ ]:

df.groupby('host_name').size().to_frame().rename(columns={0:'counts'}).sort_values('counts', ascending=False)

Out[ ]:

	counts
host_name
Michael	417
David	403
Sonder (NYC)	327
John	294
Alex	279
...	...
Jerbean	1
Jerald	1
Jeonghoon	1
Jeny	1
현선	1

11452 rows × 1 columns

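+ .to_frame(): converts the per-group row counts (a Series) into a DataFrame

+ .rename(columns={0:'counts'}): renames the column. The unnamed Series becomes a column labelled 0, which is renamed to 'counts'

+ .sort_values('counts', ascending=False): sorts by the 'counts' column in descending order, so the groups with the most rows come first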
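As a shorter variant of the chain above, `to_frame` accepts the column name directly, which removes the `rename` step. A minimal sketch on made-up data (the `toy` frame and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data standing in for the host_name column
toy = pd.DataFrame({"host_name": ["John", "Alex", "John", "Mia", "Alex", "John"]})

counts = (
    toy.groupby("host_name")
       .size()                                   # rows per host_name
       .to_frame("counts")                       # name the column while converting
       .sort_values("counts", ascending=False)   # most frequent first
)
print(counts)
```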

✔ For each neighbourhood_group, count the occurrences of each neighbourhood value

In [ ]:

df.groupby(['neighbourhood_group','neighbourhood'], as_index=False).size()

Out[ ]:

	neighbourhood_group	neighbourhood	size
0	Bronx	Allerton	42
1	Bronx	Baychester	7
2	Bronx	Belmont	24
3	Bronx	Bronxdale	19
4	Bronx	Castle Hill	9
...	...	...	...
216	Staten Island	Tottenville	7
217	Staten Island	West Brighton	18
218	Staten Island	Westerleigh	2
219	Staten Island	Willowbrook	1
220	Staten Island	Woodrow	1

221 rows × 3 columns

+ groupby(['neighbourhood_group','neighbourhood']): groups the DataFrame by the two columns "neighbourhood_group" and "neighbourhood", so rows that share the same pair of values fall into the same group

+ as_index=False : by default the group keys become the index of the .size() result. Passing as_index=False keeps the key combination as ordinary columns instead
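The effect of `as_index=False` can be seen by comparing it with the default behaviour. A minimal sketch on made-up rows (the `toy` frame is an assumption; `as_index=False` with `.size()` needs a reasonably recent pandas):

```python
import pandas as pd

# Hypothetical toy data with the two grouping columns
toy = pd.DataFrame({
    "neighbourhood_group": ["Bronx", "Bronx", "Bronx", "Queens", "Queens"],
    "neighbourhood": ["Allerton", "Allerton", "Belmont", "Astoria", "Astoria"],
})
keys = ["neighbourhood_group", "neighbourhood"]

flat = toy.groupby(keys, as_index=False).size()   # keys stay as columns
indexed = toy.groupby(keys).size()                # keys become a MultiIndex
via_reset = indexed.reset_index(name="size")      # equivalent flat form

print(flat)
print(indexed.index.nlevels)
```

`reset_index(name="size")` is an alternative that produces the same flat layout from the indexed result.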

✔ From the neighbourhood counts within each neighbourhood_group, print the maximum values for each neighbourhood_group

In [ ]:

df.groupby(['neighbourhood_group', 'neighbourhood'], as_index=False).size().groupby(['neighbourhood_group'], as_index=False).max()

Out[ ]:

	neighbourhood_group	neighbourhood	size
0	Bronx	Woodlawn	70
1	Brooklyn	Windsor Terrace	3920
2	Manhattan	West Village	2658
3	Queens	Woodside	900
4	Staten Island	Woodrow	48
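One point worth noting about the result above: `GroupBy.max()` takes the maximum of each column independently, so the neighbourhood shown is the alphabetically last name in each group and the size is the largest count, and the two need not come from the same row. If the neighbourhood that actually has the largest count is wanted, one option is `idxmax`. A minimal sketch with made-up counts (the `sizes` frame is an assumption, not the real data):

```python
import pandas as pd

# Hypothetical per-neighbourhood counts
sizes = pd.DataFrame({
    "neighbourhood_group": ["Bronx", "Bronx", "Queens", "Queens"],
    "neighbourhood": ["Allerton", "Woodlawn", "Astoria", "Woodside"],
    "size": [42, 11, 900, 235],
})

# Column-wise max: largest name and largest size, taken independently
colwise = sizes.groupby("neighbourhood_group", as_index=False).max()

# Row-wise alternative: keep the whole row where size is largest in each group
top_rows = sizes.loc[sizes.groupby("neighbourhood_group")["size"].idxmax()]

print(colwise)
print(top_rows)
```

In this toy case `colwise` pairs Woodlawn with 42, while `top_rows` reports Allerton with 42.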

✔ Compute the mean, variance, maximum, and minimum of price for each neighbourhood_group

In [ ]:

df[['neighbourhood_group', 'price']].groupby('neighbourhood_group').agg(['mean','var','max','min'])

Out[ ]:

	price
	mean	var	max	min
neighbourhood_group
Bronx	87.496792	11386.885081	2500	0
Brooklyn	124.383207	34921.719135	10000	0
Manhattan	196.875814	84904.159185	10000	0
Queens	99.517649	27923.130227	10000	10
Staten Island	114.812332	77073.088342	5000	13

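+ agg : applies one or more aggregation functions to each group

'mean': mean

'sum': sum

'min': minimum

'max': maximum

'count': number of valid (non-null) values

'std': standard deviation

'var': variance

'median': median

'quantile': the value at a given quantile q, where q lies between 0 and 1. The string form uses the default q=0.5; for another q, pass a function such as lambda s: s.quantile(q)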
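The aggregation names listed above can also be combined with custom output names, and a quantile other than the median can be supplied through a function. A minimal sketch on made-up prices (the `toy` frame and the `q90` column name are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy prices
toy = pd.DataFrame({
    "neighbourhood_group": ["Bronx", "Bronx", "Bronx", "Queens", "Queens"],
    "price": [50, 100, 150, 80, 120],
})

# Named aggregation: keyword = output column, value = aggregation
summary = toy.groupby("neighbourhood_group")["price"].agg(
    mean="mean",
    var="var",                       # sample variance (ddof=1)
    q90=lambda s: s.quantile(0.9),   # 90th percentile via a function
)
print(summary)
```

Named aggregation yields flat column names, which avoids the two-level column header produced by passing a list of strings to a DataFrame selection.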

✔ Compute the mean, variance, maximum, and minimum of reviews_per_month for each neighbourhood_group

In [ ]:

df[['neighbourhood_group','reviews_per_month']].groupby('neighbourhood_group').agg(['mean','var','max','min'])

Out[ ]:

	reviews_per_month
	mean	var	max	min
neighbourhood_group
Bronx	1.837831	2.799878	10.34	0.02
Brooklyn	1.283212	2.299040	14.00	0.01
Manhattan	1.272131	2.651206	58.50	0.01
Queens	1.941200	4.897848	20.94	0.01
Staten Island	1.872580	2.840895	10.12	0.02

✔ Compute the mean price for each combination of neighbourhood and neighbourhood_group

In [ ]:

df.groupby(['neighbourhood', 'neighbourhood_group']).price.mean()

Out[ ]:

neighbourhood    neighbourhood_group
Allerton         Bronx                   87.595238
Arden Heights    Staten Island           67.250000
Arrochar         Staten Island          115.000000
Arverne          Queens                 171.779221
Astoria          Queens                 117.187778
                                           ...    
Windsor Terrace  Brooklyn               138.993631
Woodhaven        Queens                  67.170455
Woodlawn         Bronx                   60.090909
Woodrow          Staten Island          700.000000
Woodside         Queens                  85.097872
Name: price, Length: 221, dtype: float64

+ price by neighbourhood_group : df[['neighbourhood_group', 'price']].groupby('neighbourhood_group')

+ by neighbourhood and neighbourhood_group : df.groupby(['neighbourhood', 'neighbourhood_group'])

✔ Compute the mean price for each combination of neighbourhood and neighbourhood_group without hierarchical indexing

In [ ]:

df.groupby(['neighbourhood','neighbourhood_group']).price.mean().unstack()

Out[ ]:

neighbourhood_group	Bronx	Brooklyn	Manhattan	Queens	Staten Island
neighbourhood
Allerton	87.595238	NaN	NaN	NaN	NaN
Arden Heights	NaN	NaN	NaN	NaN	67.25
Arrochar	NaN	NaN	NaN	NaN	115.00
Arverne	NaN	NaN	NaN	171.779221	NaN
Astoria	NaN	NaN	NaN	117.187778	NaN
...	...	...	...	...	...
Windsor Terrace	NaN	138.993631	NaN	NaN	NaN
Woodhaven	NaN	NaN	NaN	67.170455	NaN
Woodlawn	60.090909	NaN	NaN	NaN	NaN
Woodrow	NaN	NaN	NaN	NaN	700.00
Woodside	NaN	NaN	NaN	85.097872	NaN

221 rows × 5 columns

+ unstack() : pivots a level of a hierarchical index (by default the innermost one) of a DataFrame or Series into columns
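What `unstack()` does to a two-level result is easier to see on a few rows. A minimal sketch on made-up prices (the `toy` frame is an assumption); `reset_index()` is included as another way to drop the hierarchical index while keeping a long layout.

```python
import pandas as pd

# Hypothetical toy prices
toy = pd.DataFrame({
    "neighbourhood": ["Allerton", "Allerton", "Astoria", "Astoria"],
    "neighbourhood_group": ["Bronx", "Bronx", "Queens", "Queens"],
    "price": [80, 100, 110, 130],
})

# Series indexed by (neighbourhood, neighbourhood_group)
stacked = toy.groupby(["neighbourhood", "neighbourhood_group"])["price"].mean()

wide = stacked.unstack()       # inner level becomes columns; missing pairs are NaN
flat = stacked.reset_index()   # both levels become ordinary columns

print(wide)
print(flat)
```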

✔ Compute the mean price for each combination of neighbourhood and neighbourhood_group without hierarchical indexing, and fill NaN values with -999

In [ ]:

df.groupby(['neighbourhood','neighbourhood_group']).price.mean().unstack().fillna(-999)

Out[ ]:

neighbourhood_group	Bronx	Brooklyn	Manhattan	Queens	Staten Island
neighbourhood
Allerton	87.595238	-999.000000	-999.0	-999.000000	-999.00
Arden Heights	-999.000000	-999.000000	-999.0	-999.000000	67.25
Arrochar	-999.000000	-999.000000	-999.0	-999.000000	115.00
Arverne	-999.000000	-999.000000	-999.0	171.779221	-999.00
Astoria	-999.000000	-999.000000	-999.0	117.187778	-999.00
...	...	...	...	...	...
Windsor Terrace	-999.000000	138.993631	-999.0	-999.000000	-999.00
Woodhaven	-999.000000	-999.000000	-999.0	67.170455	-999.00
Woodlawn	60.090909	-999.000000	-999.0	-999.000000	-999.00
Woodrow	-999.000000	-999.000000	-999.0	-999.000000	700.00
Woodside	-999.000000	-999.000000	-999.0	85.097872	-999.00

221 rows × 5 columns
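The fill step can also be folded into `unstack` through its `fill_value` parameter. A minimal sketch on made-up prices (the `toy` frame is an assumption), checking that both spellings give the same table:

```python
import pandas as pd

# Hypothetical toy prices
toy = pd.DataFrame({
    "neighbourhood": ["Allerton", "Allerton", "Astoria", "Astoria"],
    "neighbourhood_group": ["Bronx", "Bronx", "Queens", "Queens"],
    "price": [80, 100, 110, 130],
})
stacked = toy.groupby(["neighbourhood", "neighbourhood_group"])["price"].mean()

wide_a = stacked.unstack().fillna(-999)       # pivot, then fill the gaps
wide_b = stacked.unstack(fill_value=-999)     # fill while pivoting

print(wide_b)
print((wide_a.values == wide_b.values).all())
```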

✔ Among the rows whose neighbourhood_group is Queens, compute the mean, variance, maximum, and minimum of price for each neighbourhood

In [ ]:

Ans = df[df.neighbourhood_group=='Queens'].groupby(['neighbourhood']).price.agg(['mean','var','max','min'])
Ans.head(5)

+ df[df.neighbourhood_group=='Queens'] : keeps only the rows whose neighbourhood_group is Queens before grouping

+ Ans : one row per Queens neighbourhood, with the columns mean, var, max, min

✔ Count the room_type values for each neighbourhood_group, then compute the proportion of each room_type within its neighbourhood_group

In [ ]:

Ans = df[['neighbourhood_group','room_type']].groupby(['neighbourhood_group','room_type']).size().unstack()
Ans.loc[:,:] = (Ans.values / Ans.sum(axis=1).values.reshape(-1,1))
Ans

Out[ ]:

room_type	Entire home/apt	Private room	Shared room
neighbourhood_group
Bronx	0.347388	0.597617	0.054995
Brooklyn	0.475478	0.503979	0.020543
Manhattan	0.609344	0.368496	0.022160
Queens	0.369926	0.595129	0.034945
Staten Island	0.471850	0.504021	0.024129

+ Ans.sum(axis=1) : the sum of each row

+ .values.reshape(-1,1) : turns the 1-D array into a column vector

+ Ans.values / Ans.sum(axis=1).values.reshape(-1,1) : divides every value of Ans by its row sum to obtain the proportions

+ .loc[:,:] : indexing that selects all rows and all columns
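The row-wise division above can also be written without dropping to NumPy arrays. The sketch below shows two alternatives on made-up rows: `DataFrame.div` with `axis=0`, and `pd.crosstab` with `normalize="index"`. The `toy` frame is an assumption, and `fill_value=0` is used here so that a missing combination counts as zero.

```python
import pandas as pd

# Hypothetical toy listings
toy = pd.DataFrame({
    "neighbourhood_group": ["Bronx", "Bronx", "Bronx", "Bronx", "Queens", "Queens"],
    "room_type": ["Private room", "Private room", "Private room",
                  "Entire home/apt", "Entire home/apt", "Shared room"],
})

counts = (
    toy.groupby(["neighbourhood_group", "room_type"])
       .size()
       .unstack(fill_value=0)
)

# Divide each row by its own total, aligning on the row index
ratios = counts.div(counts.sum(axis=1), axis=0)

# Same table in one call
ratios_ct = pd.crosstab(toy["neighbourhood_group"], toy["room_type"], normalize="index")

print(ratios)
```

Because `div` returns a new float DataFrame, it sidesteps assigning float values into integer columns through `.loc[:,:]`.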
