Pandas

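Prerequisite: Table

A data structure (container) that stores and manages data in rows and columns.
Rows usually represent entities, and columns represent their attributes.

Getting started with Pandas

Import the pandas library, conventionally under the alias pd: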

import pandas as pd

Ⅱ. Handling 1-D data with pandas - Series

What is a Series?

A 1-D labeled array.
An index (the labels) can be specified.

s = pd.Series([1,4,9,16,25])
s

0     1
1     4
2     9
3    16
4    25
dtype: int64

t = pd.Series({'one':1, 'two':2, 'three':3, 'four':4,'five':5})
t

one      1
two      2
three    3
four     4
five     5
dtype: int64
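The notes mention that an index can be specified but only show the default integer index and dict keys. A minimal sketch of passing the labels explicitly through the index argument follows; the Series u and its labels "a" to "c" are made up for illustration.

```python
import pandas as pd

# Hypothetical data: the values and the labels "a".."c" are illustrative only.
u = pd.Series([10, 20, 30], index=["a", "b", "c"])

print(u["b"])          # label-based lookup
print(list(u.index))   # the labels that were passed in via index=
```

Passing a dict, as with t above, is equivalent to passing the values plus index= with the dict keys.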

Series + Numpy

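A Series behaves much like an ndarray!

t.iloc[1] # access by position; plain t[1] on a string-labeled Series is deprecated in recent pandas

t.iloc[1:3]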

two      2
three    3
dtype: int64

s[s > s.median()] # keep only the values greater than the Series' own median

3    16
4    25
dtype: int64

s[[4,1,1]]

4    25
1     4
1     4
dtype: int64

import numpy as np

np.exp(s)

0    2.718282e+00
1    5.459815e+01
2    8.103084e+03
3    8.886111e+06
4    7.200490e+10
dtype: float64
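As a further illustration of the ndarray-like behavior, the sketch below applies a NumPy function and plain arithmetic to the same values as s. The names roots, doubled, and arr are assumptions introduced only for this example.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 4, 9, 16, 25])

roots = np.sqrt(s)     # NumPy ufuncs work element by element and keep the index
doubled = s * 2        # arithmetic is vectorized, as with an ndarray
arr = s.to_numpy()     # the underlying values as a plain ndarray

print(roots)
print(doubled)
print(type(arr))
```

The results of np.sqrt(s) and s * 2 are again Series, so the labels stay attached to the values.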

Series + dict

A Series is also similar to a dict.

t

one      1
two      2
three    3
four     4
five     5
dtype: int64

'six' in t

False

'five' in t

True

t.get('five')

t.get('six')

t.get('six', 0) # like dict.get(): returns the default when the label is missing
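The three get() calls above are shown without their results. The following sketch, which rebuilds t so that it runs on its own, stores each result in a variable; the names found, missing, fallback, and keys are assumptions for illustration.

```python
import pandas as pd

t = pd.Series({'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5})

found = t.get('five')        # existing label: returns the stored value
missing = t.get('six')       # missing label: returns None instead of raising
fallback = t.get('six', 0)   # missing label with an explicit default

keys = list(t.keys())        # keys() returns the index, much like dict.keys()

print(found, missing, fallback)
print(keys)
```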

Naming a Series

A Series has a name attribute.
The name can be given when the Series is first created.

s = pd.Series(np.random.randn(5), name="random_nums") # randn() draws random samples from the standard normal (Gaussian) distribution
s

0   -0.412482
1    0.938129
2    1.569555
3   -0.181722
4    0.555711
Name: random_nums, dtype: float64

s.name = "임의의 난수" # "random numbers" in Korean; the name can be reassigned at any time
s

0   -0.412482
1    0.938129
2    1.569555
3   -0.181722
4    0.555711
Name: 임의의 난수, dtype: float64

Ⅲ. Handling 2-D data with pandas - DataFrame

What is a DataFrame?

A 2-D labeled table.
An index can also be specified.

d = {"height":[1,2,3,4], "weight":[30,40,50,60]}
df = pd.DataFrame(d)
df

	height	weight
0	1	30
1	2	40
2	3	50
3	4	60

# Check the dtype of each column
df.dtypes

height    int64
weight    int64
dtype: object
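The DataFrame above uses the default integer index. A minimal sketch of specifying the row labels with the index argument follows; the labels "a" to "d" and the name labeled_df are hypothetical.

```python
import pandas as pd

d = {"height": [1, 2, 3, 4], "weight": [30, 40, 50, 60]}

# The row labels "a".."d" are made up for illustration.
labeled_df = pd.DataFrame(d, index=["a", "b", "c", "d"])

print(labeled_df)
print(labeled_df.shape)           # (number of rows, number of columns)
print(list(labeled_df.columns))   # column labels come from the dict keys
```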

From CSV to dataframe

CSV (Comma-Separated Values) file -> DataFrame

A CSV file can be loaded into a DataFrame.
Use read_csv().

# If country_wise_latest.csv exists in the same directory:

covid = pd.read_csv("./country_wise_latest.csv")

covid

	Country/Region	Confirmed	Deaths	Recovered	Active	New cases	New deaths	New recovered	Deaths / 100 Cases	Recovered / 100 Cases	Deaths / 100 Recovered	Confirmed last week	1 week change	1 week % increase	WHO Region
0	Afghanistan	36263	1269	25198	9796	106	10	18	3.50	69.49	5.04	35526	737	2.07	Eastern Mediterranean
1	Albania	4880	144	2745	1991	117	6	63	2.95	56.25	5.25	4171	709	17.00	Europe
2	Algeria	27973	1163	18837	7973	616	8	749	4.16	67.34	6.17	23691	4282	18.07	Africa
3	Andorra	907	52	803	52	10	0	0	5.73	88.53	6.48	884	23	2.60	Europe
4	Angola	950	41	242	667	18	1	0	4.32	25.47	16.94	749	201	26.84	Africa
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
182	West Bank and Gaza	10621	78	3752	6791	152	2	0	0.73	35.33	2.08	8916	1705	19.12	Eastern Mediterranean
183	Western Sahara	10	1	8	1	0	0	0	10.00	80.00	12.50	10	0	0.00	Africa
184	Yemen	1691	483	833	375	10	4	36	28.56	49.26	57.98	1619	72	4.45	Eastern Mediterranean
185	Zambia	4552	140	2815	1597	71	1	465	3.08	61.84	4.97	3326	1226	36.86	Africa
186	Zimbabwe	2704	36	542	2126	192	2	24	1.33	20.04	6.64	1713	991	57.85	Africa

187 rows × 15 columns
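For readers who do not have the CSV file at hand, read_csv() can be tried on an in-memory string. The sketch below assumes a three-row, three-column excerpt of the table above; csv_text and sample are hypothetical names, and io.StringIO stands in for the file on disk.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV: a small excerpt of the table shown above.
csv_text = """Country/Region,Confirmed,WHO Region
Afghanistan,36263,Eastern Mediterranean
Albania,4880,Europe
Algeria,27973,Africa
"""

# read_csv accepts a path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)
print(sample.dtypes)   # numeric columns are inferred automatically
```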

Using Pandas 1. Viewing part of the data

head(n): returns the first n rows

# View the first 5 rows

covid.head(5)

	Country/Region	Confirmed	Deaths	Recovered	Active	New cases	New deaths	New recovered	Deaths / 100 Cases	Recovered / 100 Cases	Deaths / 100 Recovered	Confirmed last week	1 week change	1 week % increase	WHO Region
0	Afghanistan	36263	1269	25198	9796	106	10	18	3.50	69.49	5.04	35526	737	2.07	Eastern Mediterranean
1	Albania	4880	144	2745	1991	117	6	63	2.95	56.25	5.25	4171	709	17.00	Europe
2	Algeria	27973	1163	18837	7973	616	8	749	4.16	67.34	6.17	23691	4282	18.07	Africa
3	Andorra	907	52	803	52	10	0	0	5.73	88.53	6.48	884	23	2.60	Europe
4	Angola	950	41	242	667	18	1	0	4.32	25.47	16.94	749	201	26.84	Africa

tail(n): returns the last n rows

# View the last 5 rows

covid.tail(5)

	Country/Region	Confirmed	Deaths	Recovered	Active	New cases	New deaths	New recovered	Deaths / 100 Cases	Recovered / 100 Cases	Deaths / 100 Recovered	Confirmed last week	1 week change	1 week % increase	WHO Region
182	West Bank and Gaza	10621	78	3752	6791	152	2	0	0.73	35.33	2.08	8916	1705	19.12	Eastern Mediterranean
183	Western Sahara	10	1	8	1	0	0	0	10.00	80.00	12.50	10	0	0.00	Africa
184	Yemen	1691	483	833	375	10	4	36	28.56	49.26	57.98	1619	72	4.45	Eastern Mediterranean
185	Zambia	4552	140	2815	1597	71	1	465	3.08	61.84	4.97	3326	1226	36.86	Africa
186	Zimbabwe	2704	36	542	2126	192	2	24	1.33	20.04	6.64	1713	991	57.85	Africa
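A small sketch of head() and tail() on a made-up 10-row frame, including the case where n is omitted. The frame nums and its column n are assumptions for illustration.

```python
import pandas as pd

# Hypothetical 10-row frame holding the numbers 0..9.
nums = pd.DataFrame({"n": range(10)})

first_three = nums.head(3)    # first 3 rows
last_two = nums.tail(2)       # last 2 rows
default_head = nums.head()    # n defaults to 5 when omitted

print(first_three)
print(last_two)
print(len(default_head))
```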

Using Pandas 2. Accessing data

df['column_name'] or df.column_name

covid['Confirmed']

0      36263
1       4880
2      27973
3        907
4        950
       ...
182    10621
183       10
184     1691
185     4552
186     2704
Name: Confirmed, Length: 187, dtype: int64

covid.Active # attribute access does not work when the column name contains a space

0      9796
1      1991
2      7973
3        52
4       667
       ...
182    6791
183       1
184     375
185    1597
186    2126
Name: Active, Length: 187, dtype: int64
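To illustrate the restriction on attribute access, the sketch below builds a three-row excerpt with a column whose name contains a space. The frame mini is hypothetical; its values are copied from the table above.

```python
import pandas as pd

# Hypothetical three-row excerpt using values from the table above.
mini = pd.DataFrame({
    "Country/Region": ["Afghanistan", "Albania", "Algeria"],
    "Active": [9796, 1991, 7973],
    "New cases": [106, 117, 616],
})

active = mini.Active             # attribute access works for simple names
new_cases = mini["New cases"]    # bracket access is needed when the name has a space
subset = mini[["Country/Region", "New cases"]]   # a list of names returns a DataFrame

print(new_cases)
print(subset)
```

Bracket access works for every column name, so it is the safer default in scripts.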

Tip! Each column of a DataFrame is a Series.

type(covid['Confirmed'])

pandas.core.series.Series

covid['Confirmed'][1]

covid['Confirmed'][1:5]

1     4880
2    27973
3      907
4      950
Name: Confirmed, dtype: int64

Using Pandas 3. Accessing data with conditions

# Select the countries with more than 100 new cases

covid[covid["New cases"] > 100]

	Country/Region	Confirmed	Deaths	Recovered	Active	New cases	New deaths	New recovered	Deaths / 100 Cases	Recovered / 100 Cases	Deaths / 100 Recovered	Confirmed last week	1 week change	1 week % increase	WHO Region
0	Afghanistan	36263	1269	25198	9796	106	10	18	3.50	69.49	5.04	35526	737	2.07	Eastern Mediterranean
1	Albania	4880	144	2745	1991	117	6	63	2.95	56.25	5.25	4171	709	17.00	Europe
2	Algeria	27973	1163	18837	7973	616	8	749	4.16	67.34	6.17	23691	4282	18.07	Africa
6	Argentina	167416	3059	72575	91782	4890	120	2057	1.83	43.35	4.21	130774	36642	28.02	Americas
8	Australia	15303	167	9311	5825	368	6	137	1.09	60.84	1.79	12428	2875	23.13	Western Pacific
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
177	United Kingdom	301708	45844	1437	254427	688	7	3	15.19	0.48	3190.26	296944	4764	1.60	Europe
179	Uzbekistan	21209	121	11674	9414	678	5	569	0.57	55.04	1.04	17149	4060	23.67	Europe
180	Venezuela	15988	146	9959	5883	525	4	213	0.91	62.29	1.47	12334	3654	29.63	Americas
182	West Bank and Gaza	10621	78	3752	6791	152	2	0	0.73	35.33	2.08	8916	1705	19.12	Eastern Mediterranean
186	Zimbabwe	2704	36	542	2126	192	2	24	1.33	20.04	6.64	1713	991	57.85	Africa

82 rows × 15 columns

covid["WHO Region"].unique() # returns the distinct category values as a NumPy array
covid[covid["WHO Region"] == "South-East Asia"]

	Country/Region	Confirmed	Deaths	Recovered	Active	New cases	New deaths	New recovered	Deaths / 100 Cases	Recovered / 100 Cases	Deaths / 100 Recovered	Confirmed last week	1 week change	1 week % increase	WHO Region
13	Bangladesh	226225	2965	125683	97577	2772	37	1801	1.31	55.56	2.36	207453	18772	9.05	South-East Asia
19	Bhutan	99	0	86	13	4	0	1	0.00	86.87	0.00	90	9	10.00	South-East Asia
27	Burma	350	6	292	52	0	0	2	1.71	83.43	2.05	341	9	2.64	South-East Asia
79	India	1480073	33408	951166	495499	44457	637	33598	2.26	64.26	3.51	1155338	324735	28.11	South-East Asia
80	Indonesia	100303	4838	58173	37292	1525	57	1518	4.82	58.00	8.32	88214	12089	13.70	South-East Asia
106	Maldives	3369	15	2547	807	67	0	19	0.45	75.60	0.59	2999	370	12.34	South-East Asia
119	Nepal	18752	48	13754	4950	139	3	626	0.26	73.35	0.35	17844	908	5.09	South-East Asia
158	Sri Lanka	2805	11	2121	673	23	0	15	0.39	75.61	0.52	2730	75	2.75	South-East Asia
167	Thailand	3297	58	3111	128	6	0	2	1.76	94.36	1.86	3250	47	1.45	South-East Asia
168	Timor-Leste	24	0	0	24	0	0	0	0.00	0.00	0.00	24	0	0.00	South-East Asia
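The two conditions used above can also be combined. The sketch below assumes a five-row excerpt (values copied from the first rows of the table) and uses & for "and" plus isin() for membership tests; mini, mask, result, and europe_or_africa are hypothetical names.

```python
import pandas as pd

# Hypothetical five-row excerpt using values from the table above.
mini = pd.DataFrame({
    "Country/Region": ["Afghanistan", "Albania", "Algeria", "Andorra", "Angola"],
    "New cases": [106, 117, 616, 10, 18],
    "WHO Region": ["Eastern Mediterranean", "Europe", "Africa", "Europe", "Africa"],
})

# Each condition needs its own parentheses; use & (and), | (or), ~ (not)
mask = (mini["New cases"] > 100) & (mini["WHO Region"] == "Africa")
result = mini[mask]

# isin() keeps the rows whose value is one of several candidates
europe_or_africa = mini[mini["WHO Region"].isin(["Europe", "Africa"])]

print(result)
print(europe_or_africa)
```

The Python keywords and / or do not work on a Series, which is why the element-wise operators are used here.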

Using Pandas 4. Accessing data by row

# Example data - library book information

books_dict = {"Available":[True, True, False], "Location":[102, 215, 323], "Genre":["Programming", "Physics", "Math"]}

books_df = pd.DataFrame(books_dict, index=["버그란 무엇인가", "두근두근 물리학","미분해줘 홈즈"])

books_df

	Available	Location	Genre
버그란 무엇인가	True	102	Programming
두근두근 물리학	True	215	Physics
미분해줘 홈즈	False	323	Math

Access by index label: `.loc[row, col]`

books_df.loc["두근두근 물리학"]

Available       True
Location         215
Genre        Physics
Name: 두근두근 물리학, dtype: object

# Is the book "미분해줘 홈즈" available to borrow?

books_df.loc["미분해줘 홈즈", "Available"]

False

Access by integer position: `.iloc[rowidx, colidx]`

books_df.iloc[1,1]

books_df.iloc[1,:]

Available       True
Location         215
Genre        Physics
Name: 두근두근 물리학, dtype: object
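One difference between the two accessors that is easy to miss is slicing: a label slice with .loc includes its end label, while a positional slice with .iloc excludes its end position. The sketch below rebuilds books_df so it runs on its own; by_label, by_position, and cell are hypothetical names.

```python
import pandas as pd

books_dict = {
    "Available": [True, True, False],
    "Location": [102, 215, 323],
    "Genre": ["Programming", "Physics", "Math"],
}
books_df = pd.DataFrame(
    books_dict, index=["버그란 무엇인가", "두근두근 물리학", "미분해줘 홈즈"]
)

# .loc slices by label and INCLUDES the end label
by_label = books_df.loc["버그란 무엇인가":"두근두근 물리학", "Location"]

# .iloc slices by position and EXCLUDES the end position
by_position = books_df.iloc[0:2, 1]

cell = books_df.iloc[1, 1]   # second row, second column

print(by_label)
print(by_position)
print(cell)
```

Both slices select the same two rows here, even though the end points are written differently.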

Using Pandas 5. groupby

Split: divide the DataFrame into groups based on a given key
Apply: reduce each group with an aggregate function such as sum(), mean(), or median()
Combine: assemble the applied results into a new Series (group_key : applied_value)

.groupby()

covid.head(5)

	Country/Region	Confirmed	Deaths	Recovered	Active	New cases	New deaths	New recovered	Deaths / 100 Cases	Recovered / 100 Cases	Deaths / 100 Recovered	Confirmed last week	1 week change	1 week % increase	WHO Region
0	Afghanistan	36263	1269	25198	9796	106	10	18	3.50	69.49	5.04	35526	737	2.07	Eastern Mediterranean
1	Albania	4880	144	2745	1991	117	6	63	2.95	56.25	5.25	4171	709	17.00	Europe
2	Algeria	27973	1163	18837	7973	616	8	749	4.16	67.34	6.17	23691	4282	18.07	Africa
3	Andorra	907	52	803	52	10	0	0	5.73	88.53	6.48	884	23	2.60	Europe
4	Angola	950	41	242	667	18	1	0	4.32	25.47	16.94	749	201	26.84	Africa

# Confirmed cases per WHO Region

# 1. Extract only the Confirmed column from covid.
# 2. Group it by covid's WHO Region column.

covid_by_region = covid['Confirmed'].groupby(by=covid["WHO Region"])
covid_by_region
# <pandas.core.groupby.generic.SeriesGroupBy object at 0x000001F5C5DD5730> - only the split step has been applied so far

covid_by_region.sum()

WHO Region
Africa                    723207
Americas                 8839286
Eastern Mediterranean    1490744
Europe                   3299523
South-East Asia          1835297
Western Pacific           292428
Name: Confirmed, dtype: int64

# Mean number of confirmed cases per country in each region

covid_by_region.mean()

WHO Region
Africa                    15066.812500
Americas                 252551.028571
Eastern Mediterranean     67761.090909
Europe                    58920.053571
South-East Asia          183529.700000
Western Pacific           18276.750000
Name: Confirmed, dtype: float64

# Median

covid_by_region.median()

WHO Region
Africa                    2129.5
Americas                  7340.0
Eastern Mediterranean    28575.0
Europe                   12191.0
South-East Asia           3333.0
Western Pacific           1009.5
Name: Confirmed, dtype: float64
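The split, apply, and combine steps can be spelled out by hand on a small frame and compared with groupby(). This is only a sketch of the idea, not how pandas implements it internally; the four-row frame mini (values copied from rows 1 to 4 of the table) and the names manual and grouped are assumptions.

```python
import pandas as pd

# Hypothetical four-row excerpt: Albania, Algeria, Andorra, Angola.
mini = pd.DataFrame({
    "WHO Region": ["Europe", "Africa", "Europe", "Africa"],
    "Confirmed": [4880, 27973, 907, 950],
})

manual = {}
for region in mini["WHO Region"].unique():
    group = mini.loc[mini["WHO Region"] == region, "Confirmed"]   # split
    manual[region] = group.sum()                                  # apply
manual = pd.Series(manual).sort_index()                           # combine

# The same result in one line with groupby
grouped = mini.groupby("WHO Region")["Confirmed"].sum()

print(manual)
print(grouped)
```

Swapping sum() for mean() or median() changes only the apply step; the split and combine steps stay the same.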

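Sorting data in a DataFrame

.sort_values(by=, axis=, ascending=False)

by is the label (or list of labels) of the column or row to sort by.
axis=0 sorts the rows and axis=1 sorts the columns (rows are the usual case).
ascending controls the sort direction: the default True sorts in ascending order, and False sorts in descending order.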
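A minimal sketch of sort_values() on a five-row excerpt of the table (values copied from its first rows). The frame mini and the name by_confirmed are hypothetical.

```python
import pandas as pd

# Hypothetical five-row excerpt using values from the covid table.
mini = pd.DataFrame({
    "Country/Region": ["Afghanistan", "Albania", "Algeria", "Andorra", "Angola"],
    "Confirmed": [36263, 4880, 27973, 907, 950],
})

# Sort the rows by the Confirmed column, largest first
by_confirmed = mini.sort_values(by="Confirmed", ascending=False)

print(by_confirmed)
print(list(by_confirmed.index))   # the original row labels travel with the rows
```

sort_values() returns a new DataFrame and leaves mini unchanged, so the result has to be assigned to a name to be kept.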