Python Data Analysis: Using the pandas Library 1: Series / DataFrame


wangatheringdata 2025. 3. 2. 16:23


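1. Features

1) Fast and efficient data representation

- Rows and columns can be swapped (transposed)

- Suited to analyzing real-world data

2) Support for many kinds of data

- Handles heterogeneous data, e.g. time series, labeled data, and varied observational data

3) Series: one-dimensional labeled array / DataFrame: two-dimensional table of rows and columns

4) Handling of missing (null) data

- Adding and deleting data: rows and columns can be inserted, appended, and deleted
- Sorting and manipulating data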
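
A few of the features listed above can be sketched in a handful of lines. This is only a minimal illustration: the table `scores`, its labels, and its values are made up for the example and do not come from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample table; names and values are invented for illustration.
scores = pd.DataFrame(
    {"math": [90, 80, np.nan], "english": [70, np.nan, 60]},
    index=["kim", "lee", "park"],
)

# Swap rows and columns.
transposed = scores.T

# Count missing values per column, then fill them with 0.
missing_per_column = scores.isna().sum()
filled = scores.fillna(0)

# Append a column, then delete it again.
with_total = filled.assign(total=filled["math"] + filled["english"])
without_total = with_total.drop(columns="total")

print(transposed)
print(missing_per_column)
print(with_total)
```

Filling with 0 is just one possible choice here; dropping the incomplete rows with dropna() would be another.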

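2. Functionality

1) Convert lists, dictionaries, NumPy arrays, and similar data into a DataFrame

2) Work with files such as CSV and XLS

3) Analyze remote data such as CSV or JSON from a web page via URL, as well as data from databases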
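
As a rough sketch of items 1) to 3), the same small table can be built from a list, a dictionary, a NumPy array, and CSV text. The column names `a` and `b` are placeholders. read_csv also accepts a file path or a URL; an in-memory buffer is assumed here only so the example runs without any external file.

```python
import io

import numpy as np
import pandas as pd

# From a list of rows (column names supplied separately).
from_list = pd.DataFrame([[1, 10], [2, 20]], columns=["a", "b"])

# From a dictionary of columns.
from_dict = pd.DataFrame({"a": [1, 2], "b": [10, 20]})

# From a NumPy array.
from_array = pd.DataFrame(np.array([[1, 10], [2, 20]]), columns=["a", "b"])

# From CSV text held in memory (stands in for a file path or URL).
csv_text = "a,b\n1,10\n2,20\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

print(from_list)
print(from_csv)
```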

4) Viewing and inspecting data

    - mean() : mean of a column
    - corr() : correlation
    - count() : number of non-null values in a column
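
The three methods above could be tried out as follows. The DataFrame `df` and its columns `x`, `y`, `z` are hypothetical; `z` contains one missing value to show that count() skips it.

```python
import pandas as pd

# Hypothetical data for illustration; z has one missing value.
df = pd.DataFrame(
    {"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [4.0, 3.0, None, 1.0]}
)

x_mean = df["x"].mean()           # mean of one column
xy_corr = df["x"].corr(df["y"])   # Pearson correlation between two columns
counts = df.count()               # non-null values per column

print(x_mean)
print(xy_corr)
print(counts)
```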

5) Filtering, sorting, grouping

- sort_values() : sort by values
- groupby() : split the data into groups by a given key
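
A small sketch of filtering, sorting, and grouping, assuming a made-up table `trades` with a `ticker` and a `shares` column:

```python
import pandas as pd

# Hypothetical table; tickers and share counts are invented for illustration.
trades = pd.DataFrame(
    {
        "ticker": ["kt", "naver", "kt", "sk", "naver"],
        "shares": [10, 30, 20, 20, 10],
    }
)

# Filter: keep only rows with at least 20 shares.
large_trades = trades[trades["shares"] >= 20]

# Sort: largest share count first.
sorted_trades = trades.sort_values("shares", ascending=False)

# Group: total shares per ticker.
shares_by_ticker = trades.groupby("ticker")["shares"].sum()

print(large_trades)
print(sorted_trades)
print(shares_by_ticker)
```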

6) Data cleaning

Fill in missing data and replace a specific value with another value in bulk
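
One way this kind of cleaning might look, using replace() and fillna(). The Series `raw` and the placeholder string "unknown" are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical column with inconsistent spelling and one missing entry.
raw = pd.Series(["seoul", "Seoul", np.nan, "busan"])

# Replace one specific value with another everywhere it occurs.
unified = raw.replace("Seoul", "seoul")

# Fill the missing entry with a placeholder.
cleaned = unified.fillna("unknown")

print(cleaned)
```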

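3. Data structures

1) Using Series

1-1. Creating a Series

from pandas import Series

kakao = Series([92600, 92400, 92100, 94300, 92300])
print(type(kakao))

Choose a name for the Series and create it in the form

name = Series([value1, value2, value3 ... valueN])

Each value becomes one row of a single-column Series.

Because no index is given, a default integer index starting at 0 is assigned.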
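
The default index can be checked directly. A short sketch reusing the same values:

```python
import pandas as pd

kakao = pd.Series([92600, 92400, 92100, 94300, 92300])

# The automatically assigned labels run from 0 to len - 1.
default_labels = list(kakao.index)
first_price = kakao[0]

print(default_labels)
print(first_price)
```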

1-2. Creating a Series with an explicit index

# List to use as the Series index
idx = ['2024-01-02', '2024-02-06', '2024-03-02', '2024-04-07', '2024-06-08']

# List to use as the Series data; its length must match the index
data = [92600, 92400, 92100, 94300, 92300]

# Create the Series from the data and the index, bound to a name of your choice (here, sample)
sample = Series(data, index=idx)



# Both can also be passed in a single call
sample = Series([92600, 92400, 92100, 94300, 92300],
	index=['2024-01-02', '2024-02-06', '2024-03-02', '2024-04-07', '2024-06-08'])

Create the data as a list,

then create a separate list to serve as the index.

The two are combined into a single Series.

1-3. Extracting the index and the values separately

for data in sample.index:
    print(data)

for data in sample.values:
    print(data)

Use .index and .values to extract the index and the values, respectively.
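
If the labels and values are needed as plain lists, or together in one loop, something like the following may be handy. This is a sketch on a shortened version of `sample`; items() yields (label, value) pairs.

```python
import pandas as pd

sample = pd.Series(
    [92600, 92400, 92100],
    index=["2024-01-02", "2024-02-06", "2024-03-02"],
)

# Convert the index and the values to ordinary Python lists.
labels = sample.index.tolist()
prices = sample.values.tolist()

# Walk over labels and values together.
for date, price in sample.items():
    print(date, price)
```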

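1-4. Operations between Series

mine = Series([10, 20, 30], index=['naver', 'kt', 'sk'])
friend = Series([10, 30, 20], index=['kt', 'naver', 'sk'])

# Arithmetic between two Series matches values by index label, not by position
merge = mine+friend
sub = mine-friend
mul = mine*friend
div = mine/friend

# Results of printing merge, sub, mul, and div, in that order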
'''
kt       30
naver    40
sk       50
dtype: int64

kt       10
naver   -20
sk       10
dtype: int64

kt       200
naver    300
sk       600
dtype: int64

kt       2.000000
naver    0.333333
sk       1.500000
dtype: float64
'''
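
In the example above both Series carry the same labels. As a hedged side note, when a label exists in only one of them, the plain operator yields NaN for it, while add() with fill_value treats the missing side as a given number. The shortened `friend` below is a hypothetical variation, not part of the original example.

```python
import pandas as pd

mine = pd.Series([10, 20, 30], index=["naver", "kt", "sk"])
friend = pd.Series([10, 30], index=["kt", "naver"])  # hypothetical: no 'sk'

# 'sk' is missing from friend, so the sum for 'sk' is NaN.
total = mine + friend

# Treat the missing side as 0 instead.
total_filled = mine.add(friend, fill_value=0)

print(total)
print(total_filled)
```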

2) Using DataFrame

A dictionary is typically used to hold the data for each column.

2-1. Creating a DataFrame

from pandas import DataFrame

# Dictionary of key: value pairs used to build the DataFrame
raw_data = {'col0': [1, 2, 3, 4],
            'col1': [10, 20, 30, 40],
            'col2': [100, 200, 300, 400]}

# Create the DataFrame in the form name = DataFrame(dictionary)
dataframe_data = DataFrame(raw_data)
print(dataframe_data)

When a DataFrame is created from a dictionary,
the dictionary keys automatically become the DataFrame's column names.

The values in each list become the rows, labeled with a default integer index starting at 0.

# Each column of a DataFrame is a Series
print(type(dataframe_data['col0']))

# Result
'''
<class 'pandas.core.series.Series'>
'''
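
The column names and the default row labels described above can be inspected directly. A brief sketch on the same dictionary:

```python
import pandas as pd

raw_data = {
    "col0": [1, 2, 3, 4],
    "col1": [10, 20, 30, 40],
    "col2": [100, 200, 300, 400],
}
dataframe_data = pd.DataFrame(raw_data)

column_names = dataframe_data.columns.tolist()  # taken from the dictionary keys
row_labels = dataframe_data.index.tolist()      # default integer index
n_rows, n_cols = dataframe_data.shape

print(column_names)
print(row_labels)
print(n_rows, n_cols)
```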

2-2. Changing the DataFrame index

daeshin = {'open': [11650, 11100, 11200, 11100, 11000],
           'high': [12100, 11800, 11200, 11200, 11500],
           'low': [11600, 11050, 11120, 11450, 11450],
           'close': [12350, 11540, 11540, 11500, 11300]}

# Select the DataFrame's columns explicitly: columns=[list]
# To change the column order, list the dictionary keys in the desired order
daeshin_day = DataFrame(daeshin)
daeshin_day2 = DataFrame(daeshin, columns=['open', 'low', 'close', 'high'])

# Set the DataFrame's index from a list: index=[list]
# The list length must equal the number of values in each column, i.e. the number of rows
daeshin_index = ['21.12.01', '21.12.02', '21.12.03', '21.12.04', '21.12.05']
daeshin_day3 = DataFrame(daeshin, columns=['open', 'low', 'close', 'high'],
                         index=daeshin_index)
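
The code above sets the columns and index when the DataFrame is created. For a DataFrame that already exists, one possible approach is to assign to its index attribute and to select columns in a new order. The sketch below uses a shortened copy of the data; `df` and `reordered` are names chosen for the example.

```python
import pandas as pd

# Shortened copy of the daeshin data, for illustration only.
daeshin = {"open": [11650, 11100, 11200], "close": [12350, 11540, 11540]}
df = pd.DataFrame(daeshin)

# Replace the default 0, 1, 2 labels on the existing DataFrame.
df.index = ["21.12.01", "21.12.02", "21.12.03"]

# Reorder columns by selecting them in the desired order.
reordered = df[["close", "open"]]

print(df)
print(reordered)
```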

2-3. Extracting data from a DataFrame

# Extracting data 1: by column name, giving that column's value in every row
print(daeshin_day3['open'])

# Result
'''
21.12.01    11650
21.12.02    11100
21.12.03    11200
21.12.04    11100
21.12.05    11000
Name: open, dtype: int64
'''



# Extracting data 2: by index label slice, giving every column's value for those rows
print(daeshin_day3['21.12.01' : '21.12.05'])

# Result
'''
           open    low  close   high
21.12.01  11650  11600  12350  12100
21.12.02  11100  11050  11540  11800
21.12.03  11200  11120  11540  11200
21.12.04  11100  11450  11500  11200
21.12.05  11000  11450  11300  11500
'''
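
Square brackets select rows only when given a slice; a single label such as daeshin_day3['21.12.01'] is looked up as a column name and raises a KeyError. As a hedged alternative, .loc (by label) and .iloc (by position) make row access explicit. The sketch below uses a shortened copy of the data.

```python
import pandas as pd

# Shortened copy of the daeshin data, for illustration only.
df = pd.DataFrame(
    {"open": [11650, 11100, 11200], "close": [12350, 11540, 11540]},
    index=["21.12.01", "21.12.02", "21.12.03"],
)

one_row = df.loc["21.12.02"]                   # single row by label, as a Series
label_slice = df.loc["21.12.01":"21.12.02"]    # label slices include the end label
first_row = df.iloc[0]                         # row by integer position
one_cell = df.loc["21.12.03", "close"]         # single value by row and column label

print(one_row)
print(label_slice)
print(one_cell)
```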