튜토리얼

필수 데이터사이언스 방법론 가이드
카테고리는 수준별, 언어별, 분야별, 주제별 순으로 정렬되어 있습니다.

Python을 이용한 데이터 활용 예제

작성자
DaeSeo
작성일
2017-01-11 17:15
조회
117
카테고리 : [ 초급   |   Python   |   기타   |   데이터 전처리 ]

1) 제목


Python을 이용한 데이터 활용 예제

2) 작성자


컴퓨터정보통신공학과 박대서

3) 링크 및 출처


  • Python for Data Analysis, O'REILLY, Wes Mckinney

4) 문제 및 개요



5) 데이터


 

6) 해결방법 및 결과


bit.ly의 1.usa.gov 데이터
import json
path = '../pydata-book-master/ch02/usagov_bitly_data2012-03-16-1331923249.txt'
'''open(path, mode='r', encoding='utf-8').readline()'''

records = [json.loads(line) for line in open(path, mode='r', encoding='utf-8-sig')]
records[0]

결과:

{'a': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.78 Safari/535.11',
'al': 'en-US,en;q=0.8',
'c': 'US',
'cy': 'Danvers',
'g': 'A6qOVH',
'gr': 'MA',
'h': 'wfLQtf',
'hc': 1331822918,
'hh': '1.usa.gov',
'l': 'orofrog',
'll': [42.576698, -70.954903],
'nk': 1,
'r': 'http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/wfLQtf',
't': 1331923247,
'tz': 'America/New_York',
'u': 'http://www.ncbi.nlm.nih.gov/pubmed/22415991'}


time_zones = [rec['tz'] for rec in records if 'tz' in rec]
time_zones[:10]

결과 :

['America/New_York',
'America/Denver',
'America/New_York',
'America/Sao_Paulo',
'America/New_York',
'America/New_York',
'Europe/Warsaw',
'',
'',
'']

 

# collections의 Counter 라이브러리를 이용한 표준시간대 카운트
from collections import Counter
counts = Counter(time_zones)
counts.most_common(10)

결과 :

[('America/New_York', 1251),
('', 521),
('America/Chicago', 400),
('America/Los_Angeles', 382),
('America/Denver', 191),
('Europe/London', 74),
('Asia/Tokyo', 37),
('Pacific/Honolulu', 36),
('Europe/Madrid', 35),
('America/Sao_Paulo', 33)]

 

# Pandas를 이용한 데이터프레임 작성
from pandas import DataFrame, Series
import pandas as pd
frame = DataFrame(records)
frame.info()

결과 :

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3560 entries, 0 to 3559
Data columns (total 18 columns):
_heartbeat_ 120 non-null float64
a 3440 non-null object
al 3094 non-null object
c 2919 non-null object
cy 2919 non-null object
g 3440 non-null object
gr 2919 non-null object
h 3440 non-null object
hc 3440 non-null float64
hh 3440 non-null object
kw 93 non-null object
l 3440 non-null object
ll 2919 non-null object
nk 3440 non-null float64
r 3440 non-null object
t 3440 non-null float64
tz 3440 non-null object
u 3440 non-null object
dtypes: float64(4), object(14)
memory usage: 500.7+ KB


# 가장 많이 등장하는 표준시간대 그래프

%matplotlib inline
tz_counts[:10].plot(kind='barh', rot=0)


# 표준 시간대, 운영체제에 따른 그룹화
import numpy as np
operating_system = np.where(cframe['a'].str.contains('Windows'), 'Windows', 'Not Windows')
by_tz_os = cframe.groupby(['tz', operating_system])
agg_counts = by_tz_os.size().unstack().fillna(0)
agg_counts

Not Windows Windows
tz
245.0 276.0
Africa/Cairo 0.0 3.0
Africa/Casablanca 0.0 1.0
Africa/Ceuta 0.0 2.0
Africa/Johannesburg 0.0 1.0
Africa/Lusaka 0.0 1.0
America/Anchorage 4.0 1.0
America/Argentina/Buenos_Aires 1.0 0.0
America/Argentina/Cordoba 0.0 1.0
America/Argentina/Mendoza 0.0 1.0
America/Bogota 1.0 2.0
America/Caracas 0.0 1.0
America/Chicago 115.0 285.0
America/Chihuahua 1.0 1.0
America/Costa_Rica 0.0 1.0
America/Denver 132.0 59.0
America/Edmonton 2.0 4.0
America/Guayaquil 2.0 0.0
America/Halifax 1.0 3.0
America/Indianapolis 8.0 12.0
America/La_Paz 0.0 1.0
America/Lima 0.0 1.0
America/Los_Angeles 130.0 252.0
America/Managua 0.0 3.0
America/Mazatlan 1.0 0.0
America/Mexico_City 7.0 8.0
America/Monterrey 1.0 0.0
America/Montevideo 0.0 1.0
America/Montreal 3.0 6.0
America/New_York 339.0 912.0
... ... ...
Europe/Berlin 9.0 19.0
Europe/Bratislava 1.0 2.0
Europe/Brussels 1.0 3.0
Europe/Bucharest 1.0 3.0
Europe/Budapest 0.0 5.0
Europe/Copenhagen 2.0 3.0
Europe/Dublin 1.0 2.0
Europe/Helsinki 2.0 8.0
Europe/Lisbon 1.0 7.0
Europe/Ljubljana 0.0 1.0
Europe/London 43.0 31.0
Europe/Madrid 16.0 19.0
Europe/Malta 0.0 2.0
Europe/Moscow 1.0 9.0
Europe/Oslo 2.0 8.0
Europe/Paris 4.0 10.0
Europe/Prague 3.0 7.0
Europe/Riga 1.0 1.0
Europe/Rome 8.0 19.0
Europe/Skopje 0.0 1.0
Europe/Sofia 0.0 1.0
Europe/Stockholm 2.0 12.0
Europe/Uzhgorod 0.0 1.0
Europe/Vienna 3.0 3.0
Europe/Vilnius 0.0 2.0
Europe/Volgograd 0.0 1.0
Europe/Warsaw 1.0 15.0
Europe/Zurich 4.0 0.0
Pacific/Auckland 3.0 8.0
Pacific/Honolulu 0.0 36.0
 
indexer = agg_counts.sum(1).argsort()
indexer[25:29]
agg_counts.take(indexer[0:5])

결과 : 

tz
America/Mexico_City 86
America/Monterrey 11
America/Montevideo 14
America/Montreal 52
dtype: int64


Not Windows Windows
tz
America/Mazatlan 1.0 0.0
America/La_Paz 0.0 1.0
America/Lima 0.0 1.0
Europe/Volgograd 0.0 1.0
Europe/Sofia 0.0 1.0
# index를 기준으로 정렬 결과 출력
count_subset = agg_counts.take(indexer)[-10:]
count_subset

결과 :

Not Windows Windows
tz
America/Sao_Paulo 13.0 20.0
Europe/Madrid 16.0 19.0
Pacific/Honolulu 0.0 36.0
Asia/Tokyo 2.0 35.0
Europe/London 43.0 31.0
America/Denver 132.0 59.0
America/Los_Angeles 130.0 252.0
America/Chicago 115.0 285.0
245.0 276.0
America/New_York 339.0 912.0

#윈도우/비윈도우 사용자의 표준시간대 순위
count_subset.plot(kind= 'barh', stacked=True)


# 작은 그룹에 대한 상대 비율을 확인하기 위해 1로 정규화
normed_subset = count_subset.div(count_subset.sum(1), axis=0)
normed_subset.plot(kind='barh', stacked=True)


Movielens의 영화 평점 데이터

  • 장르, 개봉연도, 사용자 나이, 우편번호, 성별, 직업 정보 제공

  • 추전 시스템 개발에 주로 사용되는 데이터

  • 여기에서는 요구사항에 맞게 잘 쪼개는 방법을 소개

  • MovieLens 1M 데이터셋은 약 6000여명의 사용자들로 부터 수집한 4000여 편의 영화에 대한 백만개의 영화 평점을 포함


import pandas as pd
import os
encoding = 'latin1'

# 사용자, 평점, 영화정보 테이블을 각각 데이터 프레임으로 호출
upath = os.path.expanduser('../pydata-book-master/ch02/movielens/users.dat')
rpath = os.path.expanduser('../pydata-book-master/ch02/movielens/ratings.dat')
mpath = os.path.expanduser('../pydata-book-master/ch02/movielens/movies.dat')

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
mnames = ['movie_id', 'title', 'genres']

users = pd.read_csv(upath, sep='::', header=None, names=unames, encoding=encoding)
ratings = pd.read_csv(rpath, sep='::', header=None, names=rnames, encoding=encoding)
movies = pd.read_csv(mpath, sep='::', header=None, names=mnames, encoding=encoding)

users[:5]

결과:
user_id gender age occupation zip
0 1 F 1 10 48067
1 2 M 56 16 70072
2 3 M 25 15 55117
3 4 M 45 7 02460
4 5 M 25 20 55455
 

#movie ratings users 3개의 테이블 통합

data = pd.merge(pd.merge(ratings, users), movies)
data.ix[0]

 

# 영화별 남녀 평점 평균
mean_ratings = data.pivot_table('rating', index='title', columns='gender', aggfunc='mean')
mean_ratings[:5]
gender F M
title
$1,000,000 Duck (1971) 3.375000 2.761905
'Night Mother (1986) 3.388889 3.352941
'Til There Was You (1997) 2.675676 2.733333
'burbs, The (1989) 2.793478 2.962085
...And Justice for All (1979) 3.828571 3.689024
 

#250건 이상의 평점을 보유한 영화에 대해서만 정보 추출
ratings_by_title = data.groupby('title').size()
ratings_by_title[:10]

결과 :





title
$1,000,000 Duck (1971) 37
'Night Mother (1986) 70
'Til There Was You (1997) 52
'burbs, The (1989) 303
...And Justice for All (1979) 199
1-900 (1994) 2
10 Things I Hate About You (1999) 700
101 Dalmatians (1961) 565
101 Dalmatians (1996) 364
12 Angry Men (1957) 616
dtype: int64








active_titles = ratings_by_title.index[ratings_by_title >= 250]
mean_ratings = mean_ratings.ix[active_titles]
mean_ratings



gender F M
title
'burbs, The (1989) 2.793478 2.962085
10 Things I Hate About You (1999) 3.646552 3.311966
101 Dalmatians (1961) 3.791444 3.500000
101 Dalmatians (1996) 3.240000 2.911215
12 Angry Men (1957) 4.184397 4.328421
13th Warrior, The (1999) 3.112000 3.168000
2 Days in the Valley (1996) 3.488889 3.244813
20,000 Leagues Under the Sea (1954) 3.670103 3.709205
2001: A Space Odyssey (1968) 3.825581 4.129738
2010 (1984) 3.446809 3.413712
28 Days (2000) 3.209424 2.977707
39 Steps, The (1935) 3.965517 4.107692
54 (1998) 2.701754 2.782178
7th Voyage of Sinbad, The (1958) 3.409091 3.658879
8MM (1999) 2.906250 2.850962
About Last Night... (1986) 3.188679 3.140909
Absent Minded Professor, The (1961) 3.469388 3.446809
Absolute Power (1997) 3.469136 3.327759
Abyss, The (1989) 3.659236 3.689507
Ace Ventura: Pet Detective (1994) 3.000000 3.197917
Ace Ventura: When Nature Calls (1995) 2.269663 2.543333
Addams Family Values (1993) 3.000000 2.878531
Addams Family, The (1991) 3.186170 3.163498
Adventures in Babysitting (1987) 3.455782 3.208122
Adventures of Buckaroo Bonzai Across the 8th Dimension, The (1984) 3.308511 3.402321
Adventures of Priscilla, Queen of the Desert, The (1994) 3.989071 3.688811
Adventures of Robin Hood, The (1938) 4.166667 3.918367
African Queen, The (1951) 4.324232 4.223822
Age of Innocence, The (1993) 3.827068 3.339506
Agnes of God (1985) 3.534884 3.244898
... ... ...
White Men Can't Jump (1992) 3.028777 3.231061
Who Framed Roger Rabbit? (1988) 3.569378 3.713251
Who's Afraid of Virginia Woolf? (1966) 4.029703 4.096939
Whole Nine Yards, The (2000) 3.296552 3.404814
Wild Bunch, The (1969) 3.636364 4.128099
Wild Things (1998) 3.392000 3.459082
Wild Wild West (1999) 2.275449 2.131973
William Shakespeare's Romeo and Juliet (1996) 3.532609 3.318644
Willow (1988) 3.658683 3.453543
Willy Wonka and the Chocolate Factory (1971) 4.063953 3.789474
Witness (1985) 4.115854 3.941504
Wizard of Oz, The (1939) 4.355030 4.203138
Wolf (1994) 3.074074 2.899083
Women on the Verge of a Nervous Breakdown (1988) 3.934307 3.865741
Wonder Boys (2000) 4.043796 3.913649
Working Girl (1988) 3.606742 3.312500
World Is Not Enough, The (1999) 3.337500 3.388889
Wrong Trousers, The (1993) 4.588235 4.478261
Wyatt Earp (1994) 3.147059 3.283898
X-Files: Fight the Future, The (1998) 3.489474 3.493797
X-Men (2000) 3.682310 3.851702
Year of Living Dangerously (1982) 3.951220 3.869403
Yellow Submarine (1968) 3.714286 3.689286
You've Got Mail (1998) 3.542424 3.275591
Young Frankenstein (1974) 4.289963 4.239177
Young Guns (1988) 3.371795 3.425620
Young Guns II (1990) 2.934783 2.904025
Young Sherlock Holmes (1985) 3.514706 3.363344
Zero Effect (1998) 3.864407 3.723140
eXistenZ (1999) 3.098592 3.289086
 

평점 차이 구하기
  • 남녀간의 호불호가 갈리는 영화 탐색

  • 남녀의 평균평점 차이를 담을 수 있는 열을 추가함으로써 확인할 수 있음


mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']
sorted_by_diff = mean_ratings.sort_values(by='diff')
sorted_by_diff[:15]
gender F M diff
title
Dirty Dancing (1987) 3.790378 2.959596 -0.830782
Jumpin' Jack Flash (1986) 3.254717 2.578358 -0.676359
Grease (1978) 3.975265 3.367041 -0.608224
Little Women (1994) 3.870588 3.321739 -0.548849
Steel Magnolias (1989) 3.901734 3.365957 -0.535777
Anastasia (1997) 3.800000 3.281609 -0.518391
Rocky Horror Picture Show, The (1975) 3.673016 3.160131 -0.512885
Color Purple, The (1985) 4.158192 3.659341 -0.498851
Age of Innocence, The (1993) 3.827068 3.339506 -0.487561
Free Willy (1993) 2.921348 2.438776 -0.482573
French Kiss (1995) 3.535714 3.056962 -0.478752
Little Shop of Horrors, The (1960) 3.650000 3.179688 -0.470312
Guys and Dolls (1955) 4.051724 3.583333 -0.468391
Mary Poppins (1964) 4.197740 3.730594 -0.467147
Patch Adams (1998) 3.473282 3.008746 -0.464536
 

# 표준편차를 이용하여 성별에 관계없이 호불호가 극명하게 갈리는 영화 탐색
# 표준편차가 큰 영화
rating_std_by_title = data.groupby('title')['rating'].std()

# active_titles만 선택
rating_std_by_title = rating_std_by_title.ix[active_titles]
rating_std_by_title.sort_values(ascending=False)[:10]

결과 :
title
Dumb & Dumber (1994) 1.321333
Blair Witch Project, The (1999) 1.316368
Natural Born Killers (1994) 1.307198
Tank Girl (1995) 1.277695
Rocky Horror Picture Show, The (1975) 1.260177
Eyes Wide Shut (1999) 1.259624
Evita (1996) 1.253631
Billy Madison (1995) 1.249970
Fear and Loathing in Las Vegas (1998) 1.246408
Bicentennial Man (1999) 1.245533
Name: rating, dtype: float64

 

신생아 이름
  • 시대별로 특정 이름이 차지하는 비율을 구해 얼마나 흔한 이름인지 알아보기

  • 이름의 상대 순위

  • 각 연도별 가장 인기 있는 이름

  • 이름 유행 분석

  • 외부 자료를 통한 유행 분석


names1880 = pd.read_csv('../pydata-book-master/ch02/names/yob1880.txt', names=['name', 'sex', 'births'])
names1880.groupby('sex').births.sum()

years = range(1880, 2011)

pieces = []
columns = ['name', 'sex', 'births']

for year in years:
path = '../pydata-book-master/ch02/names/yob%d.txt' % year
frame = pd.read_csv(path, names=columns)

frame['year'] = year
pieces.append(frame)

names = pd.concat(pieces, ignore_index=True)

total_births = names.pivot_table('births', index='year', columns='sex', aggfunc=sum)

#전체 출생수에서 차지하는 이름의 비율

def add_prop(group):
births = group.births.astype(float)
group['prop'] = births / births.sum()
return group

names = names.groupby(['year', 'sex']).apply(add_prop)
total_births.plot(title='Total births by sex and year')

s6

# 비율값의 총합이 1인지 확인하기 위해 새너티 테스트 수행
import numpy as np
np.allclose(names.groupby(['year','sex']).prop.sum(),1)

 

# 연도별 가장 빈도수가 높은 이름 1000개 추출

def get_top1000(group):
return group.sort_values(by='births', ascending=False)[:1000]

grouped = names.groupby(['year','sex'])
top1000 = grouped.apply(get_top1000)
top1000.index = np.arange(len(top1000))

top1000.index

결과:

True
Int64Index([     0,      1,      2,      3,      4,      5,      6,      7,
8, 9,
...
261867, 261868, 261869, 261870, 261871, 261872, 261873, 261874,
261875, 261876],
dtype='int64', length=261877)

이름 유행 분석
boys = top1000[top1000.sex == 'M']
girls = top1000[top1000.sex == 'F']
total_births = top1000.pivot_table('births', index = 'year', columns = 'name', aggfunc = sum)
total_births

subset = total_births[['John', 'Harry', 'Mary', 'Marilyn']]
subset.plot(subplots=True, figsize=(12, 10), grid=False, title="Number of births per year")

# 그래프 결과로 최근에 들어 흔한 이름을 기피하는 것으로 나타남