Slice & Dice Data Analysis using Pandas

Guido Kollerie @guidok

PyGrunn May 9th, 2014

$ who am i

gkoller ttys001 May 09 14:35

• Freelance Software Developer
• Python whenever I can
• Though I've done my share of Perl, Java & C#
• Living in Amsterdam (for now)

Pandas

What is Pandas?

A data analysis library for Python that

"… provides rich data structures and functions designed to make working with structured data fast, easy, and expressive."

and

"… combines the high performance array-computing features of NumPy with the flexible data manipulation capabilities of spreadsheets and relational databases."

Python for Data Analysis - Wes McKinney

What is NumPy?

"NumPy is an extension to the Python programming language, adding support for large, multidimensional arrays and matrices, along with a large library of high-level mathematical functions to operate on these arrays."

https://en.wikipedia.org/wiki/NumPy

Installing Pandas

Installing Pandas … in 8 minutes

$ pyvenv-3.4 env
$ source env/bin/activate
$ pip -v install pandas \
      ipython[all] \
      matplotlib

Why IPython?

• IPython Notebooks
• web-based interactive computational environment
• combines code execution, text, mathematics, plots and rich media into a single document
• A must-have for Pandas development

Starting IPython

$ ipython notebook --pylab=inline

IPython Notebook

Pandas Data Structures

• Series
• DataFrame
• Panel (won't cover)

Common Imports

import pandas as pd
from pandas import Series, DataFrame

Series

• Array-like data structure
• Backed by a NumPy array

Series - Creation

s = Series(randn(5))    # randn is in scope via --pylab=inline

0    1.523850
1   -1.013846
2    0.844459
3   -0.316547
4   -0.476972
dtype: float64

Series - Creation

With a specified index

s = Series(randn(5), index=list('abcde'))

a   -0.190127
b   -1.349079
c    1.294381
d    0.045708
e    1.630447
dtype: float64

Series - Creation

Using a dictionary

s = Series(dict(one=1, two=2, three=3, four=4, five=5))

five     5
four     4
one      1
three    3
two      2
dtype: int64

Series - Selection

s['one']    # as a dictionary
1

s[0]        # as a list
5

s[:2]       # slice operation
five    5
four    4
dtype: int64

s[s > 3]    # boolean based indexing
five    5
four    4
dtype: int64
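The selection forms above can also be spelled out with explicit indexers. The following is a minimal sketch, assuming a recent pandas release in which `.loc` (by label) and `.iloc` (by position) are the unambiguous way to say which kind of lookup is meant; the `sort_index()` call is only there to reproduce the alphabetical order shown on the slide.

```python
import pandas as pd

# Same values as the slide's dictionary-based Series; sorted by label so the
# order (five, four, one, three, two) matches the slide output.
s = pd.Series({'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5}).sort_index()

by_label = s.loc['one']      # label-based lookup, like a dictionary
by_position = s.iloc[0]      # position-based lookup, like a list
first_two = s.iloc[:2]       # positional slice
large = s[s > 3]             # boolean mask keeps entries where the condition holds

print(by_label, by_position)
print(first_two)
print(large)
```

Being explicit about label versus position avoids surprises when the index itself contains integers.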

Series - Operations

s.min(), s.max(), s.mean(), s.sum()
(1, 5, 3.0, 15)

s * 2    # vector based operations
five     10
four      8
one       2
three     6
two       4
dtype: int64

Series - Operations

User defined

import string
s.apply(lambda x: string.ascii_letters[x])

five     f
four     e
one      b
three    d
two      c
dtype: object

Series - Operations

Vector based

u = Series(randn(5))
v = Series(randn(7))
u + v

0   -1.505072
1    3.716130
2   -0.294903
3   -1.323626
4   -1.517751
5         NaN
6         NaN
dtype: float64
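The NaN entries in the output above come from index alignment: the two Series are matched label by label, and a label present on only one side has no partner to add. A small sketch with hand-picked values (not the random data from the slide) makes this visible, and shows `Series.add` with `fill_value` as one way to treat the missing side as zero.

```python
import pandas as pd

u = pd.Series([1.0, 2.0, 3.0])           # labels 0..2
v = pd.Series([10.0, 20.0, 30.0, 40.0])  # labels 0..3

total = u + v                     # label 3 exists only in v, so the result there is NaN
filled = u.add(v, fill_value=0)   # substitute 0 for the missing side before adding

print(total)
print(filled)
```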





DataFrame

• 2D array-like
• Labelled rows and columns
• Heterogeneously typed

DF - Creation

df = DataFrame(dict(foo=[1,2,3,4], bar=[5.,6.,8.,9.]))

df.dtypes
bar    float64
foo      int64
dtype: object

DF - Creation

With a specified index

df1 = DataFrame(dict(foo=[1,2,3,4], bar=[5.,6.,8.,9.]),
                index=['one','two','three','four'])

DF - Creation

Using dictionaries

df2 = DataFrame(dict(u=u, v=v))    # dict of Series

DF - Creation

From a CSV file

df3 = pd.read_csv('04. Inschrijvingen wo_tcm33-32296.csv',
                  sep=';', encoding='latin-1')
df3.head()

http://data.duo.nl/organisatie/open_onderwijsdata/databestanden/ho/Ingeschreven/ingeschrevenen_wo/wo_inschrijvingen.asp

DF - Selection

df1['bar']    # column selection
df1.bar       # column selection via attribute
one      5
two      6
three    8
four     9
Name: bar, dtype: float64

df1.loc['one']    # row selection by label
df1.iloc[0]       # row selection by integer
bar    5
foo    1
Name: one, dtype: float64

df1[1:3]    # slice rows

df1[df1['bar'] > 6]    # select rows by boolean vector
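As a self-contained sketch of the selection forms above, the snippet below rebuilds `df1` and applies each one. The last line, which combines two conditions, goes slightly beyond the slides; it assumes the usual pandas convention of joining boolean vectors with `&` and wrapping each condition in parentheses.

```python
import pandas as pd

df1 = pd.DataFrame(
    {'foo': [1, 2, 3, 4], 'bar': [5.0, 6.0, 8.0, 9.0]},
    index=['one', 'two', 'three', 'four'],
)

row_by_label = df1.loc['one']      # the row labelled 'one', returned as a Series
row_by_position = df1.iloc[0]      # the same row, addressed by position
middle_rows = df1.iloc[1:3]        # rows at positions 1 and 2
big_bar = df1[df1['bar'] > 6]      # rows where the boolean vector is True

# Illustrative extension: two conditions combined element-wise.
both = df1[(df1['bar'] > 6) & (df1['foo'] < 4)]

print(big_bar)
print(both)
```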

DF - Addition/Deletion

df1['foobar'] = df1['foo'] + df1['bar']
df1['zero'] = 0    # assign a scalar value to column
del df1['foobar']

DF - Merging/Joining

# straight from the pandas documentation
left = DataFrame({'key1': ['foo', 'foo', 'bar'],
                  'key2': ['one', 'two', 'one'],
                  'lval': [1, 2, 3]})

right = DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                   'key2': ['one', 'one', 'one', 'two'],
                   'rval': [4, 5, 6, 7]})

DF - Merging/Joining left

right

DF - Merging/Joining

pd.merge(left, right, how='outer')

pd.merge(left, right, how='inner')
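Since the result tables of the two merges were shown as images, here is a runnable sketch of the same example. With no `on` argument, `pd.merge` joins on the columns the two frames share, here `key1` and `key2`. An inner merge keeps only key pairs present on both sides; an outer merge keeps every key pair and fills the missing side with NaN.

```python
import pandas as pd

left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
                     'key2': ['one', 'two', 'one'],
                     'lval': [1, 2, 3]})

right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                      'key2': ['one', 'one', 'one', 'two'],
                      'rval': [4, 5, 6, 7]})

inner = pd.merge(left, right, how='inner')   # only (key1, key2) pairs found in both frames
outer = pd.merge(left, right, how='outer')   # all pairs; unmatched sides become NaN

print(inner)
print(outer)
```

The pair ('foo', 'one') appears once on the left and twice on the right, so it contributes two rows to both results.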

Some Simple Data Analysis

Number of Students

wo = df3    # wo -> Wetenschappelijk Onderwijs (university education)
wo.rename(columns=str.lower, inplace=True)
wo.columns

Index(['provincie', 'gemeentenummer', 'gemeentenaam',
       'brin nummer actueel', 'instellingsnaam actueel',
       'croho onderdeel', 'croho subonderdeel',
       'opleidingscode actueel', 'opleidingsnaam actueel',
       'opleidingsvorm', 'opleidingsfase actueel',
       '2008 man', '2008 vrouw', '2009 man', '2009 vrouw',
       '2010 man', '2010 vrouw', '2011 man', '2011 vrouw',
       '2012 man', '2012 vrouw'], dtype=object)

Select Columns

men = [c for c in wo.columns.tolist() if c.endswith('man')]
women = [c for c in wo.columns.tolist() if c.endswith('vrouw')]

# croho -> Centraal Register Opleidingen Hoger Onderwijs
#          (Central Register of Higher Education Programmes)
uni = wo[['instellingsnaam actueel', 'croho onderdeel'] + men + women]

Slice & Dice

Group By & Sum

# math operations ignore nuisance columns (non-numeric cols)
uni_size = uni.groupby('instellingsnaam actueel').sum()
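The group-and-sum step can be tried without the original CSV. The sketch below uses a tiny made-up frame with the same column names. It passes `numeric_only=True` explicitly, on the assumption that a recent pandas release is in use, where non-numeric columns are no longer silently dropped from the sum.

```python
import pandas as pd

# Synthetic stand-in for the enrolment data; the numbers are invented.
uni = pd.DataFrame({
    'instellingsnaam actueel': ['Uni A', 'Uni A', 'Uni B'],
    'croho onderdeel': ['recht', 'natuur', 'recht'],
    '2012 man': [10, 20, 5],
    '2012 vrouw': [15, 10, 7],
})

# One row per institution, summing only the numeric columns.
uni_size = uni.groupby('instellingsnaam actueel').sum(numeric_only=True)

print(uni_size)
```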

(Men + Women) / Year

uni_size.index.name = 'Instelling'

uni_size['2008'] = uni_size['2008 man'] + uni_size['2008 vrouw']
uni_size['2009'] = uni_size['2009 man'] + uni_size['2009 vrouw']
uni_size['2010'] = uni_size['2010 man'] + uni_size['2010 vrouw']
uni_size['2011'] = uni_size['2011 man'] + uni_size['2011 vrouw']
uni_size['2012'] = uni_size['2012 man'] + uni_size['2012 vrouw']

# select only relevant columns
uni_size = uni_size[['2008','2009','2010','2011','2012']]
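The five near-identical assignments can also be written as a loop over the year labels. This is only a sketch of that alternative on invented numbers and two years; the column naming scheme ('<year> man', '<year> vrouw') is taken from the slides.

```python
import pandas as pd

years = ['2008', '2009']

# Synthetic per-institution totals; the numbers are invented.
uni_size = pd.DataFrame(
    {'2008 man': [10, 5], '2008 vrouw': [12, 4],
     '2009 man': [11, 6], '2009 vrouw': [13, 5]},
    index=pd.Index(['Uni A', 'Uni B'], name='Instelling'),
)

# Add one total column per year: men plus women.
for year in years:
    uni_size[year] = uni_size[year + ' man'] + uni_size[year + ' vrouw']

# Keep only the per-year totals.
uni_size = uni_size[years]

print(uni_size)
```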

Sort

uni_size.sort(columns='2012', ascending=False).head()

Sort

# axis=0 -> by row, axis=1 -> by column
uni_size.sort(axis=1, ascending=False)

uni_size.sort(
    axis=1, ascending=False).sort(
    columns='2012', ascending=False).plot(
    kind='barh', figsize=[10,10])
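`DataFrame.sort` as used above was removed from later pandas releases. A sketch of the same two sorts with the replacement methods, `sort_values` (order rows by a column's values) and `sort_index` (order row or column labels), on invented numbers:

```python
import pandas as pd

# Synthetic per-year totals; the numbers are invented.
uni_size = pd.DataFrame(
    {'2011': [22, 9, 30], '2012': [24, 11, 28]},
    index=pd.Index(['Uni A', 'Uni B', 'Uni C'], name='Instelling'),
)

# Rows ordered by the 2012 totals, largest first.
by_2012 = uni_size.sort_values(by='2012', ascending=False)

# Columns ordered by label, newest year first (axis=1 -> column labels).
newest_first = uni_size.sort_index(axis=1, ascending=False)

print(by_2012)
print(newest_first)
```

The two calls can be chained in the same way as on the slide before handing the result to `.plot(kind='barh')`.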

Quick One

Men/Women Diffs

opl = wo.groupby('opleidingsnaam actueel').sum()
opl = opl[['2012 man', '2012 vrouw']]
opl['diff'] = opl['2012 man'] - opl['2012 vrouw']
sorted_opl = opl.sort(columns='diff', ascending=False)

top5_max = sorted_opl[:5]
top5_min = sorted_opl[-5:]
top5 = pd.concat([top5_max, top5_min])

top5['diff'].plot(kind='barh')
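The head-and-tail pattern above (sort by the difference, take both ends, concatenate) can be checked on a few invented programmes. The sketch uses `sort_values`, assuming a recent pandas release, and takes one row from each end instead of five to keep the data small.

```python
import pandas as pd

# Synthetic programme totals; names and numbers are invented.
opl = pd.DataFrame(
    {'2012 man': [100, 20, 50, 5], '2012 vrouw': [10, 80, 50, 45]},
    index=['A', 'B', 'C', 'D'],
)

opl['diff'] = opl['2012 man'] - opl['2012 vrouw']
sorted_opl = opl.sort_values(by='diff', ascending=False)

# Largest surplus of men first, largest surplus of women last.
extremes = pd.concat([sorted_opl[:1], sorted_opl[-1:]])

print(extremes)
```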

Men/Women Diffs

CS like studies

cs_crit = wo['opleidingsnaam actueel'].\
    str.contains('Informatica')
cs = wo[cs_crit]
cs['opleidingsnaam actueel'].value_counts()

B Informatica                     12
B Technische Informatica           8
Technische Informatica             6
Informatica                        4
B Economie en Informatica          2
M Informatica                      2
M Lerarenopleiding Informatica     1
dtype: int64
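The filter above works in two steps: `str.contains` produces a boolean vector, which is then used to select rows, and `value_counts` tallies the remaining programme names. A small sketch on invented rows:

```python
import pandas as pd

# Synthetic programme names; the rows are invented.
wo = pd.DataFrame({
    'opleidingsnaam actueel': [
        'B Informatica', 'B Rechtsgeleerdheid', 'M Informatica', 'B Informatica',
    ],
})

cs_crit = wo['opleidingsnaam actueel'].str.contains('Informatica')  # boolean vector
cs = wo[cs_crit]                                                    # matching rows only
counts = cs['opleidingsnaam actueel'].value_counts()                # occurrences per name

print(counts)
```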

Pivot Tables

pv = pd.pivot_table(wo,
                    values=['2012 man', '2012 vrouw'],
                    rows=['instellingsnaam actueel'],
                    cols=['croho onderdeel'],
                    fill_value=0,
                    aggfunc=np.sum)
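The `rows` and `cols` keywords above belong to the pandas version used in the talk; later releases name them `index` and `columns`. The following sketch runs the same pivot with the newer keywords on a small invented frame, with `aggfunc='sum'` standing in for `np.sum`.

```python
import pandas as pd

# Synthetic stand-in for the enrolment data; the numbers are invented.
wo = pd.DataFrame({
    'instellingsnaam actueel': ['Uni A', 'Uni A', 'Uni B'],
    'croho onderdeel': ['recht', 'natuur', 'recht'],
    '2012 man': [10, 20, 5],
    '2012 vrouw': [15, 10, 7],
})

pv = pd.pivot_table(
    wo,
    values=['2012 man', '2012 vrouw'],
    index=['instellingsnaam actueel'],   # one row per institution
    columns=['croho onderdeel'],         # one column group per sector
    fill_value=0,                        # institutions lacking a sector get 0
    aggfunc='sum',
)

print(pv)
```

The resulting columns form a two-level index: the value name on the outer level and the sector on the inner level.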

Pivot Tables

Multi-level index/columns

Stack

pv.stack(1)

Unstack

pv.loc['Universiteit van Amsterdam']

Multi-level index

            croho onderdeel
2012 man    economie                            2650
            gedrag en maatschappij              2727
            gezondheidszorg                     1248
            landbouw en natuurlijke omgeving       0
            natuur                              2152
            onderwijs                            125
            recht                               1573
            sectoroverstijgend                   365
            taal en cultuur                     2795
2012 vrouw  economie                            1448
            gedrag en maatschappij              5703
            gezondheidszorg                     2077
            landbouw en natuurlijke omgeving       0
            natuur                              1344
            onderwijs                            124
            recht                               2305
            sectoroverstijgend                   497
            taal en cultuur                     4725
Name: Universiteit van Amsterdam, dtype: int64

Unstack

uva = pv.loc['Universiteit van Amsterdam'].unstack()

Transpose

uva.T
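Stack, unstack and transpose are easier to follow on a table small enough to read at a glance. The sketch below builds a tiny frame with two-level columns by hand (invented numbers, shaped like the pivot result) and applies the three operations in turn.

```python
import pandas as pd

# Two-level columns: value name on the outer level, sector on the inner level.
columns = pd.MultiIndex.from_product(
    [['2012 man', '2012 vrouw'], ['natuur', 'recht']],
    names=[None, 'croho onderdeel'],
)
pv = pd.DataFrame(
    [[20, 10, 10, 15], [0, 5, 0, 7]],   # invented numbers
    index=['Uni A', 'Uni B'],
    columns=columns,
)

stacked = pv.stack(1)    # move column level 1 (the sector) into the row index
row = pv.loc['Uni A']    # a single row: a Series with a two-level index
uva = row.unstack()      # the inner index level becomes columns again
transposed = uva.T       # swap rows and columns

print(stacked)
print(uva)
print(transposed)
```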

Lots more

• Reordering levels within multi-level indices
• Time Series (with up- & downsampling)
• Linear Regression (via statsmodels)

Want to learn more?
