02.04 - PANDAS¶

!wget --no-cache -O init.py -q https://raw.githubusercontent.com/rramosp/ai4eng.v1/main/content/init.py
import init; init.init(force_download=False); init.get_weblink()

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

`pandas` is mostly about manipulating tables of data¶

see this cheat sheet: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf

Pandas main object is a `DataFrame`¶

can read .csv, .excel, etc.

!head local/data/internet_facebook.dat

# Pais,Uso_Internet,Uso_Facebook
Argentina,49.40,30.53
Australia,80.60,46.01
Belgium,67.30,36.98
Brazil,37.76,4.39
Canada,72.30,52.08
Chile,50.90,46.14
China,22.40,0.05
Colombia,38.80,25.90
Egypt,12.90,5.68

!wc local/data/weather_data_austin_2010.csv

  8760  17519 254046 local/data/weather_data_austin_2010.csv

df = pd.read_csv('local/data/internet_facebook.dat', index_col='# Pais')
df

	Uso_Internet	Uso_Facebook
# Pais
Argentina	49.40	30.53
Australia	80.60	46.01
Belgium	67.30	36.98
Brazil	37.76	4.39
Canada	72.30	52.08
Chile	50.90	46.14
China	22.40	0.05
Colombia	38.80	25.90
Egypt	12.90	5.68
France	65.70	32.91
Germany	67.00	14.07
Hong_Kong	69.50	52.33
India	7.10	1.52
Indonesia	10.50	13.49
Italy	48.80	30.62
Japan	73.80	2.00
Malaysia	62.80	37.77
Mexico	24.90	16.80
Netherlands	82.90	20.54
Peru	26.20	13.34
Philippines	21.50	19.68
Poland	52.00	11.79
Russia	27.00	2.99
Saudi_Arabia	22.70	11.65
South_Africa	10.50	7.83
Spain	66.80	30.24
Sweden	80.70	44.72
Taiwan	66.10	38.21
Thailand	20.50	10.29
Turkey	35.00	31.91
USA	77.33	46.98
UK	70.18	45.97
Venezuela	25.50	28.64

df.head()

	Uso_Internet	Uso_Facebook
# Pais
Argentina	49.40	30.53
Australia	80.60	46.01
Belgium	67.30	36.98
Brazil	37.76	4.39
Canada	72.30	52.08

df.tail()

	Uso_Internet	Uso_Facebook
# Pais
Thailand	20.50	10.29
Turkey	35.00	31.91
USA	77.33	46.98
UK	70.18	45.97
Venezuela	25.50	28.64

df.columns

Index(['Uso_Internet', 'Uso_Facebook'], dtype='object')

df.index

Index(['Argentina', 'Australia', 'Belgium', 'Brazil', 'Canada', 'Chile',
       'China', 'Colombia', 'Egypt', 'France', 'Germany', 'Hong_Kong', 'India',
       'Indonesia', 'Italy', 'Japan', 'Malaysia', 'Mexico', 'Netherlands',
       'Peru', 'Philippines', 'Poland', 'Russia', 'Saudi_Arabia',
       'South_Africa', 'Spain', 'Sweden', 'Taiwan', 'Thailand', 'Turkey',
       'USA', 'UK', 'Venezuela'],
      dtype='object', name='# Pais')

fix the index name

df.index.name="Pais"
df.head()

	Uso_Internet	Uso_Facebook
Pais
Argentina	49.40	30.53
Australia	80.60	46.01
Belgium	67.30	36.98
Brazil	37.76	4.39
Canada	72.30	52.08

df.describe()

	Uso_Internet	Uso_Facebook
count	33.000000	33.000000
mean	46.890000	24.668182
std	24.456421	16.511662
min	7.100000	0.050000
25%	24.900000	11.650000
50%	49.400000	25.900000
75%	67.300000	37.770000
max	82.900000	52.330000

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 33 entries, Argentina to Venezuela
Data columns (total 2 columns):
Uso_Internet    33 non-null float64
Uso_Facebook    33 non-null float64
dtypes: float64(2)
memory usage: 792.0+ bytes

a dataframe is made of Series. Observe that each series has its own type

s1 = df["Uso_Internet"]
type(s1)

pandas.core.series.Series

s1

Pais
Argentina       49.40
Australia       80.60
Belgium         67.30
Brazil          37.76
Canada          72.30
Chile           50.90
China           22.40
Colombia        38.80
Egypt           12.90
France          65.70
Germany         67.00
Hong_Kong       69.50
India            7.10
Indonesia       10.50
Italy           48.80
Japan           73.80
Malaysia        62.80
Mexico          24.90
Netherlands     82.90
Peru            26.20
Philippines     21.50
Poland          52.00
Russia          27.00
Saudi_Arabia    22.70
South_Africa    10.50
Spain           66.80
Sweden          80.70
Taiwan          66.10
Thailand        20.50
Turkey          35.00
USA             77.33
UK              70.18
Venezuela       25.50
Name: Uso_Internet, dtype: float64

if the column name is not too fancy (empy spaces, accents, etc.) we can use columns names as python syntax.

df.Uso_Facebook

Pais
Argentina       30.53
Australia       46.01
Belgium         36.98
Brazil           4.39
Canada          52.08
Chile           46.14
China            0.05
Colombia        25.90
Egypt            5.68
France          32.91
Germany         14.07
Hong_Kong       52.33
India            1.52
Indonesia       13.49
Italy           30.62
Japan            2.00
Malaysia        37.77
Mexico          16.80
Netherlands     20.54
Peru            13.34
Philippines     19.68
Poland          11.79
Russia           2.99
Saudi_Arabia    11.65
South_Africa     7.83
Spain           30.24
Sweden          44.72
Taiwan          38.21
Thailand        10.29
Turkey          31.91
USA             46.98
UK              45.97
Venezuela       28.64
Name: Uso_Facebook, dtype: float64

DataFrame indexing¶

is NOT exactly like numpy

first index
- if string refers to columns
- if Series of booleans is used as a filter
for selecting columns:
- use .loc to select by Index
- use .iloc to select by position

df["Colombia"]

---------------------------------------------------------------------
KeyError                            Traceback (most recent call last)
/opt/anaconda3/envs/p37/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2896             try:
-> 2897                 return self._engine.get_loc(key)
   2898             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'Colombia'

During handling of the above exception, another exception occurred:

KeyError                            Traceback (most recent call last)
<ipython-input-251-5f701fabe22b> in <module>
----> 1 df["Colombia"]

/opt/anaconda3/envs/p37/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2993             if self.columns.nlevels > 1:
   2994                 return self._getitem_multilevel(key)
-> 2995             indexer = self.columns.get_loc(key)
   2996             if is_integer(indexer):
   2997                 indexer = [indexer]

/opt/anaconda3/envs/p37/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2897                 return self._engine.get_loc(key)
   2898             except KeyError:
-> 2899                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2900         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2901         if indexer.ndim > 1 or indexer.size > 1:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'Colombia'

df.loc["Colombia"]

Uso_Internet    38.8
Uso_Facebook    25.9
Name: Colombia, dtype: float64

Index semantics is exact!!

df.loc["Colombia":"Spain"]

	Uso_Internet	Uso_Facebook
Pais
Colombia	38.8	25.90
Egypt	12.9	5.68
France	65.7	32.91
Germany	67.0	14.07
Hong_Kong	69.5	52.33
India	7.1	1.52
Indonesia	10.5	13.49
Italy	48.8	30.62
Japan	73.8	2.00
Malaysia	62.8	37.77
Mexico	24.9	16.80
Netherlands	82.9	20.54
Peru	26.2	13.34
Philippines	21.5	19.68
Poland	52.0	11.79
Russia	27.0	2.99
Saudi_Arabia	22.7	11.65
South_Africa	10.5	7.83
Spain	66.8	30.24

df.iloc[10:15]

	Uso_Internet	Uso_Facebook
Pais
Germany	67.0	14.07
Hong_Kong	69.5	52.33
India	7.1	1.52
Indonesia	10.5	13.49
Italy	48.8	30.62

filtering

df[df.Uso_Internet>80]

	Uso_Internet	Uso_Facebook
Pais
Australia	80.6	46.01
Netherlands	82.9	20.54
Sweden	80.7	44.72

combined conditions

df[(df.Uso_Internet>50)&(df.Uso_Facebook>50)]

	Uso_Internet	Uso_Facebook
Pais
Canada	72.3	52.08
Hong_Kong	69.5	52.33

df[(df.Uso_Internet>50)|(df.Uso_Facebook>50)]

	Uso_Internet	Uso_Facebook
Pais
Australia	80.60	46.01
Belgium	67.30	36.98
Canada	72.30	52.08
Chile	50.90	46.14
France	65.70	32.91
Germany	67.00	14.07
Hong_Kong	69.50	52.33
Japan	73.80	2.00
Malaysia	62.80	37.77
Netherlands	82.90	20.54
Poland	52.00	11.79
Spain	66.80	30.24
Sweden	80.70	44.72
Taiwan	66.10	38.21
USA	77.33	46.98
UK	70.18	45.97

Managing data¶

observe csv structure:

missing column name
missing data

!head local/data/comptagevelo2009.csv

Date,,Berri1,Maisonneuve_1,Maisonneuve_2,Brébeuf
01/01/2009,00:00,29,20,35,
02/01/2009,00:00,19,3,22,
03/01/2009,00:00,24,12,22,
04/01/2009,00:00,24,8,15,
05/01/2009,00:00,120,111,141,
06/01/2009,00:00,261,146,236,
07/01/2009,00:00,60,33,80,
08/01/2009,00:00,24,14,14,
09/01/2009,00:00,35,20,32,

d = pd.read_csv("local/data/comptagevelo2009.csv")
d

	Date	Unnamed: 1	Berri1	Maisonneuve_1	Maisonneuve_2	Brébeuf
0	01/01/2009	00:00	29	20	35	NaN
1	02/01/2009	00:00	19	3	22	NaN
2	03/01/2009	00:00	24	12	22	NaN
3	04/01/2009	00:00	24	8	15	NaN
4	05/01/2009	00:00	120	111	141	NaN
...	...	...	...	...	...	...
360	27/12/2009	00:00	66	29	52	0.0
361	28/12/2009	00:00	61	41	99	0.0
362	29/12/2009	00:00	89	52	115	0.0
363	30/12/2009	00:00	76	43	115	0.0
364	31/12/2009	00:00	53	46	112	0.0

365 rows × 6 columns

d.columns, d.shape

(Index(['Date', 'Unnamed: 1', 'Berri1', 'Maisonneuve_1', 'Maisonneuve_2',
        'Brébeuf'],
       dtype='object'), (365, 6))

numerical features

d.describe()

	Berri1	Maisonneuve_1	Maisonneuve_2	Brébeuf
count	365.000000	365.000000	365.000000	178.000000
mean	2032.200000	1060.252055	2093.169863	2576.359551
std	1878.879799	1079.533086	1854.368523	2484.004743
min	0.000000	0.000000	0.000000	0.000000
25%	194.000000	90.000000	228.000000	0.000000
50%	1726.000000	678.000000	1686.000000	1443.500000
75%	3540.000000	1882.000000	3520.000000	4638.000000
max	6626.000000	4242.000000	6587.000000	7575.000000

d["Berri1"].head()

   29
   19
   24
   24
  120
Name: Berri1, dtype: int64

d["Unnamed: 1"].unique()

array(['00:00'], dtype=object)

d["Berri1"].unique()

array([  29,   19,   24,  120,  261,   60,   35,   81,  318,  105,  168,
        145,  131,   93,   25,   52,  136,  147,  109,  172,  148,   15,
        209,   92,  110,   14,  158,  179,  122,   95,  185,   82,  190,
        228,  306,  188,   98,  139,  258,  304,  326,  134,  125,   96,
         65,  123,  129,  154,  239,  198,   32,   67,  157,  164,  300,
        176,  195,  310,    7,  366,  234,  132,  203,  298,  541,  525,
        871,  592,  455,  446,  441,  266,  189,  343,  292,  355,  245,
          0,  445, 1286, 1178, 2131, 2709,  752, 1886, 2069, 3132, 3668,
       1368, 4051, 2286, 3519, 3520, 1925, 2125, 2662, 4403, 4338, 2757,
        970, 2767, 1493,  728, 3982, 4742, 5278, 2344, 4094,  784, 1048,
       2442, 3686, 3042, 5728, 3815, 3540, 4775, 4434, 4363, 2075, 2338,
       1387, 2063, 2031, 3274, 4325, 5430, 6028, 3876, 2742, 4973, 1125,
       3460, 4449, 3576, 4027, 4313, 3182, 5668, 6320, 2397, 2857, 2590,
       3234, 5138, 5799, 4911, 4333, 3680, 1536, 3064, 1004, 4709, 4471,
       4432, 2997, 2544, 5121, 3862, 3036, 3744, 6626, 6274, 1876, 4393,
       3471, 3537, 6100, 3489, 4859, 2991, 3588, 5607, 5754, 3440, 5124,
       4054, 4372, 1801, 4088, 5891, 3754, 5267, 3146,   63,   77, 5904,
       4417, 5611, 4197, 4265, 4589, 2775, 2999, 3504, 5538, 5386, 3916,
       3307, 4382, 5327, 3796, 2832, 3492, 2888, 4120, 5450, 4722, 4707,
       4439, 2277, 4572, 5298, 5451, 5372, 4566, 3533, 3888, 3683, 5452,
       5575, 5496, 4864, 3985, 2695, 4196, 5169, 4891, 4915, 2435, 2674,
       2855, 4787, 2620, 2878, 4820, 3774, 2603,  725, 1941, 2272, 3003,
       2643, 2865,  993, 1336, 2935, 3852, 2115, 3336, 1302, 1407, 1090,
       1171, 1671, 2456, 2383, 1130, 1241, 2570, 2605, 2904, 1322, 1792,
        542, 1124, 2119, 2072, 1996, 2130, 1835,  473, 1141, 2293, 1655,
       1974, 1767, 1735,  872, 1541, 2540, 2526, 2366, 2224, 2007,  493,
        852, 1881, 2052, 1921, 1935, 1065, 1173,  743, 1579, 1574, 1726,
       1027,  810,  671,  747, 1092, 1377,  606, 1108,  594,  501,  669,
        570,  219,  194,  106,  130,  271,  308,  296,  214,  133,  135,
        207,   74,   34,   40,   66,   61,   89,   76,   53])

d["Berri1"].dtype, d["Date"].dtype, d["Unnamed: 1"].dtype

(dtype('int64'), dtype('O'), dtype('O'))

d.index

RangeIndex(start=0, stop=365, step=1)

Fixing data¶

observe we set one column as the index one, and we convert it to date object type

d.Date

    01/01/2009
    02/01/2009
    03/01/2009
    04/01/2009
    05/01/2009
          ...    
  27/12/2009
  28/12/2009
  29/12/2009
  30/12/2009
  31/12/2009
Name: Date, Length: 365, dtype: object

d.index = pd.to_datetime(d.Date)
del(d["Date"])
del(d["Unnamed: 1"])
d.head()

	Berri1	Maisonneuve_1	Maisonneuve_2	Brébeuf
Date
2009-01-01	29	20	35	NaN
2009-02-01	19	3	22	NaN
2009-03-01	24	12	22	NaN
2009-04-01	24	8	15	NaN
2009-05-01	120	111	141	NaN

d.index

DatetimeIndex(['2009-01-01', '2009-02-01', '2009-03-01', '2009-04-01',
               '2009-05-01', '2009-06-01', '2009-07-01', '2009-08-01',
               '2009-09-01', '2009-10-01',
               ...
               '2009-12-22', '2009-12-23', '2009-12-24', '2009-12-25',
               '2009-12-26', '2009-12-27', '2009-12-28', '2009-12-29',
               '2009-12-30', '2009-12-31'],
              dtype='datetime64[ns]', name='Date', length=365, freq=None)

let’s fix columns names

d.columns=["Berri", "Mneuve1", "Mneuve2", "Brebeuf"]
d.head()

	Berri	Mneuve1	Mneuve2	Brebeuf
Date
2009-01-01	29	20	35	NaN
2009-02-01	19	3	22	NaN
2009-03-01	24	12	22	NaN
2009-04-01	24	8	15	NaN
2009-05-01	120	111	141	NaN

for col in d.columns:
    print (col, np.sum(pd.isnull(d[col])))

Berri 0
Mneuve1 0
Mneuve2 0
Brebeuf 187

d.shape

(365, 4)

d['Brebeuf'].describe()

count     178.000000
mean     2576.359551
std      2484.004743
min         0.000000
25%         0.000000
50%      1443.500000
75%      4638.000000
max      7575.000000
Name: Brebeuf, dtype: float64

plt.hist(d.Brebeuf, bins=30);

../_images/NOTES 02.04 - PANDAS_57_0.png

fix missing!!!

d.Brebeuf.fillna(d.Brebeuf.mean(), inplace=True)

d['Brebeuf'].describe()

count     365.000000
mean     2576.359551
std      1732.161423
min         0.000000
25%      1588.000000
50%      2576.359551
75%      2576.359551
max      7575.000000
Name: Brebeuf, dtype: float64

plt.hist(d.Brebeuf, bins=30);

../_images/NOTES 02.04 - PANDAS_61_0.png

	Berri	Mneuve1	Mneuve2	Brebeuf
Date
2009-01-01	29	20	35	2576.359551
2009-02-01	19	3	22	2576.359551
2009-03-01	24	12	22	2576.359551
2009-04-01	24	8	15	2576.359551
2009-05-01	120	111	141	2576.359551
...	...	...	...	...
2009-12-27	66	29	52	0.000000
2009-12-28	61	41	99	0.000000
2009-12-29	89	52	115	0.000000
2009-12-30	76	43	115	0.000000
2009-12-31	53	46	112	0.000000

365 rows × 4 columns

let’s make sure it is sorted

d.sort_index(inplace=True)
d.head()

	Berri	Mneuve1	Mneuve2	Brebeuf
Date
2009-01-01	29	20	35	2576.359551
2009-01-02	14	2	2	2576.359551
2009-01-03	67	30	80	2576.359551
2009-01-04	0	0	0	2576.359551
2009-01-05	1925	1256	1501	2576.359551

Filtering¶

d[d.Berri>6000]

	Berri	Mneuve1	Mneuve2	Brebeuf
Date
2009-05-06	6028	4120	4223	2576.359551
2009-06-17	6320	3388	6047	2576.359551
2009-07-15	6100	3767	5536	6939.000000
2009-09-07	6626	4227	5751	7575.000000
2009-10-07	6274	4242	5435	7268.000000

d[(d.Berri>6000) & (d.Brebeuf<7000)]

	Berri	Mneuve1	Mneuve2	Brebeuf
Date
2009-05-06	6028	4120	4223	2576.359551
2009-06-17	6320	3388	6047	2576.359551
2009-07-15	6100	3767	5536	6939.000000

Locating¶

d[d.Berri>5500].sort_index(axis=0)

	Berri	Mneuve1	Mneuve2	Brebeuf
Date
2009-03-08	5904	3102	4853	7194.000000
2009-05-06	6028	4120	4223	2576.359551
2009-05-08	5611	2646	5201	7121.000000
2009-05-21	5728	3693	5397	2576.359551
2009-06-16	5668	3499	5609	2576.359551
2009-06-17	6320	3388	6047	2576.359551
2009-06-23	5799	3114	5386	2576.359551
2009-07-15	6100	3767	5536	6939.000000
2009-07-20	5607	3825	5092	7064.000000
2009-07-21	5754	3745	5357	6996.000000
2009-07-28	5891	3292	5437	7219.000000
2009-09-07	6626	4227	5751	7575.000000
2009-09-09	5575	2727	6535	6686.000000
2009-10-07	6274	4242	5435	7268.000000
2009-12-08	5538	2368	5107	7127.000000

d.iloc[100:110]

	Berri	Mneuve1	Mneuve2	Brebeuf
Date
2009-04-11	1974	1113	2693	1046.000000
2009-04-12	1108	595	1472	0.000000
2009-04-13	0	0	0	2576.359551
2009-04-14	0	0	0	2576.359551
2009-04-15	0	0	0	2576.359551
2009-04-16	0	0	0	2576.359551
2009-04-17	1286	820	1436	2576.359551
2009-04-18	1178	667	826	2576.359551
2009-04-19	2131	1155	1426	2576.359551
2009-04-20	2709	1697	2646	2576.359551

dates as INDEX have special semantics

d.loc["2009-10-01":"2009-10-10"]

	Berri	Mneuve1	Mneuve2	Brebeuf
Date
2009-10-01	81	45	79	2576.359551
2009-10-02	228	101	260	2576.359551
2009-10-03	366	203	354	2576.359551
2009-10-04	0	0	0	2576.359551
2009-10-05	728	362	523	2576.359551
2009-10-06	3460	2354	3978	2576.359551
2009-10-07	6274	4242	5435	7268.000000
2009-10-08	2999	1545	3185	4187.000000
2009-10-09	5496	2921	6587	6520.000000
2009-10-10	1407	725	1443	1003.000000

can do sorting across any criteria

d.sort_values(by="Berri").head()

	Berri	Mneuve1	Mneuve2	Brebeuf
Date
2009-07-04	0	0	0	2576.359551
2009-03-30	0	0	0	2576.359551
2009-04-04	0	0	0	2576.359551
2009-04-13	0	0	0	2576.359551
2009-04-14	0	0	0	2576.359551

and chain operations

d.sort_values(by="Berri").loc["2009-10-01":"2009-10-10"]

	Berri	Mneuve1	Mneuve2	Brebeuf
Date
2009-10-04	0	0	0	2576.359551
2009-10-01	81	45	79	2576.359551
2009-10-02	228	101	260	2576.359551
2009-10-03	366	203	354	2576.359551
2009-10-05	728	362	523	2576.359551
2009-10-10	1407	725	1443	1003.000000
2009-10-08	2999	1545	3185	4187.000000
2009-10-06	3460	2354	3978	2576.359551
2009-10-09	5496	2921	6587	6520.000000
2009-10-07	6274	4242	5435	7268.000000

Time series operations¶

d.rolling(3).mean().head(10)

	Berri	Mneuve1	Mneuve2	Brebeuf
Date
2009-01-01	NaN	NaN	NaN	NaN
2009-01-02	NaN	NaN	NaN	NaN
2009-01-03	36.666667	17.333333	39.000000	2576.359551
2009-01-04	27.000000	10.666667	27.333333	2576.359551
2009-01-05	664.000000	428.666667	527.000000	2576.359551
2009-01-06	1733.000000	1116.333333	1409.000000	2576.359551
2009-01-07	3223.333333	1862.666667	2426.000000	2576.359551
2009-01-08	2602.666667	1795.000000	3042.666667	3673.906367
2009-01-09	3277.333333	2029.333333	4055.666667	5128.119850
2009-01-10	2668.000000	1290.333333	4214.666667	4798.666667

d.index = d.index + pd.Timedelta("5m")
d.head()

	Berri	Mneuve1	Mneuve2	Brebeuf
Date
2009-01-01 00:05:00	29	20	35	2576.359551
2009-01-02 00:05:00	14	2	2	2576.359551
2009-01-03 00:05:00	67	30	80	2576.359551
2009-01-04 00:05:00	0	0	0	2576.359551
2009-01-05 00:05:00	1925	1256	1501	2576.359551

d.shift(freq=pd.Timedelta(days=365)).head()

	Berri	Mneuve1	Mneuve2	Brebeuf
Date
2010-01-01 00:05:00	29	20	35	2576.359551
2010-01-02 00:05:00	14	2	2	2576.359551
2010-01-03 00:05:00	67	30	80	2576.359551
2010-01-04 00:05:00	0	0	0	2576.359551
2010-01-05 00:05:00	1925	1256	1501	2576.359551

Downsampling¶

d.resample(pd.Timedelta("2d")).first().head()

	Berri	Mneuve1	Mneuve2	Brebeuf
Date
2009-01-01	29	20	35	2576.359551
2009-01-03	67	30	80	2576.359551
2009-01-05	1925	1256	1501	2576.359551
2009-01-07	4471	2239	3051	2576.359551
2009-01-09	5298	2796	5765	6939.000000

d.resample(pd.Timedelta("2d")).mean().head()

	Berri	Mneuve1	Mneuve2	Brebeuf
Date
2009-01-01	21.5	11.0	18.5	2576.359551
2009-01-03	33.5	15.0	40.0	2576.359551
2009-01-05	2599.5	1674.5	2113.5	2576.359551
2009-01-07	2267.0	1646.0	3201.0	4222.679775
2009-01-09	3970.5	1409.0	4646.5	4263.500000

Upsampling¶

d.resample(pd.Timedelta("12h")).first().head()

	Berri	Mneuve1	Mneuve2	Brebeuf
Date
2009-01-01 00:00:00	29.0	20.0	35.0	2576.359551
2009-01-01 12:00:00	NaN	NaN	NaN	NaN
2009-01-02 00:00:00	14.0	2.0	2.0	2576.359551
2009-01-02 12:00:00	NaN	NaN	NaN	NaN
2009-01-03 00:00:00	67.0	30.0	80.0	2576.359551

d.resample(pd.Timedelta("12h")).fillna(method="pad").head()

	Berri	Mneuve1	Mneuve2	Brebeuf
Date
2009-01-01 00:00:00	NaN	NaN	NaN	NaN
2009-01-01 12:00:00	29.0	20.0	35.0	2576.359551
2009-01-02 00:00:00	29.0	20.0	35.0	2576.359551
2009-01-02 12:00:00	14.0	2.0	2.0	2576.359551
2009-01-03 00:00:00	14.0	2.0	2.0	2576.359551

Building Dataframes from other structures¶

a = np.random.randint(10,size=(20,5))
a

array([[6, 4, 8, 2, 0],
       [8, 9, 7, 4, 9],
       [3, 3, 5, 0, 2],
       [8, 9, 3, 9, 1],
       [6, 9, 4, 5, 9],
       [8, 6, 1, 0, 1],
       [9, 6, 1, 7, 2],
       [0, 7, 4, 6, 1],
       [2, 2, 5, 3, 7],
       [4, 9, 3, 3, 9],
       [7, 3, 5, 6, 8],
       [6, 7, 6, 2, 8],
       [1, 2, 5, 2, 8],
       [6, 7, 6, 9, 4],
       [9, 4, 5, 5, 1],
       [4, 4, 5, 9, 8],
       [4, 4, 1, 1, 2],
       [2, 8, 9, 7, 7],
       [3, 4, 3, 0, 5],
       [5, 8, 5, 6, 2]])

k = pd.DataFrame(a, columns=["uno", "dos", "tres", "cuatro", "cinco"], index=range(10,10+len(a)))
k

	uno	dos	tres	cuatro	cinco
10	6	4	8	2	0
11	8	9	7	4	9
12	3	3	5	0	2
13	8	9	3	9	1
14	6	9	4	5	9
15	8	6	1	0	1
16	9	6	1	7	2
17	0	7	4	6	1
18	2	2	5	3	7
19	4	9	3	3	9
20	7	3	5	6	8
21	6	7	6	2	8
22	1	2	5	2	8
23	6	7	6	9	4
24	9	4	5	5	1
25	4	4	5	9	8
26	4	4	1	1	2
27	2	8	9	7	7
28	3	4	3	0	5
29	5	8	5	6	2

`.values` access the underlying `numpy` structure¶

d.values

array([[2.90000000e+01, 2.00000000e+01, 3.50000000e+01, 2.57635955e+03],
       [1.40000000e+01, 2.00000000e+00, 2.00000000e+00, 2.57635955e+03],
       [6.70000000e+01, 3.00000000e+01, 8.00000000e+01, 2.57635955e+03],
       ...,
       [8.90000000e+01, 5.20000000e+01, 1.15000000e+02, 0.00000000e+00],
       [7.60000000e+01, 4.30000000e+01, 1.15000000e+02, 0.00000000e+00],
       [5.30000000e+01, 4.60000000e+01, 1.12000000e+02, 0.00000000e+00]])

some out-of-the-box plotting¶

but recall that we always can do custom plotting

d.plot(figsize=(15,3))

<matplotlib.axes._subplots.AxesSubplot at 0x7fb9f2a9ac10>

../_images/NOTES 02.04 - PANDAS_93_1.png

plt.figure(figsize=(15,3))
plt.plot(d.Berri)

[<matplotlib.lines.Line2D at 0x7fb9f29f2250>]

../_images/NOTES 02.04 - PANDAS_94_1.png

d.Berri.cumsum().plot()

<matplotlib.axes._subplots.AxesSubplot at 0x7fb9f294c450>

../_images/NOTES 02.04 - PANDAS_95_1.png

plt.scatter(d.Berri, d.Brebeuf)

<matplotlib.collections.PathCollection at 0x7fb9f360fb50>

../_images/NOTES 02.04 - PANDAS_96_1.png

pd.plotting.scatter_matrix(d, figsize=(10,10));

../_images/NOTES 02.04 - PANDAS_97_0.png

Grouping¶

d["month"] = [i.month for i in d.index]
d.head()

	Berri	Mneuve1	Mneuve2	Brebeuf	month
Date
2009-01-01 00:05:00	29	20	35	2576.359551	1
2009-01-02 00:05:00	14	2	2	2576.359551	1
2009-01-03 00:05:00	67	30	80	2576.359551	1
2009-01-04 00:05:00	0	0	0	2576.359551	1
2009-01-05 00:05:00	1925	1256	1501	2576.359551	1

d.groupby("month").max()

	Berri	Mneuve1	Mneuve2	Brebeuf
month
1	5298	2796	5765	6939.0
2	5451	2868	5517	7052.0
3	5904	3523	5762	7194.0
4	5278	3499	5327	5837.0
5	6028	4120	5397	7121.0
6	6320	3499	6047	5259.0
7	6100	3825	5536	7219.0
8	5452	2865	6379	7044.0
9	6626	4227	6535	7575.0
10	6274	4242	6587	7268.0
11	4864	2648	5895	6044.0
12	5538	2983	5107	7127.0

d.groupby("month").count()

	Berri	Mneuve1	Mneuve2	Brebeuf
month
1	31	31	31	31
2	28	28	28	28
3	31	31	31	31
4	30	30	30	30
5	31	31	31	31
6	30	30	30	30
7	31	31	31	31
8	31	31	31	31
9	30	30	30	30
10	31	31	31	31
11	30	30	30	30
12	31	31	31	31

Time series¶

observe we can establish at load time many thing if the dataset is relatively clean

tiempo=pd.read_csv('local/data/weather_data_austin_2010.csv',parse_dates=['Date'], dayfirst=True ,index_col='Date')
tiempo

	Temperature	DewPoint	Pressure
Date
2010-01-01 00:00:00	46.2	37.5	1.0
2010-01-01 01:00:00	44.6	37.1	1.0
2010-01-01 02:00:00	44.1	36.9	1.0
2010-01-01 03:00:00	43.8	36.9	1.0
2010-01-01 04:00:00	43.5	36.8	1.0
...	...	...	...
2010-12-31 19:00:00	51.1	38.1	1.0
2010-12-31 20:00:00	49.0	37.9	1.0
2010-12-31 21:00:00	47.9	37.9	1.0
2010-12-31 22:00:00	46.9	37.9	1.0
2010-12-31 23:00:00	46.2	37.7	1.0

8759 rows × 3 columns

tiempo.loc['2010-08-01':'2010-10-30']

	Temperature	DewPoint	Pressure
Date
2010-08-01 00:00:00	79.0	70.8	1.0
2010-08-01 01:00:00	77.4	71.2	1.0
2010-08-01 02:00:00	76.4	71.3	1.0
2010-08-01 03:00:00	75.7	71.4	1.0
2010-08-01 04:00:00	75.1	71.4	1.0
...	...	...	...
2010-10-30 19:00:00	65.4	53.6	1.0
2010-10-30 20:00:00	63.6	53.9	1.0
2010-10-30 21:00:00	62.2	53.8	1.0
2010-10-30 22:00:00	61.4	54.0	1.0
2010-10-30 23:00:00	60.3	53.9	1.0

2184 rows × 3 columns

tiempo.loc['2010-06'].head()

	Temperature	DewPoint	Pressure
Date
2010-06-01 00:00:00	74.0	67.9	1.0
2010-06-01 01:00:00	72.6	68.0	1.0
2010-06-01 02:00:00	72.0	67.9	1.0
2010-06-01 03:00:00	71.6	67.9	1.0
2010-06-01 04:00:00	71.1	67.7	1.0

tiempo.sample(10)

	Temperature	DewPoint	Pressure
Date
2010-12-24 20:00:00	48.9	38.6	1.0
2010-10-12 01:00:00	64.7	59.3	1.0
2010-02-21 03:00:00	49.3	42.8	1.0
2010-07-20 16:00:00	93.5	67.1	1.0
2010-11-04 12:00:00	70.4	53.0	1.0
2010-10-29 15:00:00	75.4	53.0	1.0
2010-07-18 17:00:00	92.5	67.3	1.0
2010-04-16 13:00:00	76.4	56.7	1.0
2010-07-05 06:00:00	74.4	71.2	1.0
2010-07-03 16:00:00	91.4	68.6	1.0

tiempo.sample(frac=0.01)

	Temperature	DewPoint	Pressure
Date
2010-12-22 23:00:00	45.9	37.8	1.0
2010-05-19 15:00:00	84.9	64.8	1.0
2010-02-24 12:00:00	61.9	44.5	1.0
2010-01-10 00:00:00	46.2	37.4	1.0
2010-12-21 19:00:00	50.7	38.8	1.0
...	...	...	...
2010-05-27 11:00:00	81.7	67.3	1.0
2010-08-11 03:00:00	75.8	71.3	1.0
2010-03-17 13:00:00	68.5	49.6	1.0
2010-10-28 14:00:00	75.2	53.9	1.0
2010-08-21 03:00:00	75.9	71.1	1.0

88 rows × 3 columns

Resampling¶

tiempo.head()

	Temperature	DewPoint	Pressure
Date
2010-01-01 00:00:00	46.2	37.5	1.0
2010-01-01 01:00:00	44.6	37.1	1.0
2010-01-01 02:00:00	44.1	36.9	1.0
2010-01-01 03:00:00	43.8	36.9	1.0
2010-01-01 04:00:00	43.5	36.8	1.0

tiempo.resample("5d").mean().head()

	Temperature	DewPoint	Pressure
Date
2010-01-01	49.720833	38.092500	1.0
2010-01-06	49.449167	37.575000	1.0
2010-01-11	49.222500	37.603333	1.0
2010-01-16	49.441667	37.650000	1.0
2010-01-21	50.683333	39.309167	1.0

tiempo.resample("5d").mean().head()

	Temperature	DewPoint	Pressure
Date
2010-01-01	49.720833	38.092500	1.0
2010-01-06	49.449167	37.575000	1.0
2010-01-11	49.222500	37.603333	1.0
2010-01-16	49.441667	37.650000	1.0
2010-01-21	50.683333	39.309167	1.0

tiempo.resample("5d").mean().head()

	Temperature	DewPoint	Pressure
Date
2010-01-01	49.720833	38.092500	1.0
2010-01-06	49.449167	37.575000	1.0
2010-01-11	49.222500	37.603333	1.0
2010-01-16	49.441667	37.650000	1.0
2010-01-21	50.683333	39.309167	1.0

tiempo.resample("30min").mean()[:15]

	Temperature	DewPoint	Pressure
Date
2010-01-01 00:00:00	46.2	37.5	1.0
2010-01-01 00:30:00	NaN	NaN	NaN
2010-01-01 01:00:00	44.6	37.1	1.0
2010-01-01 01:30:00	NaN	NaN	NaN
2010-01-01 02:00:00	44.1	36.9	1.0
2010-01-01 02:30:00	NaN	NaN	NaN
2010-01-01 03:00:00	43.8	36.9	1.0
2010-01-01 03:30:00	NaN	NaN	NaN
2010-01-01 04:00:00	43.5	36.8	1.0
2010-01-01 04:30:00	NaN	NaN	NaN
2010-01-01 05:00:00	43.0	36.5	1.0
2010-01-01 05:30:00	NaN	NaN	NaN
2010-01-01 06:00:00	43.1	36.3	1.0
2010-01-01 06:30:00	NaN	NaN	NaN
2010-01-01 07:00:00	42.3	35.9	1.0

subt=tiempo.between_time(start_time='1:00',end_time='12:00')
subt

	Temperature	DewPoint	Pressure
Date
2010-01-01 01:00:00	44.6	37.1	1.0
2010-01-01 02:00:00	44.1	36.9	1.0
2010-01-01 03:00:00	43.8	36.9	1.0
2010-01-01 04:00:00	43.5	36.8	1.0
2010-01-01 05:00:00	43.0	36.5	1.0
...	...	...	...
2010-12-31 08:00:00	42.5	36.1	1.0
2010-12-31 09:00:00	46.0	37.7	1.0
2010-12-31 10:00:00	49.4	38.0	1.0
2010-12-31 11:00:00	52.4	38.0	1.0
2010-12-31 12:00:00	54.7	37.9	1.0

4379 rows × 3 columns

tiempo.index.weekday

Int64Index([4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
            ...
            4, 4, 4, 4, 4, 4, 4, 4, 4, 4],
           dtype='int64', name='Date', length=8759)

tiempo.index.month

Int64Index([ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
            ...
            12, 12, 12, 12, 12, 12, 12, 12, 12, 12],
           dtype='int64', name='Date', length=8759)

tiempo.index.day

Int64Index([ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
            ...
            31, 31, 31, 31, 31, 31, 31, 31, 31, 31],
           dtype='int64', name='Date', length=8759)

tiempo.plot(style='.')

<matplotlib.axes._subplots.AxesSubplot at 0x7fb9e70c8250>

../_images/NOTES 02.04 - PANDAS_118_1.png

tiempo['2010-01'].plot()

<matplotlib.axes._subplots.AxesSubplot at 0x7fb9f2af66d0>

../_images/NOTES 02.04 - PANDAS_119_1.png

tiempo['2010-01-04'].plot()

<matplotlib.axes._subplots.AxesSubplot at 0x7fb9f2b8eed0>

../_images/NOTES 02.04 - PANDAS_120_1.png

Rolling operations¶

import pandas as pd
### permite obtener data frames directamente de internet
!pip install yfinance

Collecting yfinance
  Downloading https://files.pythonhosted.org/packages/c2/31/8b374a12b90def92a4e27d0fc595fc43635f395984e36a075244d98bd265/yfinance-0.1.54.tar.gz
Collecting pandas>=0.24
?25l  Downloading https://files.pythonhosted.org/packages/4a/6a/94b219b8ea0f2d580169e85ed1edc0163743f55aaeca8a44c2e8fc1e344e/pandas-1.0.3-cp37-cp37m-manylinux1_x86_64.whl (10.0MB)
     |████████████████████████████████| 10.0MB 453kB/s eta 0:00:01
?25hRequirement already satisfied: numpy>=1.15 in /opt/anaconda3/lib/python3.7/site-packages (from yfinance) (1.15.1)
Requirement already satisfied: requests>=2.20 in /opt/anaconda3/lib/python3.7/site-packages (from yfinance) (2.22.0)
Collecting multitasking>=0.0.7
  Downloading https://files.pythonhosted.org/packages/69/e7/e9f1661c28f7b87abfa08cb0e8f51dad2240a9f4f741f02ea839835e6d18/multitasking-0.0.9.tar.gz
Requirement already satisfied: pytz>=2017.2 in /opt/anaconda3/lib/python3.7/site-packages (from pandas>=0.24->yfinance) (2018.5)
Requirement already satisfied: python-dateutil>=2.6.1 in /opt/anaconda3/lib/python3.7/site-packages (from pandas>=0.24->yfinance) (2.7.3)
Requirement already satisfied: idna<2.9,>=2.5 in /opt/anaconda3/lib/python3.7/site-packages (from requests>=2.20->yfinance) (2.8)
Requirement already satisfied: certifi>=2017.4.17 in /opt/anaconda3/lib/python3.7/site-packages (from requests>=2.20->yfinance) (2019.9.11)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /opt/anaconda3/lib/python3.7/site-packages (from requests>=2.20->yfinance) (3.0.4)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /opt/anaconda3/lib/python3.7/site-packages (from requests>=2.20->yfinance) (1.24.2)
Requirement already satisfied: six>=1.5 in /opt/anaconda3/lib/python3.7/site-packages (from python-dateutil>=2.6.1->pandas>=0.24->yfinance) (1.13.0)
Building wheels for collected packages: yfinance, multitasking
  Building wheel for yfinance (setup.py) ... ?25ldone
?25h  Created wheel for yfinance: filename=yfinance-0.1.54-py2.py3-none-any.whl size=22411 sha256=d937a7a089b0883844df4d4f388c8bcc4dc145446c43e698a5181e6ebd099621
  Stored in directory: /home/rlx/.cache/pip/wheels/f9/e3/5b/ec24dd2984b12d61e0abf26289746c2436a0e7844f26f2515c
  Building wheel for multitasking (setup.py) ... ?25ldone
?25h  Created wheel for multitasking: filename=multitasking-0.0.9-cp37-none-any.whl size=8368 sha256=bfe866419b7d2ac5e39bab8a20d2378ac727b56d6dc6e3f248e2a4115ea368bb
  Stored in directory: /home/rlx/.cache/pip/wheels/37/fa/73/d492849e319038eb4d986f5152e4b19ffb1bc0639da84d2677
Successfully built yfinance multitasking
Installing collected packages: pandas, multitasking, yfinance
  Found existing installation: pandas 0.23.4
    Uninstalling pandas-0.23.4:
      Successfully uninstalled pandas-0.23.4
Successfully installed multitasking-0.0.9 pandas-1.0.3 yfinance-0.1.54

import yfinance as yf

#define the ticker symbol
tickerSymbol = 'MSFT'

#get data on this ticker
tickerData = yf.Ticker(tickerSymbol)

#get the historical prices for this ticker
gs = tickerData.history(period='1d', start='2010-1-1', end='2020-1-25')

#see your data
gs

	Open	High	Low	Close	Volume	Dividends	Stock Splits
Date
2010-01-04	24.04	24.41	24.01	24.29	38409100	0.0	0
2010-01-05	24.22	24.41	24.05	24.30	49749600	0.0	0
2010-01-06	24.24	24.40	23.96	24.15	58182400	0.0	0
2010-01-07	24.04	24.10	23.70	23.90	50559700	0.0	0
2010-01-08	23.77	24.24	23.74	24.07	51197400	0.0	0
...	...	...	...	...	...	...	...
2020-01-17	166.96	167.01	164.98	166.64	34371700	0.0	0
2020-01-21	166.23	167.73	165.98	166.05	29517200	0.0	0
2020-01-22	166.94	167.03	165.23	165.25	24138800	0.0	0
2020-01-23	165.74	166.35	164.82	166.27	19680800	0.0	0
2020-01-24	167.05	167.07	164.00	164.59	24918100	0.0	0

2532 rows × 7 columns

gs.Close.rolling(10).mean().head(20)

Date
2010-01-04       NaN
2010-01-05       NaN
2010-01-06       NaN
2010-01-07       NaN
2010-01-08       NaN
2010-01-11       NaN
2010-01-12       NaN
2010-01-13       NaN
2010-01-14       NaN
2010-01-15    24.041
2010-01-19    24.053
2010-01-20    24.024
2010-01-21    23.965
2010-01-22    23.848
2010-01-25    23.742
2010-01-26    23.682
2010-01-27    23.651
2010-01-28    23.558
2010-01-29    23.340
2010-02-01    23.148
Name: Close, dtype: float64

plt.figure(figsize=(20,3))
plt.plot(gs.Close)
plt.plot(gs.Close.rolling(50).mean())

[<matplotlib.lines.Line2D at 0x7fb9f289bc90>]

../_images/NOTES 02.04 - PANDAS_127_1.png

plt.figure(figsize=(20,3))
plt.plot(gs.iloc[:400].Close, label="original")
plt.plot(gs.iloc[:400].Close.rolling(50).mean(), label="rolling")
plt.plot(gs.iloc[:400].Close.rolling(50, center=True).mean(), label="center")
plt.legend();

../_images/NOTES 02.04 - PANDAS_128_0.png

plt.figure(figsize=(20,3))
plt.plot(gs.iloc[:400].Close.rolling(10).mean())

[<matplotlib.lines.Line2D at 0x7fb9f26eff50>]

../_images/NOTES 02.04 - PANDAS_129_1.png

Inteligencia Artificial para las Ciencias e Ingenierías

02.04 - PANDAS

Contents

02.04 - PANDAS¶

`pandas` is mostly about manipulating tables of data¶

Pandas main object is a `DataFrame`¶

DataFrame indexing¶

Managing data¶

Fixing data¶

Filtering¶

Locating¶

Time series operations¶

Downsampling¶

Upsampling¶

Building Dataframes from other structures¶

`.values` access the underlying `numpy` structure¶

some out-of-the-box plotting¶

Grouping¶

Time series¶

Resampling¶

Rolling operations¶

	uno	dos	tres	cuatro	cinco
10	6	4	8	2	0
11	8	9	7	4	9
12	3	3	5	0	2
13	8	9	3	9	1
14	6	9	4	5	9
15	8	6	1	0	1
16	9	6	1	7	2
17	0	7	4	6	1
18	2	2	5	3	7
19	4	9	3	3	9
20	7	3	5	6	8
21	6	7	6	2	8
22	1	2	5	2	8
23	6	7	6	9	4
24	9	4	5	5	1
25	4	4	5	9	8
26	4	4	1	1	2
27	2	8	9	7	7
28	3	4	3	0	5
29	5	8	5	6	2

	uno	dos	tres	cuatro	cinco
10	6	4	8	2	0
11	8	9	7	4	9
12	3	3	5	0	2
13	8	9	3	9	1
14	6	9	4	5	9
15	8	6	1	0	1
16	9	6	1	7	2
17	0	7	4	6	1
18	2	2	5	3	7
19	4	9	3	3	9
20	7	3	5	6	8
21	6	7	6	2	8
22	1	2	5	2	8
23	6	7	6	9	4
24	9	4	5	5	1
25	4	4	5	9	8
26	4	4	1	1	2
27	2	8	9	7	7
28	3	4	3	0	5
29	5	8	5	6	2

Inteligencia Artificial para las Ciencias e Ingenierías

02.04 - PANDAS

Contents

02.04 - PANDAS¶

pandas is mostly about manipulating tables of data¶

Pandas main object is a DataFrame¶

DataFrame indexing¶

Managing data¶

Fixing data¶

Filtering¶

Locating¶

Time series operations¶

Downsampling¶

Upsampling¶

Building Dataframes from other structures¶

.values access the underlying numpy structure¶

some out-of-the-box plotting¶

Grouping¶

Time series¶

Resampling¶

Rolling operations¶

`pandas` is mostly about manipulating tables of data¶

Pandas main object is a `DataFrame`¶

`.values` access the underlying `numpy` structure¶

	uno	dos	tres	cuatro	cinco
10	6	4	8	2	0
11	8	9	7	4	9
12	3	3	5	0	2
13	8	9	3	9	1
14	6	9	4	5	9
15	8	6	1	0	1
16	9	6	1	7	2
17	0	7	4	6	1
18	2	2	5	3	7
19	4	9	3	3	9
20	7	3	5	6	8
21	6	7	6	2	8
22	1	2	5	2	8
23	6	7	6	9	4
24	9	4	5	5	1
25	4	4	5	9	8
26	4	4	1	1	2
27	2	8	9	7	7
28	3	4	3	0	5
29	5	8	5	6	2