LAB 06.01 - Clustering companies

!wget --no-cache -O init.py -q https://raw.githubusercontent.com/rramosp/ai4eng.v1/main/content/init.py
import init; init.init(force_download=False); init.get_weblink()
from local.lib.rlxmoocapi import submit, session
session.LoginSequence(endpoint=init.endpoint, course_id=init.course_id, lab_id="L06.01", varname="student");

Dataset

observe the following dataset, which contains daily stock price movements for different companies

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import Image
%matplotlib inline
d = pd.read_csv("local/data/company-stock-movements-2010-2015-incl.csv.gz", index_col=0)
d.shape
(60, 963)
d.head()
2010-01-04 2010-01-05 2010-01-06 2010-01-07 2010-01-08 2010-01-11 2010-01-12 2010-01-13 2010-01-14 2010-01-15 ... 2013-10-16 2013-10-17 2013-10-18 2013-10-21 2013-10-22 2013-10-23 2013-10-24 2013-10-25 2013-10-28 2013-10-29
Apple 0.580000 -0.220005 -3.409998 -1.170000 1.680011 -2.689994 -1.469994 2.779997 -0.680003 -4.999995 ... 0.320008 4.519997 2.899987 9.590019 -6.540016 5.959976 6.910011 -5.359962 0.840019 -19.589981
AIG -0.640002 -0.650000 -0.210001 -0.420000 0.710001 -0.200001 -1.130001 0.069999 -0.119999 -0.500000 ... 0.919998 0.709999 0.119999 -0.480000 0.010002 -0.279998 -0.190003 -0.040001 -0.400002 0.660000
Amazon -2.350006 1.260009 -2.350006 -2.009995 2.960006 -2.309997 -1.640007 1.209999 -1.790001 -2.039994 ... 2.109985 3.699982 9.570008 -3.450013 4.820008 -4.079986 2.579986 4.790009 -1.760009 3.740021
American express 0.109997 0.000000 0.260002 0.720002 0.190003 -0.270001 0.750000 0.300004 0.639999 -0.130001 ... 0.680001 2.290001 0.409996 -0.069999 0.100006 0.069999 0.130005 1.849999 0.040001 0.540001
Boeing 0.459999 1.770000 1.549999 2.690003 0.059997 -1.080002 0.360000 0.549999 0.530002 -0.709999 ... 1.559997 2.480003 0.019997 -1.220001 0.480003 3.020004 -0.029999 1.940002 1.130005 0.309998

5 rows × 963 columns

d.index
Index(['Apple', 'AIG', 'Amazon', 'American express', 'Boeing',
       'Bank of America', 'British American Tobacco', 'Canon', 'Caterpillar',
       'Colgate-Palmolive', 'ConocoPhillips', 'Cisco', 'Chevron',
       'DuPont de Nemours', 'Dell', 'Ford', 'General Electrics',
       'Google/Alphabet', 'Goldman Sachs', 'GlaxoSmithKline', 'Home Depot',
       'Honda', 'HP', 'IBM', 'Intel', 'Johnson & Johnson', 'JPMorgan Chase',
       'Kimberly-Clark', 'Coca Cola', 'Lookheed Martin', 'MasterCard',
       'McDonalds', '3M', 'Microsoft', 'Mitsubishi', 'Navistar',
       'Northrop Grumman', 'Novartis', 'Pepsi', 'Pfizer', 'Procter Gamble',
       'Philip Morris', 'Royal Dutch Shell', 'SAP', 'Schlumberger', 'Sony',
       'Sanofi-Aventis', 'Symantec', 'Toyota', 'Total',
       'Taiwan Semiconductor Manufacturing', 'Texas instruments', 'Unilever',
       'Valero Energy', 'Walgreen', 'Wells Fargo', 'Wal-Mart', 'Exxon',
       'Xerox', 'Yahoo'],
      dtype='object')

TASK 1: convert d into a signed dataframe

turn every value into 1 if it is > 0, and into -1 otherwise

the resulting dataset must look like the following image

HINT: use numpy.sign

Image("local/imgs/stock_signed.png")
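a minimal sketch of what numpy.sign does, on a small made-up array (not part of the lab data). note that np.sign maps values that are exactly zero to 0, so check how that interacts with the "1 if >0 and -1 otherwise" rule above.

import numpy as np
toy = np.array([[ 0.58, -0.22,  0.00],
                [-0.64,  0.71, -3.41]])
np.sign(toy)   # -> array([[ 1., -1.,  0.],
               #           [-1.,  1., -1.]])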
def signed(d):
    result = ....
    return result

manually check your code

signed(d)

submit your code

student.submit_task(globals(), task_id="task_01");

TASK 2: implement a custom metric

the intuition is that we want our metric to reward sets of symbols moving in sync.

So, given a matrix:

  1. for each column, count the number of positions holding the most frequent value (that is, the larger of the number of 1’s and the number of -1’s).

  2. divide by the number of elements in the column

  3. average the value for all columns

for instance, for the following matrix

m  = np.array([[-1,  1,  1, -1],
               [-1, -1,  1, -1],
               [-1, -1,  1,  1],
               [-1, -1, -1,  1],
               [-1,  1,  1, -1],
               [-1, -1,  1, -1]])
  1. the number of positions with equal value for each column is 6, 4, 5, 4

  2. normalizing by the number of elements per column (6) gives 1, 0.6667, 0.8333, 0.6667

  3. the mean of the previous result is about 0.792
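as a quick check of the arithmetic above (this only spells out the worked example, it is not a suggested implementation of mean_sync_move):

equal_counts = np.array([6, 4, 5, 4])   # step 1: one count per column
fractions = equal_counts / 6            # step 2: about [1, 0.667, 0.833, 0.667]
fractions.mean()                        # step 3: about 0.792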

complete the following function to compute this metric, assuming the input matrix m is a numpy array containing only values of -1 and 1

suggested strategy: loop over the columns, compute the number of 1’s and -1’s and keep whichever is greater in each column, divide by the length of m (you will then have one number per column), and take the average.

challenge 1: solve it in one line

challenge 2: solve it without a loop, using the axis argument of np.sum and np.max
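if you try challenge 2, this is a minimal sketch of what the axis argument does, on a made-up matrix unrelated to the exercise:

a = np.array([[1, 2],
              [3, 4],
              [5, 6]])
np.sum(a, axis=0)   # column-wise sums   -> array([ 9, 12])
np.max(a, axis=0)   # column-wise maxima -> array([5, 6])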

def mean_sync_move(m):
    result = ...
    return result

manually check your code

mean_sync_move(m)

the following must return 1

mean_sync_move(np.ones((10,2)))

obtain the mean_sync_move of the full dataset AFTER converting it into signed. It should be around 0.72

mean_sync_move(signed(d).values)

submit your code

student.submit_task(globals(), task_id="task_02");

TASK 3: cluster the full dataset

complete the following function such that upon receiving a dataset:

  • obtains the signed dataset

  • uses sklearn KMeans to cluster the dataset with the given number of clusters (n_clusters)

  • predicts the cluster number of each company

  • for each cluster number:

    • filters the signed dataset so that it keeps the companies belonging to that cluster.

    • computes the mean_sync_move of the resulting filtered signed dataset.

your function must return a dataframe with one row per cluster containing

  • the cluster number

  • the number of companies in the cluster

  • the mean_sync_move of each cluster

for instance, for 4 clusters your resulting dataframe should look like the following one. The cluster numbers might be in a different order, but the index and the columns MUST be as illustrated.

important: use random_state=2 when creating the KMeans instance.
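if you have not used KMeans before, this is a minimal sketch of the fit_predict pattern on made-up random data (the variable names and data are illustrative only, this is not the solution):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.random(size=(20, 5))                             # 20 items, 5 features
labels = KMeans(n_clusters=3, random_state=2).fit_predict(X)   # one cluster id per item
labels                                                         # e.g. array([2, 0, 1, ...])
X[labels == 0]                                                 # rows assigned to cluster 0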

Image(filename='local/imgs/labclusters.png')
from sklearn.cluster import KMeans

def cluster_dataset(d, n_clusters):
    
    def signed(d):
        return ... # your code here
    
    def mean_sync_move(m):
        return ... # your code here
    
    d = signed(d)
    
    c = KMeans(....)
    y = c.fit_predict(...)
    r = pd.DataFrame(...., columns=["cluster", "nb_companies", "mean_sync_move"])    
   
    r.index = r.cluster.values.astype(int)
    r.index.name = "cluster"
    
    del(r["cluster"])
    return r
cluster_dataset(d, n_clusters=4)

submit your code

student.submit_task(globals(), task_id="task_03");

Compare with PCA

we apply PCA with 2 components to the signed (+1/-1) dataset and visualize the KMeans clusters

from sklearn.decomposition import PCA
n_clusters = 7

X = PCA(n_components=2).fit_transform(signed(d).values)
y = KMeans(n_clusters=n_clusters).fit_predict(signed(d).values)
cmap = plt.cm.hot
plt.figure(figsize=(20,10))
plt.scatter(X[:,0], X[:,1], color=cmap((y*255./(n_clusters-1)).astype(int)), s=100, edgecolor="black", lw=2)
for i, name in enumerate(d.index):
    plt.text(X[i,0]+.1, X[i,1]+.1, name, fontsize=14)
plt.axis("off");
[figure: companies plotted on the first two PCA components, colored by KMeans cluster]


Other suggested exercises

Try the following exercises from O’Reilly’s Python Data Science Handbook and its associated notebook repository, in the chapter about clustering:

  • Clustering, Exercise 1: k-Means on Digits

  • Clustering, Exercise 2: k-Means for Color Compression