LAB 02.03 - Pandas

LAB 02.03 - Pandas#

!wget --no-cache -O init.py -q https://raw.githubusercontent.com/rramosp/ai4eng.v1/main/content/init.py
import init; init.init(force_download=False); init.get_weblink()

from local.lib.rlxmoocapi import submit, session
session.LoginSequence(endpoint=init.endpoint, course_id=init.course_id, lab_id="L02.03", varname="student");

import numpy as np
import pandas as pd

Task 1: Extract data#

we have a dataframe of items with a price and, sometimes, with an extra column (margin).

You will have to complete a function that will filter the dataframe selecting the rows:

whose price is > 100
OR whose margin is >10, if the column margin is present

your function must return A LIST with the item ids of the selected rows

For instance, with the following dataframe

         price  category
itemid                  
 108.15       2.0
  83.54       2.0
 114.42       0.0
  82.94       0.0
 115.13       2.0
  91.50       1.0
  85.82       0.0
  95.95       0.0
 100.22       2.0
  77.16       1.0

your must return the following list:

 [56556, 36619, 13410, 11654]

But if you get the following dataframe

         price  category  margin
itemid                          
  98.11       0.0   11.04
  98.11       1.0   11.25
  94.34       1.0   10.51
 102.37       1.0   10.77
  98.07       1.0    8.50
  98.53       1.0   11.65
  91.49       2.0   11.24
 117.30       0.0    8.64
  96.13       0.0    8.95
  98.59       1.0   10.45

you must return the following list

[39059, 19526, 78176, 50948, 56191, 38887, 77915, 45925]

def create_df(missing=False, n=10):
    itemid   = np.random.randint(100000, size=n)+1000
    category = np.random.randint(3, size=n)
    price    = np.round(np.random.normal(loc=100, scale=10, size=n),2)
    margin   = np.round(np.random.normal(loc=10, scale=1, size=n),2)
    
    if missing:
        nmissing = np.random.randint(len(price)//2)+2                                     
        price[np.random.permutation(len(price))[:nmissing]] = np.nan
    
    d = pd.DataFrame(np.r_[[price, category, margin]].T, index=itemid, columns=["price", "category", "margin"])
    d.index.name="itemid"
    if np.random.random()>.5:
        d = d[d.columns[:2]]
        
    return d

d = create_df()
d

def select_items(df):
    # make sure to make a copy in case you modify the original df
    df = df.copy()
    
    ... # YOUR CODE HERE
    
    result = ...
    
    return result

manually check your answer

d = create_df()
d

select_items(d)

submit your code

student.submit_task(globals(), task_id="task_01");

Task 2: Group statistics#

Complete the following function so that it returns a dataframe with the average, max and min prices per category.

For instance, for the following dataframe

         price  category  margin
itemid                          
  93.85       1.0   10.64
  91.72       1.0    9.76
 100.16       1.0   10.67
 110.44       2.0    9.65
  101.05       1.0    9.69
 122.33       0.0   11.14
 108.13       2.0   10.58
  85.52       2.0   10.88
  97.56       0.0    9.75
 101.04       2.0    9.51

you should return the following dataframe

             media  maximo  minimo
categoria                         
0         109.9450  122.33  97.56
1          96.6950  101.05  91.72
2         101.2825  110.44  85.52

observe that your result

must not be a multilevel columnset.
the column names and the index name must be exactly as in the example.
the index must be of type int.

def get_stats(df):
    # make sure to make a copy in case you modify the original df
    df = df.copy()
        
    ... # YOUR CODE HERE    
    result = ...
    return result

manually check your answer

d = create_df()
d

get_stats(d)

submit your code

student.submit_task(globals(), task_id="task_02");

Task 3: Fill in missing data#

Fill in the missing data in the price column with the following procedure:

compute the mean and std of the avaialable prices
sample from a normal distribution with the computed mean and std (see [np.random.normal])(https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html), as many samples as missing data
substitute the missing values with the samples

For instance, for this input dataframe:

         price  category  margin
itemid                          
    NaN       1.0   10.32
 121.25       1.0   10.22
  90.25       1.0   12.60
 106.51       0.0   10.26
  95.87       1.0   11.51
 103.47       2.0    9.85
  93.08       2.0    9.56
 116.68       1.0    9.11
    NaN       2.0    9.71
  91.48       0.0    9.77

your solution might be similar to this (not exactly the same as you will be sampling data):

             price  category  margin
itemid                              
  97.441188       1.0   10.32
 121.250000       1.0   10.22
  90.250000       1.0   12.60
 106.510000       0.0   10.26
  95.870000       1.0   11.51
 103.470000       2.0    9.85
  93.080000       2.0    9.56
 116.680000       1.0    9.11
 103.294843       2.0    9.71
  91.480000       0.0    9.77

WARN: your function must not modify the original dataframe, make a copy of the input dataframe, fill the values in the copy and return it.

HINT: use the isna() method of a dataframe or a series.

d = create_df(missing=True)
d

def fillna(df):
    # make sure to make a copy in case you modify the original df
    df = df.copy()
    
    ... # YOUR CODE HERE    
    result = ...
    return result

check manually your code

d = create_df(missing=True)
d

fillna(d)

submit your code

student.submit_task(globals(), task_id="task_03");

LAB 02.03 - Pandas

Contents

LAB 02.03 - Pandas#

Task 1: Extract data#

Task 2: Group statistics#

Task 3: Fill in missing data#