LAB 02.03 - Pandas

!wget --no-cache -O init.py -q https://raw.githubusercontent.com/rramosp/ai4eng.v1/main/content/init.py
import init; init.init(force_download=False); init.get_weblink()
from local.lib.rlxmoocapi import submit, session
session.LoginSequence(endpoint=init.endpoint, course_id=init.course_id, lab_id="L02.03", varname="student");
import numpy as np
import pandas as pd

Task 1: Extract data

we have a dataframe of items with a price and, sometimes, with an extra column (margin).

You will have to complete a function that will filter the dataframe selecting the rows:

  • whose price is > 100

  • OR whose margin is >10, if the column margin is present

your function must return A LIST with the item ids of the selected rows

For instance, with the following dataframe

         price  category
itemid                  
56556   108.15       2.0
73065    83.54       2.0
36619   114.42       0.0
73414    82.94       0.0
13410   115.13       2.0
66153    91.50       1.0
77380    85.82       0.0
73249    95.95       0.0
11654   100.22       2.0
11972    77.16       1.0

your must return the following list:

 [56556, 36619, 13410, 11654]

But if you get the following dataframe

         price  category  margin
itemid                          
39059    98.11       0.0   11.04
19526    98.11       1.0   11.25
78176    94.34       1.0   10.51
50948   102.37       1.0   10.77
12111    98.07       1.0    8.50
56191    98.53       1.0   11.65
38887    91.49       2.0   11.24
77915   117.30       0.0    8.64
55010    96.13       0.0    8.95
45925    98.59       1.0   10.45

you must return the following list

[39059, 19526, 78176, 50948, 56191, 38887, 77915, 45925]
def create_df(missing=False, n=10):
    itemid   = np.random.randint(100000, size=n)+1000
    category = np.random.randint(3, size=n)
    price    = np.round(np.random.normal(loc=100, scale=10, size=n),2)
    margin   = np.round(np.random.normal(loc=10, scale=1, size=n),2)
    
    if missing:
        nmissing = np.random.randint(len(price)//2)+2                                     
        price[np.random.permutation(len(price))[:nmissing]] = np.nan
    
    d = pd.DataFrame(np.r_[[price, category, margin]].T, index=itemid, columns=["price", "category", "margin"])
    d.index.name="itemid"
    if np.random.random()>.5:
        d = d[d.columns[:2]]
        
    return d
d = create_df()
d
def select_items(df):
    # make sure to make a copy in case you modify the original df
    df = df.copy()
    
    ... # YOUR CODE HERE
    
    result = ...
    
    return result

manually check your answer

d = create_df()
d
select_items(d)

submit your code

student.submit_task(globals(), task_id="task_01");

Task 2: Group statistics

Complete the following function so that it returns a dataframe with the average, max and min prices per category.

For instance, for the following dataframe

         price  category  margin
itemid                          
17946    93.85       1.0   10.64
61190    91.72       1.0    9.76
39639   100.16       1.0   10.67
17791   110.44       2.0    9.65
7333    101.05       1.0    9.69
77362   122.33       0.0   11.14
92646   108.13       2.0   10.58
27797    85.52       2.0   10.88
31746    97.56       0.0    9.75
12355   101.04       2.0    9.51

you should return the following dataframe

             media  maximo  minimo
categoria                         
0         109.9450  122.33  97.56
1          96.6950  101.05  91.72
2         101.2825  110.44  85.52

observe that your result

  • must not be a multilevel columnset.

  • the column names and the index name must be exactly as in the example.

  • the index must be of type int.

def get_stats(df):
    # make sure to make a copy in case you modify the original df
    df = df.copy()
        
    ... # YOUR CODE HERE    
    result = ...
    return result

manually check your answer

d = create_df()
d
get_stats(d)

submit your code

student.submit_task(globals(), task_id="task_02");

Task 3: Fill in missing data

Fill in the missing data in the price column with the following procedure:

For instance, for this input dataframe:

         price  category  margin
itemid                          
18922      NaN       1.0   10.32
69500   121.25       1.0   10.22
76442    90.25       1.0   12.60
33863   106.51       0.0   10.26
15904    95.87       1.0   11.51
41946   103.47       2.0    9.85
85451    93.08       2.0    9.56
70028   116.68       1.0    9.11
26860      NaN       2.0    9.71
12807    91.48       0.0    9.77

your solution might be similar to this (not exactly the same as you will be sampling data):

             price  category  margin
itemid                              
18922    97.441188       1.0   10.32
69500   121.250000       1.0   10.22
76442    90.250000       1.0   12.60
33863   106.510000       0.0   10.26
15904    95.870000       1.0   11.51
41946   103.470000       2.0    9.85
85451    93.080000       2.0    9.56
70028   116.680000       1.0    9.11
26860   103.294843       2.0    9.71
12807    91.480000       0.0    9.77

WARN: your function must not modify the original dataframe, make a copy of the input dataframe, fill the values in the copy and return it.

HINT: use the isna() method of a dataframe or a series.

d = create_df(missing=True)
d
def fillna(df):
    # make sure to make a copy in case you modify the original df
    df = df.copy()
    
    ... # YOUR CODE HERE    
    result = ...
    return result

check manually your code

d = create_df(missing=True)
d
fillna(d)

submit your code

student.submit_task(globals(), task_id="task_03");