LAB 02.03 - Pandas
Contents
LAB 02.03 - Pandas¶
!wget --no-cache -O init.py -q https://raw.githubusercontent.com/rramosp/ai4eng.v1/main/content/init.py
import init; init.init(force_download=False); init.get_weblink()
from local.lib.rlxmoocapi import submit, session
session.LoginSequence(endpoint=init.endpoint, course_id=init.course_id, lab_id="L02.03", varname="student");
import numpy as np
import pandas as pd
Task 1: Extract data¶
we have a dataframe of items with a price and, sometimes, with an extra column (margin
).
You will have to complete a function that will filter the dataframe selecting the rows:
whose price is > 100
OR whose margin is >10, if the column
margin
is present
your function must return A LIST with the item ids of the selected rows
For instance, with the following dataframe
price category
itemid
56556 108.15 2.0
73065 83.54 2.0
36619 114.42 0.0
73414 82.94 0.0
13410 115.13 2.0
66153 91.50 1.0
77380 85.82 0.0
73249 95.95 0.0
11654 100.22 2.0
11972 77.16 1.0
your must return the following list:
[56556, 36619, 13410, 11654]
But if you get the following dataframe
price category margin
itemid
39059 98.11 0.0 11.04
19526 98.11 1.0 11.25
78176 94.34 1.0 10.51
50948 102.37 1.0 10.77
12111 98.07 1.0 8.50
56191 98.53 1.0 11.65
38887 91.49 2.0 11.24
77915 117.30 0.0 8.64
55010 96.13 0.0 8.95
45925 98.59 1.0 10.45
you must return the following list
[39059, 19526, 78176, 50948, 56191, 38887, 77915, 45925]
def create_df(missing=False, n=10):
itemid = np.random.randint(100000, size=n)+1000
category = np.random.randint(3, size=n)
price = np.round(np.random.normal(loc=100, scale=10, size=n),2)
margin = np.round(np.random.normal(loc=10, scale=1, size=n),2)
if missing:
nmissing = np.random.randint(len(price)//2)+2
price[np.random.permutation(len(price))[:nmissing]] = np.nan
d = pd.DataFrame(np.r_[[price, category, margin]].T, index=itemid, columns=["price", "category", "margin"])
d.index.name="itemid"
if np.random.random()>.5:
d = d[d.columns[:2]]
return d
d = create_df()
d
def select_items(df):
# make sure to make a copy in case you modify the original df
df = df.copy()
... # YOUR CODE HERE
result = ...
return result
manually check your answer
d = create_df()
d
select_items(d)
submit your code
student.submit_task(globals(), task_id="task_01");
Task 2: Group statistics¶
Complete the following function so that it returns a dataframe with the average, max and min prices per category.
For instance, for the following dataframe
price category margin
itemid
17946 93.85 1.0 10.64
61190 91.72 1.0 9.76
39639 100.16 1.0 10.67
17791 110.44 2.0 9.65
7333 101.05 1.0 9.69
77362 122.33 0.0 11.14
92646 108.13 2.0 10.58
27797 85.52 2.0 10.88
31746 97.56 0.0 9.75
12355 101.04 2.0 9.51
you should return the following dataframe
media maximo minimo
categoria
0 109.9450 122.33 97.56
1 96.6950 101.05 91.72
2 101.2825 110.44 85.52
observe that your result
must not be a multilevel columnset.
the column names and the index name must be exactly as in the example.
the index must be of type int.
def get_stats(df):
# make sure to make a copy in case you modify the original df
df = df.copy()
... # YOUR CODE HERE
result = ...
return result
manually check your answer
d = create_df()
d
get_stats(d)
submit your code
student.submit_task(globals(), task_id="task_02");
Task 3: Fill in missing data¶
Fill in the missing data in the price column with the following procedure:
compute the mean and std of the avaialable prices
sample from a normal distribution with the computed mean and std (see [
np.random.normal
])(https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html), as many samples as missing datasubstitute the missing values with the samples
For instance, for this input dataframe:
price category margin
itemid
18922 NaN 1.0 10.32
69500 121.25 1.0 10.22
76442 90.25 1.0 12.60
33863 106.51 0.0 10.26
15904 95.87 1.0 11.51
41946 103.47 2.0 9.85
85451 93.08 2.0 9.56
70028 116.68 1.0 9.11
26860 NaN 2.0 9.71
12807 91.48 0.0 9.77
your solution might be similar to this (not exactly the same as you will be sampling data):
price category margin
itemid
18922 97.441188 1.0 10.32
69500 121.250000 1.0 10.22
76442 90.250000 1.0 12.60
33863 106.510000 0.0 10.26
15904 95.870000 1.0 11.51
41946 103.470000 2.0 9.85
85451 93.080000 2.0 9.56
70028 116.680000 1.0 9.11
26860 103.294843 2.0 9.71
12807 91.480000 0.0 9.77
WARN: your function must not modify the original dataframe, make a copy of the input dataframe, fill the values in the copy and return it.
HINT: use the isna()
method of a dataframe or a series.
d = create_df(missing=True)
d
def fillna(df):
# make sure to make a copy in case you modify the original df
df = df.copy()
... # YOUR CODE HERE
result = ...
return result
check manually your code
d = create_df(missing=True)
d
fillna(d)
submit your code
student.submit_task(globals(), task_id="task_03");