33 changes: 33 additions & 0 deletions 02_activities/assignments/DC_Cohort/Assignment2.md
@@ -59,6 +59,31 @@ The store wants to keep customer addresses. Propose two architectures for the CU
Your answer...
```

Type 1: Overwriting old data
Overwriting the old data is the simplest solution: when a customer changes their address, the new values replace the old ones and the old address is lost. The number of rows equals the number of customers, because each customer has exactly one row.

CUSTOMER_ADDRESS
customer_id (FK to customers)
street_address
city_or_town
province_or_state
postal_code
country

Type 2: Retaining changes
Retaining previous data increases the amount of information to store and handle, but it allows users to go back through the dataset and retrieve old addresses, and potentially the number of moves. The number of rows here reflects the number of customers plus the number of times customers have moved, so a single customer has multiple rows if they have moved multiple times. Customers who continue to move will add more rows.

CUSTOMER_ADDRESS
customer_address_id (PK)
customer_id (FK to customers)
street_address
city_or_town
province_or_state
postal_code
country
is_this_current_address
date_address_updated
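
The difference between the two architectures can be sketched with an in-memory SQLite database. This is a minimal illustration, not part of the assignment: the table names `customer_address_type1` and `customer_address_type2`, the helper functions, and the sample addresses are hypothetical, and only a subset of the columns listed above is included to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_address_type1 (      -- hypothetical name, Type 1
    customer_id    INTEGER PRIMARY KEY,    -- one row per customer
    street_address TEXT
);
CREATE TABLE customer_address_type2 (      -- hypothetical name, Type 2
    customer_address_id     INTEGER PRIMARY KEY,
    customer_id             INTEGER,
    street_address          TEXT,
    is_this_current_address INTEGER,       -- 1 = current, 0 = historical
    date_address_updated    TEXT
);
""")

def move_type1(customer_id, street):
    # Type 1: replace the customer's single row; the old address is lost.
    conn.execute(
        "INSERT OR REPLACE INTO customer_address_type1 "
        "(customer_id, street_address) VALUES (?, ?)",
        (customer_id, street),
    )

def move_type2(customer_id, street, moved_on):
    # Type 2: flag the previous row as historical, then add a new current row.
    conn.execute(
        "UPDATE customer_address_type2 SET is_this_current_address = 0 "
        "WHERE customer_id = ? AND is_this_current_address = 1",
        (customer_id,),
    )
    conn.execute(
        "INSERT INTO customer_address_type2 "
        "(customer_id, street_address, is_this_current_address, date_address_updated) "
        "VALUES (?, ?, 1, ?)",
        (customer_id, street, moved_on),
    )

# The same customer registers an address, then moves once (made-up data).
for street, moved_on in [("1 First St", "2020-01-01"), ("2 Second Ave", "2023-06-15")]:
    move_type1(1, street)
    move_type2(1, street, moved_on)

type1_rows = conn.execute(
    "SELECT street_address FROM customer_address_type1"
).fetchall()
type2_rows = conn.execute(
    "SELECT street_address, is_this_current_address "
    "FROM customer_address_type2 ORDER BY date_address_updated"
).fetchall()
print(type1_rows)
print(type2_rows)
```

After one move, the Type 1 table still holds a single row for the customer, while the Type 2 table holds two rows, with only the newest flagged as current.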

***

## Section 2:
@@ -193,3 +218,11 @@ Consider, for example, concepts of labour, bias, LLM proliferation, moderating c
```
Your thoughts...
```

Neural nets are just people all the way down

I think the ethical issues most prevalent in this story can be summarized as the hidden sources of neural network training data, how far back you have to go to find the real sources of that data, and the introduction of subconscious biases at every additional step. Constructing a large training dataset that associates keywords with images is a great premise and clearly benefits downstream AI processes, but the methods used to build it are critical. When the dataset is constructed by hundreds of thousands of people manually identifying pictures and associating them with keywords, their snap judgments carry their own biases, stereotypes, and associations into the data. Most of these are likely harmless, but it is inevitable that some will carry harmful connotations. In the situation discussed here, the problem was made worse by using the Brown Corpus to construct the one-million-keyword database; the corpus was established in the 1960s and does not reflect the cultural changes that have occurred since then. As a result, offensive terms were included, which biased the resulting AI algorithms toward outdated and harmful societal norms. This emphasizes the need for a system that can fix these types of issues.

These systematic issues are unfortunately common in AI construction, and neural networks are only as good as their training data. I took a course early in my PhD on biases in STEM and how to address them, and we frequently discussed how biases in training data can cause downstream effects. A famous example in everyday life comes from motion-activated appliances like soap dispensers or hand dryers, where some models were unable to detect the hands of people with dark skin tones. This issue occurred because the staff who constructed and programmed the appliances almost entirely had light skin, and, without intending to, their programming and testing procedures biased the result against other skin tones. The issue highlights the importance of considering diversity in the construction of technology, and of adequately testing it using diverse real-world data.

My big takeaway from this article was the importance of ensuring that the original sources of incorporated data are known and discussed honestly in the context of model construction, along with the potential biases in those data. Such biases may be obvious, such as differences in social norms between the 1960s and 2020s, but they can also be subtle yet still impactful if data are collected today and intrinsic biases are left unaccounted for.
178 changes: 155 additions & 23 deletions 02_activities/assignments/DC_Cohort/assignment2.sql
@@ -22,8 +22,14 @@ The `||` values concatenate the columns into strings.
Edit the appropriate columns -- you're making two edits -- and the NULL rows will be fixed.
All the other rows will remain the same. */
--QUERY 1


SELECT
product_name || ', ' ||
COALESCE(product_size, '') ||
' (' ||
COALESCE(product_qty_type, 'unit') ||
')'
AS this_is_not_a_table
FROM product;


--END QUERY
@@ -40,8 +46,16 @@ each new market date for each customer, or select only the unique market dates p
HINT: One of these approaches uses ROW_NUMBER() and one uses DENSE_RANK().
Filter the visits to dates before April 29, 2022. */
--QUERY 2


SELECT
customer_id,
market_date,
DENSE_RANK() OVER( -- DENSE_RANK so several purchases on the same market date share one visit number
PARTITION BY customer_id
ORDER BY market_date
) AS visit_number
FROM customer_purchases
WHERE market_date < '2022-04-29'
ORDER BY customer_id, market_date;


--END QUERY
@@ -52,8 +66,37 @@ then write another query that uses this one as a subquery (or temp table) and fi
only the customer’s most recent visit.
HINT: Do not use the previous visit dates filter. */
--QUERY 3


-- This block reverses numbering so each customer's most recent visit is labeled 1
SELECT
customer_id,
market_date,
DENSE_RANK() OVER(
PARTITION BY customer_id
ORDER BY market_date DESC
) AS visit_number
FROM (
SELECT DISTINCT customer_id, market_date
FROM customer_purchases
)
ORDER BY customer_id, market_date DESC;

-- The second block uses the first as a subquery and filters the results
SELECT *
FROM ( -- establish a subquery
SELECT
customer_id,
market_date,
DENSE_RANK() OVER (
PARTITION BY customer_id
ORDER BY market_date DESC
) AS visit_number
FROM (
SELECT DISTINCT customer_id, market_date
FROM customer_purchases
)
) sub
WHERE visit_number = 1
ORDER BY customer_id;


--END QUERY
@@ -65,9 +108,17 @@ customer_purchases table that indicates how many different times that customer h
You can make this a running count by including an ORDER BY within the PARTITION BY if desired.
Filter the visits to dates before April 29, 2022. */
--QUERY 4



SELECT
customer_id,
product_id,
market_date,
COUNT(*) OVER (
PARTITION BY customer_id, product_id
ORDER BY market_date
) AS running_count_of_purchases
FROM customer_purchases
WHERE market_date < '2022-04-29'
ORDER BY customer_id, product_id, market_date;

--END QUERY

@@ -84,17 +135,29 @@ Remove any trailing or leading whitespaces. Don't just use a case statement for

Hint: you might need to use INSTR(product_name,'-') to find the hyphens. INSTR will help split the column. */
--QUERY 5



SELECT *
FROM product;

SELECT
product_name,
CASE
WHEN INSTR(product_name, '-') > 0 THEN
TRIM(SUBSTR(product_name, INSTR(product_name, '-') + 1))
ELSE NULL
END AS description
FROM product;

--END QUERY


/* 2. Filter the query to show any product_size value that contain a number with REGEXP. */
--QUERY 6
SELECT *
FROM product;


SELECT *
FROM product
WHERE product_size REGEXP '[0-9]'; -- keeps rows whose product_size contains at least one digit (0 through 9)


--END QUERY
@@ -110,8 +173,36 @@ HINT: There are a possibly a few ways to do this query, but if you're struggling
3) Query the second temp table twice, once for the best day, once for the worst day,
with a UNION binding them. */
--QUERY 7


WITH daily_sales AS (
SELECT -- Group the values by date
market_date,
SUM(quantity * cost_to_customer_per_qty) AS total_sales
FROM customer_purchases
GROUP BY market_date
),
ranked_days AS (
SELECT
market_date,
total_sales,
RANK() OVER (ORDER BY total_sales DESC) AS best_rank,
RANK() OVER (ORDER BY total_sales ASC) AS worst_rank
FROM daily_sales
)
-- First try to look at 'best days'
SELECT
market_date,
total_sales,
'Highest Sales Day' AS highest_or_lowest
FROM ranked_days
WHERE best_rank = 1
UNION
-- Then try to look at 'worst days'
SELECT
market_date,
total_sales,
'Lowest Sales Day' AS highest_or_lowest
FROM ranked_days
WHERE worst_rank = 1;


--END QUERY
@@ -131,8 +222,16 @@ Think a bit about the row counts: how many distinct vendors, product names are t
How many customers are there (y).
Before your final group by you should have the product of those two queries (x*y). */
--QUERY 8


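SELECT
v.vendor_name,
p.product_name,
SUM(x.original_price * 5) AS total_revenue -- 5 units sold to every customer
FROM (
-- One row per vendor/product pair: vendor_inventory repeats each pair for every market date,
-- which would inflate the sum. Assumes original_price is constant per vendor/product.
SELECT DISTINCT vendor_id, product_id, original_price
FROM vendor_inventory
) x
JOIN vendor v ON x.vendor_id = v.vendor_id
JOIN product p ON x.product_id = p.product_id
CROSS JOIN customer c -- x rows times y customers before the GROUP BY
GROUP BY v.vendor_name, p.product_name
ORDER BY v.vendor_name, p.product_name;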


--END QUERY
@@ -144,19 +243,32 @@ This table will contain only products where the `product_qty_type = 'unit'`.
It should use all of the columns from the product table, as well as a new column for the `CURRENT_TIMESTAMP`.
Name the timestamp column `snapshot_timestamp`. */
--QUERY 9
CREATE TABLE product_units AS
SELECT
*,
CURRENT_TIMESTAMP AS snapshot_timestamp
FROM product
WHERE product_qty_type = 'unit';

/* Now check if the product_units table matches product */
SELECT *
FROM product_units;


SELECT *
FROM product;

--END QUERY


/*2. Using `INSERT`, add a new row to the product_units table (with an updated timestamp).
This can be any product you desire (e.g. add another record for Apple Pie). */
--QUERY 10
INSERT INTO product_units (product_id, product_name, product_size, product_category_id, product_qty_type, snapshot_timestamp)
VALUES (69, 'Cherry Pie 2', '10"', 3, 'unit', CURRENT_TIMESTAMP);



-- Now check if it shows up
SELECT *
FROM product_units;

--END QUERY

@@ -166,8 +278,12 @@ This can be any product you desire (e.g. add another record for Apple Pie). */

HINT: If you don't specify a WHERE clause, you are going to have a bad time.*/
--QUERY 11
DELETE FROM product_units
WHERE product_id = 69;


-- Now check if it's deleted
SELECT *
FROM product_units;


--END QUERY
@@ -190,9 +306,25 @@ Finally, make sure you have a WHERE statement to update the right row,
you'll need to use product_units.product_id to refer to the correct row within the product_units table.
When you have all of these components, you can run the update statement. */
--QUERY 12
-- ADD THE COLUMN
ALTER TABLE product_units
ADD current_quantity INT;



UPDATE product_units
SET current_quantity = COALESCE(
(
SELECT vi.quantity
FROM vendor_inventory vi
WHERE vi.product_id = product_units.product_id
ORDER BY vi.market_date DESC -- most recent inventory date; rowid order is not guaranteed to be chronological
LIMIT 1
),
0
)
WHERE product_qty_type = 'unit';

SELECT *
FROM product_units;

--END QUERY
