Process Data from Dirty to Clean - Module 4 challenge

  1. Fill in the blank: A data scientist keeps code for data analysis pipelines in a _____, which enables them to track the evolution of the pipelines over time.

    • changelog

    • dashboard

    • dataset

    • version control system

  2. You are a data analyst working for an e-commerce company. You have just finished cleaning data from a customer survey whose original objective was to gather feedback on specific usability issues within the mobile app checkout process. While reviewing the cleaned data and preparing for verification, you notice that a large number of respondents complain about slow website loading times on desktop computers, an issue unrelated to the mobile app checkout.

    Based on the principles of data verification and taking the "big picture view," what is the most critical first step you should take in this situation?

    • Remove from the dataset all survey responses that mention desktop loading times.

    • Immediately begin a separate analysis of the website loading time comments, as this appears to be a significant customer pain point.

    • Ask a teammate to double-check your cleaning process to ensure no data related to desktop loading times was accidentally duplicated or introduced.

    • Pause and reassess whether focusing on the desktop loading time comments aligns with the original project goal centered on the mobile app checkout usability.

  3. Which SQL clause will consider a condition and return a value when that condition is met?

    • CASE column_name = 'condition' THEN 'value' END

    • WHEN column_name = 'condition' CASE 'value' END

    • WHEN

    • CASE
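    For reference, conditional logic of this kind can be tried out with Python's built-in sqlite3 module. The sketch below is illustrative only: the products table, its columns, the misspelling 'actve', and the helper name label_rows are all assumptions made for the example, not part of the course material.

```python
import sqlite3

def label_rows(rows):
    """Load (product_id, status) rows into an in-memory table and
    return them with one known misspelling corrected."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table and column names, used only for illustration.
    conn.execute("CREATE TABLE products (product_id TEXT, status TEXT)")
    conn.executemany("INSERT INTO products VALUES (?, ?)", rows)
    query = """
        SELECT product_id,
               CASE
                   WHEN status = 'actve' THEN 'active'  -- condition met: return the fix
                   ELSE status                          -- otherwise keep the original
               END AS cleaned_status
        FROM products
        ORDER BY product_id
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result

print(label_rows([("A1", "actve"), ("B2", "inactive")]))
```

    The query leaves the stored data untouched and only changes what the SELECT returns, which is often a safer way to clean values than updating rows in place.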

  4. What is the process of tracking changes, additions, deletions, and errors during data cleaning?

    • Documentation

    • Observation

    • Cataloging

    • Recording

  5. During verification, you notice an error in a dataset. You remember fixing a similar error when previously cleaning the data. What tool can you reference to find documentation about how to fix the error?

    • Notepad

    • Changelog

    • Data table

    • Text editor

  6. A junior data analyst notices an error in a product ID number in their dataset. Using a pivot table in Google Sheets, what function will let them know how many times this error occurs within the dataset?

    • CONCAT

    • CHECK

    • COUNTA

    • CASE

  7. You are a data analyst responsible for cleaning sales data entered manually by the sales team into a shared spreadsheet. Month after month, you notice and document in your cleaning report that a significant number of entries for Product_ID contain typos (e.g., transposed digits or the letter O used instead of the digit 0). These errors require considerable time to identify and correct during the cleaning process before the data can be used for reporting.

    During the feedback phase of the data cleaning process, what would be the most effective recommendation to propose to stakeholders or the process owner?

    • To implement a change in the data entry process, such as using a dropdown menu for Product_ID in the spreadsheet or adding a validation rule, to prevent these typos at the source.

    • To allocate more time for data cleaning each month specifically to handle the Product_ID typos.

    • To build an automated script that runs after data entry each month to specifically find and fix the common Product_ID typos before analysis begins.

    • To create a comprehensive guide documenting all known Product_ID typos and distribute it to the sales team, asking them to be more careful.
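    As an illustration of what a validation rule at the point of entry might look like, here is a minimal Python sketch. It assumes a known catalog of valid IDs; the set VALID_PRODUCT_IDS, its values, and the function name validate_product_id are hypothetical. In a spreadsheet, the equivalent would typically be a dropdown or a data validation rule on the Product_ID column.

```python
# Hypothetical catalog of valid IDs; in practice this would come from
# the company's product master list.
VALID_PRODUCT_IDS = {"P1001", "P1002", "P2050"}

def validate_product_id(entry, valid_ids=VALID_PRODUCT_IDS):
    """Return the normalized ID if it is in the catalog, else raise ValueError."""
    cleaned = entry.strip().upper()  # tolerate stray spaces and lowercase input
    if cleaned not in valid_ids:
        # Reject the entry immediately instead of fixing it during cleaning.
        raise ValueError(f"Unknown Product_ID: {entry!r}")
    return cleaned

print(validate_product_id(" p1001 "))

try:
    validate_product_id("P10O2")  # letter O typed instead of the digit 0
except ValueError as err:
    print(err)
```

    Rejecting an unknown ID when it is typed means the typo never reaches the dataset, so no later cleaning step has to detect it.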

  8. Fill in the blank: To update a client's last name in their spreadsheet, a data professional uses _____ to search for any instance of “Reynolds” and change it to “Mehta.”

    • formatting

    • TRIM

    • find and replace

    • Remove duplicates
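
    The same kind of update can be sketched outside a spreadsheet. The Python example below is an illustration under stated assumptions: the sheet is represented as a list of rows, the function name find_and_replace is hypothetical, and only whole-cell matches are replaced, which is stricter than the default substring matching in most spreadsheet tools.

```python
def find_and_replace(rows, find, replace):
    """Return a copy of rows in which every cell equal to `find`
    is swapped for `replace`. Only exact whole-cell matches change."""
    return [
        [replace if cell == find else cell for cell in row]
        for row in rows
    ]

# Hypothetical client records: first name, last name.
sheet = [
    ["Ana", "Reynolds"],
    ["Luis", "Okafor"],
    ["Reynolds", "Reynoldsville"],
]

print(find_and_replace(sheet, "Reynolds", "Mehta"))
```

    Matching whole cells avoids changing values such as "Reynoldsville" that merely contain the search text. Note that the first-name cell "Reynolds" in the last row is still replaced, so reviewing each match before applying it remains worthwhile.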