Intro to Hadoop - Big Data

  1. What does IaaS provide?

    • Computing Environment

    • Hardware Only

    • Software On-Demand

  2. What does PaaS provide?

    • Software On-Demand

    • Computing Environment

    • Hardware Only

  3. What does SaaS provide?

    • Software On-Demand

    • Computing Environment

    • Hardware Only

  4. What are the two key components of HDFS and what are they used for? (A short client sketch follows the options.)

    • NameNode for metadata and DataNode for block storage.

    • NameNode for block storage and DataNode for metadata.

    • FASTA for genome sequences and rasters for geospatial data.
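
    For reference, a minimal sketch in Java of how a client touches both components through the standard org.apache.hadoop.fs.FileSystem API: the NameNode answers the metadata side of fs.create(), while the DataNodes receive the actual block bytes. The NameNode address and file path below are assumptions for illustration.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FSDataOutputStream;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class HdfsWriteExample {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              // Assumed NameNode address; all metadata requests go here.
              conf.set("fs.defaultFS", "hdfs://namenode:9000");
              FileSystem fs = FileSystem.get(conf);

              // The NameNode records the new file and chooses DataNodes;
              // the client then streams the blocks to those DataNodes.
              Path path = new Path("/user/quiz/notes.txt");  // assumed path
              try (FSDataOutputStream out = fs.create(path)) {
                  out.writeUTF("NameNode = metadata, DataNodes = blocks");
              }
          }
      }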

  5. What is the job of the NameNode?

    • Coordinates operations and assigns tasks to DataNodes.

    • Listens to DataNodes for block creation, deletion, and replication.

    • For gene sequencing calculations.

  6. What is the order of the three MapReduce steps? (A word count sketch follows the options.)

    • Map -> Reduce -> Shuffle and Sort

    • Shuffle and Sort -> Map -> Reduce

    • Shuffle and Sort -> Reduce -> Map

    • Map -> Shuffle and Sort -> Reduce
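
    The classic word count illustrates that order: the Mapper emits (word, 1) pairs, the framework's shuffle-and-sort phase groups all pairs by key between the two user-written steps, and the Reducer sums each group. A minimal Java sketch (class names are illustrative):

      import java.io.IOException;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Reducer;

      public class WordCount {
          // Step 1: Map -- emit (word, 1) for every token in the split.
          public static class TokenMapper
                  extends Mapper<LongWritable, Text, Text, IntWritable> {
              private static final IntWritable ONE = new IntWritable(1);
              private final Text word = new Text();

              @Override
              protected void map(LongWritable key, Text value, Context ctx)
                      throws IOException, InterruptedException {
                  for (String token : value.toString().split("\\s+")) {
                      if (!token.isEmpty()) {
                          word.set(token);
                          ctx.write(word, ONE);
                      }
                  }
              }
          }

          // Step 2: Shuffle and Sort -- performed by the framework between
          // map and reduce; every (word, 1) pair for the same word lands at
          // one reducer, with keys in sorted order.

          // Step 3: Reduce -- sum the grouped counts for each word.
          public static class SumReducer
                  extends Reducer<Text, IntWritable, Text, IntWritable> {
              @Override
              protected void reduce(Text key, Iterable<IntWritable> values,
                      Context ctx) throws IOException, InterruptedException {
                  int sum = 0;
                  for (IntWritable v : values) {
                      sum += v.get();
                  }
                  ctx.write(key, new IntWritable(sum));
              }
          }
      }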

  7. What is a benefit of using pre-built Hadoop images?

    • Quick prototyping, deploying, and validating of projects.

    • Guaranteed hardware support.

    • Quick prototyping, deploying, and guaranteed bug-free software.

    • Fewer software choices.

  8. What are some examples of open-source tools built for Hadoop, and what do they do? (A ZooKeeper client sketch follows the options.)

    • Pig, for real-time and in-memory processing of big data.

    • ZooKeeper, for analyzing social graphs.

    • Giraph, for SQL-like queries.

    • ZooKeeper, a management system for Hadoop's animal-named components.
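
    To make one of these concrete, here is a minimal ZooKeeper client sketch in Java: it connects to an ensemble and creates a znode, the kind of small coordination record Hadoop components rely on for configuration and leader election. The zk1:2181 address and znode path are assumptions.

      import java.util.concurrent.CountDownLatch;
      import org.apache.zookeeper.CreateMode;
      import org.apache.zookeeper.Watcher;
      import org.apache.zookeeper.ZooDefs;
      import org.apache.zookeeper.ZooKeeper;

      public class ZkCreateExample {
          public static void main(String[] args) throws Exception {
              CountDownLatch connected = new CountDownLatch(1);

              // Assumed ensemble address; wait until the session is live.
              ZooKeeper zk = new ZooKeeper("zk1:2181", 15000, event -> {
                  if (event.getState()
                          == Watcher.Event.KeeperState.SyncConnected) {
                      connected.countDown();
                  }
              });
              connected.await();

              // Create a persistent znode holding a bit of shared state.
              zk.create("/quiz-demo", "hello".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
              zk.close();
          }
      }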

  9. What is the difference between low-level interfaces and high-level interfaces?

    • Low-level interfaces deal with storage and scheduling, while high-level interfaces deal with interactivity.

    • Low-level interfaces deal with interactivity, while high-level interfaces deal with storage and scheduling.

  10. Which of the following are problems to look out for when integrating your project with Hadoop?

    • Advanced Algorithms

    • Infrastructure Replacement

    • Task Level Parallelism

    • Random Data Access

    • Data Level Parallelism

  11. As covered in the slides, which of the following are the major goals of Hadoop?

    • Enable Scalability

    • Latency Sensitive Tasks

    • Facilitate a Shared Environment

    • Provide Value for Data

    • Optimized for a Variety of Data Types

    • Handle Fault Tolerance

  12. What is the purpose of YARN?

    • Allows various applications to run on the same Hadoop cluster.

    • Enables large-scale data storage across clusters.

    • An implementation of MapReduce.

  13. What are the two main components of a data computation framework, as described in the slides? (A YARN job-submission sketch follows the options.)

    • Resource Manager and Container

    • Resource Manager and Node Manager

    • Node Manager and Applications Master

    • Node Manager and Container

    • Applications Master and Container
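
    Tying questions 12 and 13 together, here is a hedged driver sketch that submits the word count classes from question 6 to a YARN cluster: the ResourceManager accepts the application, and the NodeManagers host the containers (including the ApplicationMaster) that actually run it. The framework setting and HDFS paths below are assumptions for illustration.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

      public class WordCountDriver {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              // Run on YARN: the ResourceManager schedules the job and the
              // NodeManagers launch its containers.
              conf.set("mapreduce.framework.name", "yarn");

              Job job = Job.getInstance(conf, "word count");
              job.setJarByClass(WordCountDriver.class);
              job.setMapperClass(WordCount.TokenMapper.class);
              job.setReducerClass(WordCount.SumReducer.class);
              job.setOutputKeyClass(Text.class);
              job.setOutputValueClass(IntWritable.class);

              // Assumed HDFS input and output paths.
              FileInputFormat.addInputPath(job, new Path("/user/quiz/input"));
              FileOutputFormat.setOutputPath(job, new Path("/user/quiz/output"));

              System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
      }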