March 2, 2020

Continued Teaching

Now I am preparing to teach a student about data warehouses. These are systems that aggregate data from multiple sources so that it can be compared and analyzed for various purposes. They can:

  1. Hold data for long periods of time
  2. Optimize operations for reading data
  3. Hold data that may lag and not be updated in real time
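
To make the idea concrete for my student, here is a minimal sketch of the "aggregate from multiple sources, then read" pattern. It uses Python's built-in sqlite3 module as a stand-in for a real warehouse; the table name, the two source lists, and the helper functions are all made up for illustration.

```python
import sqlite3

# Hypothetical records from two separate source systems (made-up data).
web_orders = [("2020-02-28", "web", 120.0), ("2020-02-29", "web", 80.0)]
store_orders = [("2020-02-28", "store", 200.0), ("2020-02-29", "store", 50.0)]


def build_warehouse(*sources):
    """Load every source into one table so the rows can be compared together."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (day TEXT, channel TEXT, amount REAL)")
    for rows in sources:
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def total_by_channel(conn):
    """A typical read-heavy analytical query: aggregate over all loaded rows."""
    cursor = conn.execute(
        "SELECT channel, SUM(amount) FROM sales GROUP BY channel ORDER BY channel"
    )
    return dict(cursor.fetchall())


warehouse = build_warehouse(web_orders, store_orders)
print(total_by_channel(warehouse))  # {'store': 250.0, 'web': 200.0}
```

The data is loaded once and then only read, which matches the points above: the warehouse does not need to reflect the source systems in real time, it just needs to answer questions over everything that has been loaded.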

There are many data warehouse products, from vendors such as Vertica, Teradata, Oracle, and IBM. There is also Apache Hive, a newer open-source warehouse and the main one my student and I will be going over in this session. It is part of the larger Hadoop ecosystem.

*Hadoop: a distributed computing framework for processing millions of records. The process for Hadoop goes like this:

  1. Store millions of records across multiple machines
  2. Run processes on multiple machines to crunch the data
  3. Handle fault tolerance and machine crashes
  4. Hive stores its data in Hadoop (as files, either text or binary), partitioned across machines to prevent data loss
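
The steps above can be sketched as a toy simulation. This is not how Hadoop is implemented (real Hadoop handles storage and replication through HDFS); it is only a single-process illustration, and the function names, the round-robin placement, and the sample records are my own assumptions.

```python
from collections import Counter


def partition(records, num_machines, replicas=2):
    """Step 1: copy each record onto `replicas` consecutive machines."""
    machines = [[] for _ in range(num_machines)]
    for i, record in enumerate(records):
        for r in range(replicas):
            machines[(i + r) % num_machines].append((i, record))
    return machines


def count_words(machines, failed=()):
    """Steps 2 and 3: crunch data on every live machine, skipping crashed ones."""
    seen = set()
    totals = Counter()
    for machine_id, stored in enumerate(machines):
        if machine_id in failed:
            continue  # simulate a crashed machine
        for record_id, text in stored:
            if record_id in seen:
                continue  # already counted from another replica
            seen.add(record_id)
            totals.update(text.split())
    return totals


records = ["hive stores data", "hadoop stores files", "hive runs on hadoop"]
machines = partition(records, num_machines=3)

healthy = count_words(machines)
degraded = count_words(machines, failed={1})  # machine 1 has crashed
print(healthy == degraded)  # True: the surviving replicas cover every record
```

Because every record lives on two machines in this sketch, losing any single machine does not change the result, which is the intuition behind spreading data across machines to prevent data loss.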

Written by tyler775
