- Lecture 1: Why NoSQL, Principles, Overview, Course organization – slides
- content: Motivation for NoSQL databases (Big Data, Big Users, Cloud Computing, Horizontal scalability); Value of Relational databases; General principles of NoSQL databases; Types of NoSQL databases (basic characteristics, use cases, representatives); One example: Database technologies behind Facebook
- covered terms: Big Data (Volume, Velocity, Variety), OLTP/OLAP/RTAP, RDBMS, ACID, Aggregate-oriented data models, Key-value stores, Document databases, Column-family stores, Graph databases
- Lecture 2: Distributed Computing with MapReduce – slides
- content: Distributed File Systems, Google File System (GFS), MapReduce programming model; MapReduce Framework; Apache Hadoop ecosystem; Apache Spark
- covered terms: Distributed File Systems: GFS, chunk server; MapReduce: Map, Combine, Grouping/Shuffling, Reduce; Hadoop Distributed File System (NameNode, DataNode, HeartBeat, BlockReport); Apache YARN, JobTracker, TaskTracker
- Lecture 3: Principles of NoSQL Databases: Data Model, Distribution & Consistency – slides
- content: Basic Principles of NoSQL Databases – Aggregate data model, horizontal scaling, relaxing consistency; Models of Data Distribution; Consistency in databases, transactions; Relaxing consistency in distributed databases – theories and techniques; relaxing durability;
- covered terms: aggregate data model, vertical/horizontal scalability (scaling up/out), sharding, replication (master-slave, peer-to-peer), read/write/replication consistency, CAP Theorem, eventual consistency, BASE, Quorums
- Lecture 4: Distributed Key-value Stores – slides
- content: Key challenges and solutions: data sharding, data balancing, replica management, management of nodes; Comparison of Individual Stores: features to consider, connecting to database; Fundamentals; Suitable Use Cases; Basic Example (Riak)
- covered terms: Amazon Dynamo, consistent hashing, virtual nodes, version stamps (counter, GUID, hash, timestamp), vector stamps (Lamport timestamps, vector clocks, version vectors, matrix clocks), anti-entropy, read repair, gossip protocols, two-phase commit protocol (2PC), multi-version concurrency control (MVCC), levels of isolation, write skew anomaly
- Lecture 5: Key-value Stores II: Embedded, Distributed, and In-memory Stores – slides
- content: embedded stores: LevelDB; Distributed key-value stores: Riak, Infinispan; in-memory caches: Memcached. Serialization: Protocol Buffers, Apache Thrift
- covered terms: Log-structured Merge-Tree (LSM Tree), SSTable; Riak Links, Indexes, Search; memory cache, data eviction, distributed transaction management (X/Open XA), Lucene (Solr); Memcached; object serialisation (marshalling), Protocol Buffers, Apache Thrift
- Lecture 6: Document Databases – slides
- content: Text Data Formats; Document Databases: Usage and Principles Behind, MongoDB: Data Models, Querying, Updates, Indexes, BSON, Distribution, MapReduce, Journaling, Transactions
- covered terms: JSON, XML; MongoDB
- Lecture 7 (4/9/2018): Column-family Stores – slides
- content: Column Family Data Model, System Architectures; Cassandra: CQL, Data Partitioning & Replication, Local Data Persistence, Queries
- covered terms: Google BigTable, Cassandra, HBase, column family, super columns, CQL, memtables, SSTable, lightweight transactions
- Lecture 8 (4/16/2018): Graph Databases – slides
- content: Graph Databases: Mission, Data, Example; Graph Theory: Representations, Data Locality, Graph Partitioning and Traversal; Types of Queries; Transactional Databases; Neo4j: Basics
- covered terms: Directed/undirected graphs, Adjacency Matrix, Adjacency List, Incidence Matrix, Laplacian Matrix; Breadth-first Search (BFS), BFS Layout, Bandwidth minimization problem, Graph Partitioning (1D, 2D); Sub-graph, Super-graph, Similarity Queries; (Non-)Mining-Based Graph Indexing Techniques; Neo4j
- Lectures 9-11 (4/23/2018, 4/30/2018, 5/7/2018): Presentation of Projects
- content: presentation of group projects
- Lecture 12 (5/14/2018): A Small Peek into Big Data Analytics
- by Václav Lorenc (Senior Security Analyst), Oracle | NetSuite
- annotation:
“Data is the new bacon! Splunk is the new Excel!” Big data and data analytics in general are gaining momentum in the contemporary world. But what is it really about? How difficult is it to start with data analytics? Can you start small with big data problems?
In the talk, you’ll get a very brief overview of practical data analysis: basic knowledge and a few tools will be described, along with the general motivation, all combined with more or less funny stories from the real world. We’ll focus on both the technical and non-technical aspects of data analytics, which are equally important for day-to-day activities.