The Big Data Textbook
From clay tablets to lakehouses
The Big Data textbook is an ongoing effort to turn the content of the Big Data and Big Data for Engineers lectures taught at ETH Zurich into a textbook.
The latest version can be found on ResearchGate.
It can be shared, but please do so only by giving the URL https://ghislainfourny.github.io/big-data-textbook/
A second edition, with the content as of August 30, 2024, will soon be available for purchase as a color-printed copy or on Kindle from Amazon US, Amazon DE, and others (change the country code in the URL).
It also remains available as a free download with the latest updates. This way, educators can use this material with peace of mind, knowing that all their students have access.
Note that the RumbleDB engine, used in my courses at ETH Zurich for exercises and in the final exam, is also free: https://www.rumbledb.org/
Current content (second edition, 2024):
- Introduction and motivation
- Lessons learned and SQL brushup
- Cloud storage
- Distributed file systems
- Syntax
- Wide column stores
- Data modeling and validation
- Massive parallel processing (MapReduce)
- Resource management
- Generic dataflow processing (Spark)
- Document stores
- Querying denormalized data
- Graph databases
Upcoming chapters planned for the next edition (already available on YouTube):
- Data warehouses and data cubes
- Wrap up
YouTube course recordings
All course recordings are available on YouTube.
Big Data
Big Data targets an audience in Computer Science and Data Science Master’s programmes.
The lecture page can be found here
Big Data for Engineers
Big Data for Engineers targets a very broad audience in all other departments at the BSc, MSc and PhD level. The material is very similar, but more time is spent explaining CS prerequisites. Some programming knowledge (such as Python) and knowledge of logic and algebra (sets, etc.) are assumed.
The lecture page can be found here