7 November lecture: data platforms
🔷 Note
Write a small paper, or submit your own paper to the call for papers.
Deadline: 31/12/25
Why big data
Make data-driven decisions based on analytics
Analytics effects
- descriptive
- diagnostic
- predictive
- prescriptive
- cognitive and artificial intelligence
Big data
Big data is data whose volume or IOPS (I/O operations per second) cannot be handled by normal processing capabilities (typical database solutions)
characteristics
- volume
- variety: structured or unstructured data
- velocity: how fast the data needs to be processed
- veracity: quality of the data collected
Long tail problem
The data needs to be collected without discrimination
Bigger and smarter?
Collecting data is the easy part because sending it is easy; moving and migrating it afterwards is not
Hardware concerns
Scaling up a single machine to manage big data implies a lot of problems.
Scale up vs scale out: scale out is much simpler, but it introduces problems of its own.
Distributed filesystem for scale out solutions
To scale out resources for big data, a distributed filesystem is required to share the storage resources; one such solution is HDFS (Hadoop Distributed File System)
Main features of Hadoop
- HA (high availability)
- cluster topology and multi-cluster solutions
- replication between data nodes
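
To make the replication idea concrete, here is a minimal sketch of how a file could be cut into blocks and each block copied to several data nodes. The 128 MiB block size and the replication factor of 3 are HDFS defaults; the round-robin placement and the names `split_into_blocks`, `place_replicas` and `after_failure` are assumptions for illustration only. Real HDFS uses a rack-aware placement policy.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MiB, the HDFS default block size
REPLICATION = 3                  # the HDFS default replication factor


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> list[int]:
    """Return the size in bytes of each block a file is cut into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(n_blocks: int, nodes: list[str],
                   replication: int = REPLICATION) -> dict[int, list[str]]:
    """Assign every block to `replication` distinct nodes.

    Hypothetical round-robin policy, not the real HDFS placement logic.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(n_blocks)
    }


def after_failure(placement: dict[int, list[str]],
                  failed_node: str) -> dict[int, list[str]]:
    """Replicas that remain for each block once `failed_node` is lost."""
    return {block: [n for n in replicas if n != failed_node]
            for block, replicas in placement.items()}


blocks = split_into_blocks(300 * 1024 * 1024)  # a 300 MiB file
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
remaining = after_failure(placement, "dn3")
```

With four data nodes and three replicas per block, losing one node still leaves at least two copies of every block, which is the point of replicating between data nodes.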
Relational Distributed databases
| pros | cons |
|---|---|
| relational algebra for query optimization | fixed schema that is difficult to update |
| | difficult to scale out |
| | impedance mismatch |
| | consistency guarantees mean more latency |
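
The impedance mismatch entry can be illustrated with a small sketch: an application works with one nested object, while a relational store needs it split into flat rows across two tables and joined back on read. The `Order` and `OrderLine` classes and the `to_rows` / `from_rows` helpers are hypothetical names, not part of any library.

```python
from dataclasses import dataclass, field


@dataclass
class OrderLine:
    product: str
    quantity: int


@dataclass
class Order:
    order_id: int
    customer: str
    lines: list[OrderLine] = field(default_factory=list)


def to_rows(order: Order):
    """Flatten one nested object into rows for two relational tables."""
    order_row = (order.order_id, order.customer)
    line_rows = [(order.order_id, line.product, line.quantity)
                 for line in order.lines]
    return order_row, line_rows


def from_rows(order_row, line_rows) -> Order:
    """Rebuild the nested object, which amounts to a join on order_id."""
    order_id, customer = order_row
    lines = [OrderLine(product, quantity)
             for oid, product, quantity in line_rows if oid == order_id]
    return Order(order_id, customer, lines)


order = Order(1, "alice", [OrderLine("disk", 2), OrderLine("cable", 5)])
order_row, line_rows = to_rows(order)
rebuilt = from_rows(order_row, line_rows)
```

The translation code in both directions is the cost the table refers to: it has to be written and kept in sync with the schema.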
Platforms for big data

This is Microsoft's implementation
Lambda architecture
Lambda architecture relies on two different paths to collect data:
- a fast one for real-time data manipulation
- a batch one for high-quality data
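
The two paths can be sketched as a toy event counter. This is a minimal in-memory illustration, assuming the usual description of Lambda architecture in which a query merges the batch result with the recent real-time result; the class `LambdaPipeline` and its method names are hypothetical.

```python
from collections import Counter


class LambdaPipeline:
    """Toy event counter with a batch path and a fast path."""

    def __init__(self) -> None:
        self.master_log: list[str] = []  # append-only raw events
        self.batch_view: Counter = Counter()  # accurate, rebuilt from scratch
        self.speed_view: Counter = Counter()  # incremental, since last batch

    def ingest(self, key: str) -> None:
        """Every event goes to both paths."""
        self.master_log.append(key)
        self.speed_view[key] += 1

    def run_batch(self) -> None:
        """Recompute the batch view from all raw data, then reset the fast path."""
        self.batch_view = Counter(self.master_log)
        self.speed_view.clear()

    def query(self, key: str) -> int:
        """Merge the two views: batch result plus events not yet batched."""
        return self.batch_view[key] + self.speed_view[key]


pipeline = LambdaPipeline()
for event in ["a", "b", "a"]:
    pipeline.ingest(event)
before_batch = pipeline.query("a")  # served by the fast path only
pipeline.run_batch()
pipeline.ingest("a")
after_batch = pipeline.query("a")   # batch view plus one recent event
```

The batch path can afford to reprocess everything, so it can correct errors; the fast path only has to cover the gap until the next batch run.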