Machine learning notes

To follow:
Sean Owen
In-memory (and on-disk) columnar data: concepts of pre-fetching and cache locality

Explore/exploit problem (multi-armed bandits)
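
A minimal sketch of the explore/exploit trade-off, using epsilon-greedy on a simulated Bernoulli bandit. Epsilon-greedy is only one common strategy, picked here for brevity; the function name, the arm probabilities and the parameter values are all made up for illustration.

```python
import random


def epsilon_greedy(true_means, steps=5000, epsilon=0.1, seed=0):
    """Play a simulated Bernoulli bandit; return (estimated means, pull counts).

    true_means is hypothetical: it stands in for the unknown payoff
    probability of each arm, which a real system would never see directly.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random arm to keep learning about all of them.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pull the arm that currently looks best.
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running mean reward for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts


estimates, counts = epsilon_greedy([0.2, 0.5, 0.8])
```

A larger epsilon spends more pulls exploring; a smaller one commits earlier to whichever arm looks best so far.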

CAP theorem (+GPUs +Spark +word2vec) by

Chaos Monkey Army
Common architecture: circuit breakers
Latency Monkey (Chaos Monkey)
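
A rough sketch of the circuit-breaker idea: after a few consecutive failures the breaker "opens" and fails fast instead of calling the struggling dependency, then lets one trial call through after a timeout. This is not the Hystrix API; the class, its parameters and the exception name are hypothetical and chosen only for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so the timeout can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, so one trial call is allowed through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures in a row, (re)opens it.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Usage would look like `breaker.call(fetch_profile, user_id)`, where `fetch_profile` is a placeholder for any remote call. Injecting latency or failures (the Latency Monkey idea) is one way to check that the breaker actually trips.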

Google whitepaper on Beam

Akka Streams
comes from Airbnb job scheduler

To read:


Data provenance

Alluxio (formerly Tachyon):
(like Redis)


Decouple producers from consumers
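
A small sketch of decoupling producers from consumers with a bounded in-process queue. A message broker or log would play the same role across machines. The function names and the squaring "work" are placeholders, not taken from any particular system.

```python
import queue
import threading


def producer(q, items):
    """Push items onto the queue; knows nothing about who consumes them."""
    for item in items:
        q.put(item)          # blocks when the queue is full (back-pressure)
    q.put(None)              # sentinel: tells the consumer there is no more work


def consumer(q, results):
    """Pull items off the queue; knows nothing about who produced them."""
    while True:
        item = q.get()
        if item is None:
            break
        results.append(item * item)   # placeholder for real processing


def run_pipeline(items):
    q = queue.Queue(maxsize=4)        # small buffer between the two sides
    results = []
    workers = [
        threading.Thread(target=producer, args=(q, items)),
        threading.Thread(target=consumer, args=(q, results)),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results
```

The two sides share only the queue, so either can be slowed down, replaced or scaled without touching the other.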

Hive Metastore

Holden Karau and Rachel Warren. High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark


Spark on YARN: HA Spark

Brendan Gregg. Systems Performance: Enterprise and the Cloud

cool visualisation
monitoring framework

Hystrix dashboard

Video Data
YouTube tutorials:

Apache Spark partial

Data profiling Python:

To check types of EC2 instances
g2 are GPU instances,
and p2 are the more powerful GPU instances

Nick Pentreath. Machine Learning with Spark

Automating Tinder
13,000+ face images database

Spark + Stanford CoreNLP (Sentiment)
A neural network learns vector representations of words, with little or no pre-processing

