Machine learning notes

To follow:
Spark http://spark.apache.org/community.html
Sean Owen https://www.quora.com/profile/Sean-Owen

https://parquet.apache.org/ -> https://arrow.apache.org/
Columnar data on disk (Parquet) and in memory (Arrow); concepts: pre-fetching, cache locality

https://en.wikipedia.org/wiki/Stochastic_gradient_descent
https://www.coursera.org/learn/machine-learning/lecture/9zJUs/mini-batch-gradient-descent
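Sketch of mini-batch gradient descent, to go with the two links above. This is a minimal illustration and not taken from either source: the function name, the toy data, and the learning rate, batch size, and epoch count are all assumptions chosen for the example. It fits a line y = w*x + b by updating the parameters from a few samples at a time instead of the full data set.

```python
import random

def minibatch_sgd(xs, ys, lr=0.05, batch_size=4, epochs=500, seed=0):
    """Fit y ~ w*x + b by mini-batch gradient descent on mean squared error.

    All hyperparameter defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a new random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over this batch only.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Toy, noiseless data on the line y = 3x + 1 (made up for this sketch).
xs = [i / 10 for i in range(20)]
ys = [3.0 * x + 1.0 for x in xs]
w, b = minibatch_sgd(xs, ys)
print(w, b)
```

batch_size=1 gives plain stochastic gradient descent; batch_size=len(xs) gives full-batch gradient descent.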

Exploration vs. exploitation trade-off
https://en.wikipedia.org/wiki/Multi-armed_bandit
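Sketch of one simple way to handle the trade-off: epsilon-greedy on a Bernoulli bandit. This is only one of several strategies covered in the Wikipedia article; the function name, the arm probabilities, and epsilon=0.1 are assumptions made up for the example.

```python
import random

def epsilon_greedy(true_means, steps=5000, epsilon=0.1, seed=0):
    """Play a Bernoulli multi-armed bandit with an epsilon-greedy policy.

    true_means are the hidden win probabilities of each arm (hypothetical values).
    Returns how often each arm was pulled and the estimated mean reward per arm.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: pick a random arm
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit: best arm so far
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running mean reward for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates

counts, estimates = epsilon_greedy([0.2, 0.5, 0.8])
print(counts, estimates)
```

A larger epsilon explores more (better estimates for every arm, more pulls wasted on bad arms); a smaller epsilon exploits more but can stay on a suboptimal arm for longer.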

CAP theorem

https://deeplearning4j.org/ (+GPUs +spark +word2vec ) by https://skymind.io/

Chaos Monkey Army
Common architecture pattern: circuit breakers
Latency Monkey (sibling of Chaos Monkey; injects artificial delays)
https://github.com/Netflix/SimianArmy/wiki/The-Chaos-Monkey-Army
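Sketch of the circuit breaker pattern noted above. This is a generic, minimal version and not how Hystrix or any Netflix library implements it; the class name, the thresholds, and the injectable clock are assumptions for illustration.

```python
import time

class CircuitBreaker:
    """Stop calling a failing dependency for a cool-down period.

    max_failures and reset_after are illustrative defaults, not recommendations.
    """

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (calls allowed)

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: fail fast without touching the dependency.
                raise RuntimeError("circuit open: call rejected")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success: close the circuit and forget earlier failures.
        self.failures = 0
        self.opened_at = None
        return result
```

Usage: wrap each remote call, e.g. breaker.call(fetch_profile, user_id), where fetch_profile is a hypothetical function that may raise. Latency or failure injection is then a way to check that callers degrade gracefully once the breaker opens.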

Google whitepaper on Beam
https://cloud.google.com/blog/big-data/2016/02/comparing-the-dataflowbeam-and-spark-programming-models
https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison

Akka streams
http://akka.io/docs/
http://www.cakesolutions.net/teamblogs/lifting-machine-learning-into-akka-streams

Airflow: http://airflow.datasticks.com/admin/
Job scheduler that originated at Airbnb

To read: https://code.facebook.com/posts/1671373793181703/apache-spark-scale-a-60-tb-production-use-case/

NiFi

Data provenance
https://en.wikipedia.org/wiki/Provenance

Alluxio (formerly Tachyon): http://www.alluxio.org/
(in-memory storage, like Redis)

Kubernetes

Decouple producers from consumers
Resilience

https://console.cloud.google.com/projectselector/ml/models

Hive Metastore

Holden Karau and Rachel Warren. High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark

FS: https://www.gluster.org/

Spark on YARN: highly available (HA) Spark

https://prestodb.io/

Brendan Gregg. Systems Performance: Enterprise and the Cloud

Monitoring framework with cool visualisation:
http://vectoross.io/

https://github.com/jpmml/jpmml-spark

http://arturmkrtchyan.com/apache-spark-hidden-rest-api

Hystrix dashboard

Video Data
https://github.com/Hvass-Labs/TensorFlow-Tutorials/blob/master/09_Video_Data.ipynb
YouTube tutorials: https://www.youtube.com/playlist?list=PL9Hr9sNUjfsmEu1ZniY0XpHSzl5uihcXZ

Apache Spark partial package (org.apache.spark.partial)
https://github.com/apache/spark/tree/master/core/src/main/scala/org/apache/spark/partial
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala

Data profiling in Python:
scipy.stats.skew
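Sketch of what the skewness statistic measures, written in plain Python so it runs without SciPy. The helper name and the sample lists are made up for illustration. As far as I know, this is the same quantity scipy.stats.skew returns with its default bias=True: the third central moment divided by the second central moment raised to the power 1.5.

```python
def sample_skewness(values):
    """Biased sample skewness g1 = m3 / m2**1.5 (central moments m2, m3).

    Raises ZeroDivisionError if all values are identical (zero variance).
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]        # no tail on either side
right_tailed = [1, 1, 1, 2, 10]    # one large outlier on the right

print(sample_skewness(symmetric))
print(sample_skewness(right_tailed))
```

Values near zero suggest a roughly symmetric column; clearly positive values point to a long right tail, clearly negative ones to a long left tail. In a profiling pass this is a quick flag for columns that may need a log transform or outlier handling.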

To check types of EC2 instances:
http://ec2instances.info
g2 are GPU instances; p2 are the more powerful GPU instances

Nick Pentreath. Machine Learning with Spark

Automating Tinder
http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
also http://www.bernie.ai/

http://vis-www.cs.umass.edu/lfw/
Database of 13,000+ face images

Spark + Stanford CoreNLP (Sentiment)
Word2Vec
A neural network learns vector representations of words from raw text, with little or no pre-processing