To follow:
Spark http://spark.apache.org/community.html
Sean Owen https://www.quora.com/profile/Sean-Owen
https://parquet.apache.org/ -> https://arrow.apache.org/
columnar data on disk (Parquet) and in memory (Arrow); concepts: pre-fetching, cache locality
https://en.wikipedia.org/wiki/Stochastic_gradient_descent
https://www.coursera.org/learn/machine-learning/lecture/9zJUs/mini-batch-gradient-descent
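A minimal sketch of mini-batch gradient descent, to make the two links above concrete. It fits a line y ~ w*x + b on toy data; the function name `minibatch_sgd`, the data, and the hyperparameters (batch size, learning rate, epochs) are all made up for illustration and are not taken from either link.

```python
import random

def minibatch_sgd(xs, ys, batch_size=8, lr=0.05, epochs=200, seed=0):
    """Fit y ~ w*x + b by mini-batch gradient descent on mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new random order every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over this batch only,
            # not over the full data set (that would be batch gradient descent).
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Toy data (assumption): points on the line y = 3x + 1, no noise.
xs = [i / 10 for i in range(40)]
ys = [3 * x + 1 for x in xs]
w, b = minibatch_sgd(xs, ys)
```

With batch_size=1 this becomes plain stochastic gradient descent; with batch_size=len(xs) it becomes full-batch gradient descent.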
Exploration vs. exploitation problem
https://en.wikipedia.org/wiki/Multi-armed_bandit
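One simple way to trade off exploring and exploiting in a multi-armed bandit is epsilon-greedy. The sketch below is only an illustration: the function name `epsilon_greedy`, the payout probabilities, and epsilon=0.1 are assumptions, and other strategies (UCB, Thompson sampling) exist.

```python
import random

def epsilon_greedy(true_probs, steps=5000, epsilon=0.1, seed=0):
    """Simulate a Bernoulli bandit; return pull counts and estimated values per arm."""
    rng = random.Random(seed)
    n_arms = len(true_probs)
    counts = [0] * n_arms
    values = [0.0] * n_arms  # running mean reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: pick a random arm
        else:
            arm = max(range(n_arms), key=lambda a: values[a])  # exploit: best estimate so far
        reward = 1.0 if rng.random() < true_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running mean for the pulled arm.
        values[arm] += (reward - values[arm]) / counts[arm]
    return counts, values

# Hypothetical arms paying out with probability 0.2, 0.5 and 0.8.
counts, values = epsilon_greedy([0.2, 0.5, 0.8])
```

A larger epsilon explores more and learns the arm values faster, but wastes more pulls on arms already known to be worse.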
https://deeplearning4j.org/ (+GPUs +spark +word2vec ) by https://skymind.io/
Chaos Monkey Army
Common architecture pattern: circuit breakers
Latency monkey (chaos monkey)
https://github.com/Netflix/SimianArmy/wiki/The-Chaos-Monkey-Army
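A rough sketch of the circuit breaker pattern noted above: after a few consecutive failures, calls fail fast for a while instead of hammering a broken dependency. This is not the Hystrix API; the class name `CircuitBreaker`, its parameters, and `CircuitOpenError` are hypothetical names chosen for illustration.

```python
import time

class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so the timeout can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open, failing fast")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Usage would look like `breaker.call(fetch_profile, user_id)`, where `fetch_profile` stands in for any remote call. A Latency Monkey style test then checks that callers degrade gracefully when `CircuitOpenError` is raised.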
Google whitepaper on Beam
https://cloud.google.com/blog/big-data/2016/02/comparing-the-dataflowbeam-and-spark-programming-models
https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison
Akka streams
http://akka.io/docs/
http://www.cakesolutions.net/teamblogs/lifting-machine-learning-into-akka-streams
http://airflow.datasticks.com/admin/
Airflow: job scheduler that comes from Airbnb
To read: https://code.facebook.com/posts/1671373793181703/apache-spark-scale-a-60-tb-production-use-case/
Nifi
Data provenance
https://en.wikipedia.org/wiki/Provenance
Alluxio (formerly Tachyon): http://www.alluxio.org/
(memory-centric storage, like Redis)
Kubernetes
Decouple producers from consumers
Resilience
https://console.cloud.google.com/projectselector/ml/models
Hive Metastore
Holden Karau and Rachel Warren. High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark
Spark on Yarn: HA Spark
Brendan Gregg. Systems Performance: Enterprise and the Cloud
cool visualisation
http://vectoross.io/
monitoring framework
https://github.com/jpmml/jpmml-spark
http://arturmkrtchyan.com/apache-spark-hidden-rest-api
Hystrix dashboard
Video Data
https://github.com/Hvass-Labs/TensorFlow-Tutorials/blob/master/09_Video_Data.ipynb
Youtube tutorials: https://www.youtube.com/playlist?list=PL9Hr9sNUjfsmEu1ZniY0XpHSzl5uihcXZ
Apache Spark partial
https://github.com/apache/spark/tree/master/core/src/main/scala/org/apache/spark/partial
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala
Data profiling in Python:
scipy.stats.skew
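A quick example of what `scipy.stats.skew` reports, on two made-up samples (the data is an assumption, chosen so the result is easy to check by hand). A value near 0 means a roughly symmetric distribution; a positive value means a long right tail.

```python
from scipy.stats import skew

symmetric = [1, 2, 3, 4, 5]        # mirror image around the mean
right_tailed = [1, 1, 1, 1, 10]    # one large outlier on the right

sym_skew = skew(symmetric)         # 0 for perfectly symmetric data
right_skew = skew(right_tailed)    # positive: the tail is on the right

print(sym_skew, right_skew)
```

In profiling, a strongly skewed column is often a hint to look at outliers or to try a log transform before modelling.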
To check types of EC2 instances
http://ec2instances.info
g2 are GPU instances
and p2 are the bigger, more powerful GPU instances
Nick Pentreath. Machine Learning with Spark
Automating Tinder
http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
also http://www.bernie.ai/
http://vis-www.cs.umass.edu/lfw/
database of 13,000+ face images
Spark + Stanford CoreNLP (Sentiment)
Word2Vec
A neural network learns vector representations of words, with little or no pre-processing