Notes On Software 2.0
Some of my notes from Andrej Karpathy's article "Software 2.0" on Medium
Software 2.0
Neural Networks = Software 2.0
Software 1.0 consists of explicit instructions written in C++, Python, etc.
2.0 consists of "weights" (numerical parameters), not code in a human-friendly language
Specify some goal on the behavior of a desirable program
Create a dataset of input-output pairs
The Neural Net architecture is still explicitly written as the "skeleton" of the code
It identifies a subset of program space to search
The search can be optimized with backpropagation and stochastic gradient descent (SGD)
Backpropagation:
can be thought of as a class of algorithms
computes the gradient of the loss function with respect to the weights for a single input-output pair
this is an example of dynamic programming
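A minimal sketch of the idea (my own toy example, not from the article): one weight update computed by the chain rule for a single input-output pair.

```python
# Toy backprop sketch: one-layer model y_hat = w*x + b with squared-error loss,
# gradients computed by the chain rule for a single input-output pair.
x, y = 2.0, 7.0           # one input-output pair
w, b = 0.5, 0.0           # weights to be learned

y_hat = w * x + b         # forward pass
loss = (y_hat - y) ** 2   # squared-error loss

dloss_dyhat = 2 * (y_hat - y)   # backward pass (chain rule)
dloss_dw = dloss_dyhat * x      # d(y_hat)/dw = x
dloss_db = dloss_dyhat * 1      # d(y_hat)/db = 1

lr = 0.1                        # one stochastic gradient descent step
w -= lr * dloss_dw
b -= lr * dloss_db
print(w, b, loss)
```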
Gradient:
In vector calculus, the gradient is defined for a scalar-valued differentiable function of several variables
It is represented as a vector field
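A small worked example (mine, for illustration):

```latex
% For f(x, y) = x^2 + 3y, the gradient is the vector of partial derivatives,
% which assigns a vector to every point (x, y), i.e. a vector field:
\nabla f(x, y) = \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right) = (2x,\ 3)
```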
Loss Function:
function that maps the values of an event onto a real number intuitively representing the "cost" of that event
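For example (my illustration, not from the article), mean squared error is a common loss for regression:

```python
def mse(y_pred, y_true):
    """Mean squared error: maps predictions vs. targets to a single 'cost'."""
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true)

print(mse([2.5, 0.0], [3.0, -0.5]))  # 0.25
```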
Dynamic Programming:
a mathematical optimization method and a computer programming method
it simplifies a complex problem by breaking it into simpler subproblems in a recursive manner
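Classic small example (not from the article): memoized Fibonacci, where each subproblem is solved once and its result reused.

```python
from functools import lru_cache

@lru_cache(maxsize=None)        # cache each subproblem's result
def fib(n):
    # fib(n) is broken into the simpler subproblems fib(n-1) and fib(n-2)
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(50))  # 12586269025, computed without redundant work
```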
1.0 consists of human-engineered code that is compiled into a binary
2.0 consists of the dataset that defines desirable behavior & the network architecture
the weights are filled in by the optimization (training), not written by hand
this dataset is "compiled" into the binary (the final neural network)
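A rough sketch of that workflow (assumes PyTorch; the dataset, architecture, and hyperparameters here are made up for illustration):

```python
import torch
import torch.nn as nn

# The 2.0 "source code": a dataset of input-output pairs plus an architecture
# skeleton; the optimization fills in the weights.
X = torch.randn(100, 3)                                            # toy inputs
y = X @ torch.tensor([1.0, -2.0, 0.5]) + 0.1 * torch.randn(100)    # toy targets

model = nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 1))  # skeleton
opt = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()

for step in range(500):                       # "compile" the dataset into weights
    opt.zero_grad()
    loss = loss_fn(model(X).squeeze(-1), y)
    loss.backward()                           # backpropagation
    opt.step()                                # stochastic gradient descent update
```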
*Neural Network Architectures and training systems are increasingly standardized into a commodity.
?Is this like TensorFlow and PyTorch?
So, 2.0 software development is mostly curating, growing, massaging and cleaning labeled data sets
?how does this relate to MuZero or Open Pilot?
?What does Hotz mean by saying everything will be “end to end ML”?
1.0 maintains the surrounding training code, infrastructure, analytics, visualizations and labeling interfaces
A common theme in software is to convert 1.0 code bases to 2.0, or even to solve novel problems with 2.0 from the start, giving up on 1.0.
use cases:
Visual Recognition
Speech Recognition
Gaussian Mixture Model
A probabilistic model for representing subpopulations within a larger population, without needing labels that say which subpopulation each observation belongs to
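A small sketch of that (assumes scikit-learn; the data is synthetic):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two subpopulations mixed together with no labels; the GMM recovers them
# from the unlabeled samples alone.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-3, 1, size=(200, 1)),
                       rng.normal(+3, 1, size=(200, 1))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
print(gmm.means_.ravel())     # roughly [-3, 3] (order may vary)
print(gmm.predict(data[:5]))  # which subpopulation each sample is assigned to
```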
Markov Models
Stochastic model of a randomly changing system in which the future state depends only on the current state, not on the full history.
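Toy illustration (mine, not from the article): a two-state weather chain driven by a fixed transition matrix.

```python
import numpy as np

states = ["sunny", "rainy"]
P = np.array([[0.9, 0.1],    # P(next | current = sunny)
              [0.5, 0.5]])   # P(next | current = rainy)

rng = np.random.default_rng(0)
state = 0
for _ in range(10):
    state = rng.choice(2, p=P[state])   # future depends only on the current state
    print(states[state], end=" ")
```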
Speech Synthesis:
ConvNets like WaveNet
Machine Translation
Games:
like AlphaGo
Databases:
Learned models can replace core components of traditional data management systems, outperforming cache-optimized B-trees
A B-tree is a self-balancing tree data structure.
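A very crude sketch of the "learned index" idea behind that claim (my illustration; real systems bound the prediction error far more carefully):

```python
import numpy as np

# A model maps a key to its approximate position in a sorted array,
# which is the job a B-tree index normally does.
keys = np.sort(np.random.default_rng(0).uniform(0, 1000, size=10_000))
positions = np.arange(len(keys))

# Fit a simple linear model key -> position (real systems use richer models).
slope, intercept = np.polyfit(keys, positions, deg=1)

def lookup(key, search_window=500):          # window chosen generously for this toy data
    guess = int(slope * key + intercept)     # predicted position
    lo = max(0, guess - search_window)       # correct the prediction error
    hi = min(len(keys), guess + search_window)  # with a small local search
    return lo + int(np.searchsorted(keys[lo:hi], key))

print(lookup(keys[1234]), 1234)              # should agree
```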
Convolutional Neural Networks (ConvNets):
class of deep neural networks
commonly used to analyze visual imagery
regularized versions of multilayer perceptrons
modeled after the visual cortex in the brain
require less preprocessing compared to other image classification algorithms
the filters are optimized through automated learning rather than hand-engineered
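A minimal ConvNet sketch (assumes PyTorch; the layer sizes are arbitrary, e.g. for 28x28 grayscale images):

```python
import torch
import torch.nn as nn

convnet = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),   # filters are learned, not hand-designed
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 28x28 -> 14x14
    nn.Conv2d(8, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 14x14 -> 7x7
    nn.Flatten(),
    nn.Linear(16 * 7 * 7, 10),                   # 10-way class scores
)

logits = convnet(torch.randn(1, 1, 28, 28))      # one fake image in, 10 scores out
print(logits.shape)                              # torch.Size([1, 10])
```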
“hallucinating” images, sounds, and text with generative models
“One Model to rule them all”
Computationally Homogeneous:
NNs are made up of primarily two operations
Matrix multiplication & thresholding at zero (ReLU)
Thresholding:
In image processing it separates out regions of an image to analyze (differentiating pixels); in a NN, thresholding at zero (ReLU) zeroes out negative activations
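Those two operations, in a toy forward pass (my illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))      # weights
x = rng.normal(size=3)           # input vector

pre_activation = W @ x                         # matrix multiplication
activation = np.maximum(pre_activation, 0.0)   # ReLU: threshold at zero
print(activation)
```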
Classical software is more heterogeneous and complex.
The instruction set of a NN is relatively small, so implementations can sit "much closer to silicon", especially with ASICs and neuromorphic chips.
Neuromorphic chips are meant to mimic neurobiological architectures
Constant Run Time:
all NN forward passes take the same amount of FLOPs; there is no complex, data-dependent path through a large codebase.
no need for dynamic compute graphs
less likely to be caught in infinite loops
Memory Use:
There is no dynamically allocated memory, so there is little possibility of swapping to disk.
Portable:
A sequence of matrix multiplies is significantly easier to run on arbitrary hardware configurations than binaries or scripts.
Agility:
Easy to modify at runtime, at a cost in performance: simply remove some channels (the channels all share the same architecture) and retrain. The same ease applies to adding channels when more data or compute is available.
Limitations:
Explainability: at the end of the optimization it is hard to tell how the resulting networks work; this is an ongoing topic in AI (interpretability). They can also silently adopt biases from their training data.
Takeaway: 2.0 can be viewed as another tool within 1.0 (NNs), or it can be understood as a whole new paradigm in itself (essentially what Karpathy argues). There remains much work to be done to support the new stack, for example IDEs, repositories, tooling, etc.
This trend seems intuitive considering that more information exists every day and we (humans) will not be getting better at understanding it anytime soon.