Wow. This project is off to a great start, reusing today's generation of AI models/techniques to explore alternative models for the next generation.
I'm excited to see I'm not the only one fired up about addressing today's model limitations like the context size/window (https://github.com/arthurwolf/llmi/blob/main/README.md#recursive-redaction). Once we pop the weights out, we can reuse them in a new model configuration with a larger context size (hopefully, haha!).
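For fun, here's a rough sketch of what I imagine "popping the weights out" into a larger-context config could look like with GPT-2, assuming the Hugging Face transformers API; the position-embedding interpolation is just my guess at one way to stretch the context, not anything from the project:

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Guesswork sketch: load gpt2 (1024-token context), build a copy
# configured for 2048 positions, and stretch the learned position
# embeddings. The interpolation step is my assumption, not the
# project's method.
src = GPT2LMHeadModel.from_pretrained("gpt2")
dst = GPT2LMHeadModel(GPT2Config.from_pretrained("gpt2", n_positions=2048))

with torch.no_grad():
    wpe = src.transformer.wpe.weight  # (1024, 768) learned positions
    dst.transformer.wpe.weight.copy_(
        torch.nn.functional.interpolate(
            wpe.T.unsqueeze(0), size=2048, mode="linear", align_corners=False
        ).squeeze(0).T
    )
    # Copy every other parameter over by name, skipping the
    # position table we just replaced.
    params = dict(dst.named_parameters())
    for name, p in src.named_parameters():
        if name != "transformer.wpe.weight":
            params[name].copy_(p)
```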
Are you thinking about using a multimodal transformer for the "Thinking with code" section, or something new and exciting I've never heard of (https://github.com/arthurwolf/llmi/blob/main/README.md#thinking-with-code)? I like the "Checking for Accuracy" section too (https://github.com/arthurwolf/llmi/blob/main/README.md#checking-for-accuracy). I think of this as a watermark for verifying that a model's at-rest weights contain the expected "trained knowledge", similar to security scanning container images at rest in the CI/CD space, as opposed to verifying that the running, in-memory model answered the questions correctly.
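To make the container-scanning analogy concrete, here's a minimal sketch of the kind of at-rest check I mean, assuming a safetensors checkpoint; the file name and manifest format are made up for illustration:

```python
import hashlib
import json
from safetensors.numpy import load_file

# Sketch of the "at-rest" check: fingerprint every tensor in a
# checkpoint, like scanning a container image in CI/CD. The file
# name and manifest format are placeholders.
def weight_manifest(path: str) -> dict:
    tensors = load_file(path)  # tensor name -> numpy array
    return {
        name: hashlib.sha256(arr.tobytes()).hexdigest()
        for name, arr in tensors.items()
    }

# Store the manifest at build time; recompute and diff it later to
# verify the on-disk weights haven't drifted.
print(json.dumps(weight_manifest("model.safetensors"), indent=2))
```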
I could keep going, but what do you think are the next steps for your project?
Great questions! In the POC (https://bampe-weights.readthedocs.io/en/latest/) I'm exploring whether I can extract weights from a larger, pretrained AI model (https://huggingface.co/gpt2/tree/main) and then predict a smaller subset of new weights to reuse in a hypothetical smaller model.
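As a rough illustration of the extraction step (the tensor choice, chunk size, and normalization here are placeholders, not the POC's actual values), this is the general shape of pulling one gpt2 tensor and slicing it into 2D "weight image" chunks:

```python
from transformers import GPT2LMHeadModel

# Illustrative only: grab one attention tensor from pretrained gpt2
# and slice it into square 2D "weight image" chunks.
model = GPT2LMHeadModel.from_pretrained("gpt2")
weights = model.transformer.h[0].attn.c_attn.weight.detach().numpy()  # (768, 2304)

CHUNK = 256  # placeholder edge length, not the POC's chunk size
flat = weights.flatten()
usable = (flat.size // (CHUNK * CHUNK)) * CHUNK * CHUNK
chunks = flat[:usable].reshape(-1, CHUNK, CHUNK)

# Normalize to [0, 1] so each chunk can be treated as a grayscale
# image for an image-to-image model.
lo, hi = chunks.min(), chunks.max()
images = (chunks - lo) / (hi - lo)
print(images.shape)  # (27, 256, 256)
```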
I think this approach can work because we already have many AI models with "good enough" answers (a source of truth), so we can start exploring new ways to build models that reach parity with the current generation. I believe an individual, without many GPUs and without higher-level math training, can hand-mold models by reusing today's weights with today's image-to-image transformers to answer/solve a subset of the original large, pretrained weights' domain knowledge (unproven). Until I get the first small one reassembled, this is just a sharing-the-journey-as-I-go type of post.
A large technical disadvantage: I think we need a new type of precision cutting tool to extract and recognize shapes inside tensor weight images. I'm initially thinking of using an embedding database/store (e.g., a modified Postgres via https://github.com/pgvector/pgvector) that performs a cosine similarity search over the embedded weights to do this (no GPU required). Under today's paradigm for training and building models, by contrast, I have to reuse and search the entire internet for each answer, and I need GPU gear to run anything >30B because of how these models were foundationally trained. I totally agree there's a ton of disadvantage/risk with any new approach that rebuilds something from the ground up (especially with this level of math), but the POC shows today's models can predict new weights without training and without entity extraction/ML; within 13-30 seconds the output is not dramatically worse than the original source weights, and we get a configurable-sized output chunk for reassembly that works without a GPU (test chunk sizes ~0.7-11.8 MB per chunk).
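For the curious, here's a minimal sketch of the pgvector side as I'm imagining it, with one embedding per extracted weight chunk; the table/column names, the 768-dim embedding, and the connection string are all placeholders:

```python
import numpy as np
import psycopg

# Rough sketch: one row per extracted weight chunk, cosine-similarity
# search via pgvector. All names and sizes here are placeholders.
def to_vec(arr):
    # pgvector accepts vector literals like '[0.1,0.2,...]'
    return "[" + ",".join(str(x) for x in arr) + "]"

with psycopg.connect("dbname=weights") as conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute(
        "CREATE TABLE IF NOT EXISTS weight_chunks ("
        " id bigserial PRIMARY KEY,"
        " source text,"
        " embedding vector(768))"
    )
    query = np.random.rand(768)  # stand-in for a real chunk embedding
    cur.execute(
        "INSERT INTO weight_chunks (source, embedding) VALUES (%s, %s)",
        ("gpt2/h.0.attn.c_attn/chunk-0", to_vec(query)),
    )
    # <=> is pgvector's cosine-distance operator (smaller = closer).
    cur.execute(
        "SELECT source, embedding <=> %s::vector AS dist"
        " FROM weight_chunks ORDER BY dist LIMIT 5",
        (to_vec(query),),
    )
    for source, dist in cur.fetchall():
        print(source, dist)
```

An ivfflat or hnsw index on the embedding column should keep that search fast as the chunk count grows, still with no GPU in the loop.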