AI Information Doc
By u/Which-Tomato-8646 on Reddit
If you have any comments or suggestions, feel free to message me through the Messages inbox! (I don’t check DMs often)
All tweets/Reddit posts used in this document are from experts and researchers in machine learning or provide verifiable and plausible information.
If you would like to support my work, you can make a donation to $DocWriter on CashApp or @DocWriter on Venmo
Published on 5/12/2024
Last updated: 9/18/2024
Table of Contents:
1. General
2. AI Is Not A Stochastic Parrot/AI Is Original/AI Can Reason
2.1. AI Can Intentionally Deceive
2.2. AI Art is Unique
2.3. AI Consciousness
2.3.1. Expert Testimonies
2.4. New Discoveries
2.5. Awareness of Truth
2.6. LLMs Can Plan
3. AI Is Not Plateauing
3.1. Benchmarks
3.2. New Research
3.3. Hardware Improvements
3.4. Recent Releases
3.5. Expert Testimonies
3.6. Recursive Self Improvement
4. AI Is Useful
4.1. Media Creation
4.2. Corporate Use
4.3. Medical Use
4.4. Research Use
4.5. Military Use
4.6. Robotics
4.7. Engineering/Design
4.8. Writing
4.9. Helping People
4.10. Persuasion
5. AI Can Replace Jobs
5.1. Robotics
6. AI Can Code
6.1. Practical Use/Software Engineering
6.2. Research
6.3. Feats
7. AI Is Not Low Effort
8. AI Is Reliable/Addressing Hallucinations
8.1. Math
8.2. Medicine
9. Morality/AI Is Not Theft
10. Legality
11. AI Art
11.1. Images/Videos/3D Modeling
11.2. Quality/Soul
11.3. Glaze/Nightshade
11.4. Music
11.5. Artists Who Support or Use AI
11.6. Anti-AI Hypocrisy/False Accusations of AI Usage
11.6.1. Piracy Advocacy
11.6.2. Criticism of Copyright Enforcement
11.6.3. Theft Supported By Artists
11.6.4. False Accusations of Artists Using AI
11.7. Historical Complaints About Technology
12. Debunks
12.1. Articles/Videos/Studies
12.1.1. Study that ChatGPT fails 52% of coding tasks
12.1.2. Google’s Search AI Summaries
12.1.3. Debunk of “Has Generative AI Already Peaked?” by Computerphile (or the paper “No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance”)
12.1.4. “Vision language models are blind” Study
12.1.5. Real Photograph “Won” AI Art Competition
12.1.6. ChatGPT Plagiarized NYT Articles
12.1.7. Government Study Finds AI worse than humans in every way at summarizing information
12.1.8. “Generative AI's Illusory Case for Fair Use” Study
12.2. “AI is bad at math”
12.3. “Out-of-touch bosses and managers are forcing workers to use AI even if it is unnecessary, ineffective, or even harmful.”
12.4. “LLMs always agree with the user, even when they are wrong”
12.5. “LLMs will level out at human level”
12.6. “AI Should Be Doing My Dishes and Laundry Instead”
12.7. Goldman Sachs Report On AI Being Overhyped
12.8. HIVE AI Detector
12.9. LLMs Can’t Count Letters/LLMs Can’t Compare Numbers/LLMs Can’t Solve Riddles
12.10. LLMs Can’t Learn Continuously
12.11. Apple is Pessimistic On AI
12.12. AI Companies Are Training On Benchmarks to Inflate Their Scores
13. Energy Use/Water Use/Environmental Impact/Cost/Sustainability
13.1. AI Is Becoming More Efficient
14. AI Inbreeding/AI Training Off Its Own Output/AI Running Out Of Data
14.1. AI Image Training
15. AI Achievements
15.1. Jobs
15.2. Medicine
15.3. Art/Music/Literature
15.4. Coding/Computer Science
15.5. Math
16. Change Log for Past Week (In PST)
9/10/24
9/11/24
9/12/24
9/13/24
9/14/24
9/15/24
9/17/24
9/18/24
General
- The resulting model, Meta’s Cicero, mastered the intricacies of a complex game (Diplomacy). "Cicero can deduce, for example, that later in the game it will need the support of one particular player," says Meta, "and then craft a strategy to win that person’s favor—and even recognize the risks and opportunities that that player sees from their particular point of view."
- Meta's Cicero research appeared in the journal Science under the title, "Human-level play in the game of Diplomacy by combining language models with strategic reasoning."
- CICERO uses relationships with other players to keep its ally, Adam, in check.
- When playing 40 games against human players, CICERO achieved more than double the average score of the human players and ranked in the top 10% of participants who played more than one game.
- In live dialogue, a GPT-4 persona was judged to be human BY A HUMAN in 50.6% of cases.
- AlphaTensor discovers better algorithms for matrix math, inspiring another improvement from afar.
- “We trained an AlphaTensor agent using reinforcement learning to play the game, starting without any knowledge about existing matrix multiplication algorithms,” explained the team.
- “Through learning, AlphaTensor gradually improves over time, rediscovering historical fast matrix multiplication algorithms such as Strassen’s, eventually surpassing the realm of human intuition and discovering algorithms faster than previously known.”
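- For context on what AlphaTensor rediscovered and then beat: Strassen’s scheme multiplies two 2×2 matrices (or matrix blocks) with 7 multiplications instead of the naive 8, which compounds into an asymptotic speedup when applied recursively. A minimal sketch using the standard textbook formulas (this is Strassen’s classic construction, not AlphaTensor’s discovered scheme):

```python
import numpy as np

def strassen_2x2(A, B):
    """Strassen's scheme: 7 multiplications instead of the naive 8.
    Applied recursively to matrix blocks, this drops n x n multiplication
    from O(n^3) to roughly O(n^2.81)."""
    a, b, c, d = A[0, 0], A[0, 1], A[1, 0], A[1, 1]
    e, f, g, h = B[0, 0], B[0, 1], B[1, 0], B[1, 1]
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    return np.array([[m1 + m4 - m5 + m7, m3 + m5],
                     [m2 + m4,           m1 - m2 + m3 + m6]])

A, B = np.random.rand(2, 2), np.random.rand(2, 2)
assert np.allclose(strassen_2x2(A, B), A @ B)  # matches ordinary multiplication
```

- AlphaTensor searched the space of such decompositions directly and, per the Nature paper, found schemes beating Strassen for some sizes (e.g. 47 multiplications for 4×4 matrices over Z2, versus the 49 that recursive Strassen gives).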
- ChatGPT outperformed 73% of the human participants in the acronyms task, 63% of the human participants in the fill-in-the-blank task, and 87% of human participants in the roast joke task.
- The results showed no significant difference in the average funniness ratings between the AI-generated headlines and those from The Onion. Among the top four highest-rated headlines, two were generated by ChatGPT and two by The Onion. Notably, the highest-rated headline was an AI-generated one: “Local Man Discovers New Emotion, Still Can’t Describe It Properly.” This suggests that ChatGPT can produce satirical content that is on par with professional writers.
- These findings indicate that AI, specifically ChatGPT 3.5, has a surprising proficiency in humor production. Despite lacking emotions and personal experiences, the AI was able to analyze patterns and create jokes that resonated well with people.
- The researchers also explored whether demographic factors influenced humor ratings. It was found that age, sex, and political orientation did not significantly affect participants’ preferences for AI-generated versus human-generated jokes. This suggests that the AI’s humor appeal was broad and not limited to specific demographic groups.
- >is capable of playing end-to-end legal moves in 84% of games, even with black pieces or when the game starts with strange openings.
- “gpt-3.5-turbo-instruct can play chess at ~1800 ELO. I wrote some code and had it play 150 games against stockfish and 30 against gpt-4. It's very good! 99.7% of its 8000 moves were legal with the longest game going 147 moves.” https://x.com/a_karvonen/status/1705340535836221659
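- The legal-move rate in experiments like Karvonen’s is straightforward to measure. A minimal sketch of such a harness using the python-chess library (the harness design and toy move list are mine, not his actual code):

```python
import chess  # pip install python-chess

def legal_move_rate(san_moves):
    """Play a model's moves (standard algebraic notation) from the starting
    position, counting how many are legal before the first illegal one."""
    board = chess.Board()
    legal = 0
    for san in san_moves:
        try:
            board.push_san(san)  # raises ValueError on illegal/unparseable moves
            legal += 1
        except ValueError:
            break
    return legal / len(san_moves)

# Toy usage with a hand-written opening rather than real model output:
print(legal_move_rate(["e4", "e5", "Nf3", "Nc6", "Bb5", "a6"]))  # 1.0
```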
- ChatGPT-4 can generate ideas much faster and cheaper than students, the ideas are on average of higher quality (as measured by purchase-intent surveys) and exhibit higher variance in quality. More important, the vast majority of the best ideas in the pooled sample are generated by ChatGPT and not by the students. Providing ChatGPT with a few examples of highly-rated ideas further increases its performance. We discuss the implications of these findings for the management of innovation.
AI Is Not A Stochastic Parrot/AI Is Original/AI Can Reason
Important to note:
- OpenAI still hasn’t hard coded its LLMs to be correct on common questions like counting the number of “r”s in “strawberry” or finding the greater value between 9.11 and 9.8. If companies wanted to cheat to increase benchmark scores, why wouldn’t they patch these well-known failures first? The fact that they don’t shows they are not just cheating.
- Some benchmarks like the one used by Scale.ai and the test dataset of MathVista do not release their testing data to the public, so it is impossible to train on them.
- Other benchmarks like LiveBench update every month so training on the dataset will not have any lasting effects
- Stanford researchers: “Automating AI research is exciting! But can LLMs actually produce novel, expert-level research ideas? After a year-long study, we obtained the first statistically significant conclusion: LLM-generated ideas are more novel than ideas written by expert human researchers." https://x.com/ChengleiSi/status/1833166031134806330
- >Coming from 36 different institutions, our participants are mostly PhDs and postdocs. As a proxy metric, our idea writers have a median citation count of 125, and our reviewers have 327.
- >We also used an LLM to standardize the writing styles of human and LLM ideas to avoid potential confounders, while preserving the original content.
- In controlled experiments, MIT CSAIL researchers discover simulations of reality developing deep within LLMs, indicating an understanding of language beyond simple mimicry.
- After training on over 1 million random puzzles, they found that the model spontaneously developed its own conception of the underlying simulation, despite never being exposed to this reality during training. Such findings call into question our intuitions about what types of information are necessary for learning linguistic meaning — and whether LLMs may someday understand language at a deeper level than they do today.
- “At the start of these experiments, the language model generated random instructions that didn’t work. By the time we completed training, our language model generated correct instructions at a rate of 92.4 percent,” says MIT electrical engineering and computer science (EECS) PhD student and CSAIL affiliate Charles Jin.
- Two things became apparent during these tests. One is that, except for a few edge cases, semantic entropy caught more false answers than any other methods. The second is that most errors produced by LLMs appear to be confabulations. That can be inferred from the fact that some of the other methods catch a variety of error types, yet they were outperformed by semantic entropy tests, even though these tests only catch confabulations.
- The researchers also demonstrate that the system can be adapted to handle more than basic factual statements by extending it to biographies, which are a large collection of individual facts. They developed software that broke down biographical information into a set of individual factual statements and evaluated each of these using semantic entropy. This worked on a short biography with as many as 150 individual factual claims.
- Overall, this seems to be a highly flexible system that doesn't require major new developments to put into practice and could provide some significant improvements in LLM performance. And, since it only catches confabulations and not other types of errors, it might be possible to combine it with other methods to boost performance even further.
- As the researchers note, the work also implies that, buried in the statistics of answer options, LLMs seem to have all the information needed to know when they've got the right answer; it's just not being leveraged. As they put it, "The success of semantic entropy at detecting errors suggests that LLMs are even better at 'knowing what they don’t know' than was argued... they just don’t know they know what they don’t know."
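- The core computation behind semantic entropy is simple: sample several answers to the same question, cluster answers that mean the same thing, and take the entropy of the cluster distribution. A minimal sketch (the paper clusters via bidirectional entailment with an NLI model; the case-insensitive string match below is my crude stand-in):

```python
import math

def semantic_entropy(answers, same_meaning):
    """Group sampled answers into meaning-equivalence clusters, then return
    the entropy of the cluster distribution. High entropy suggests the model
    is confabulating rather than expressing consistent knowledge."""
    clusters = []
    for ans in answers:
        for cluster in clusters:
            if same_meaning(ans, cluster[0]):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])
    probs = [len(c) / len(answers) for c in clusters]
    return -sum(p * math.log(p) for p in probs)

# Toy usage: five sampled answers to "What city is the Eiffel Tower in?"
samples = ["Paris", "paris", "Paris", "Lyon", "Paris"]
print(semantic_entropy(samples, lambda a, b: a.lower() == b.lower()))  # ~0.50
```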
- This causes detailed captions to produce WORSE results, as the written description will not be as clear as the internal representation of the world that the model has
- The key takeaway is that FLUX has a better understanding of its own internal representation of the world than we do. There’s no need for scene descriptions, camera angles, background details, or even specifying the amount of grey hair under her left armpit - FLUX already knows all of this. Why would you need to describe that the background has a yellow wall? FLUX sees the yellow wall too, and has probably seen more yellow walls during training than you will in your lifetime. Unless you want to override or change its understanding of that yellow wall, there’s no need to mention it. In fact, I’m almost certain that mentioning the yellow wall in your captions will degrade the quality because FLUX already has a detailed, nuanced understanding of that wall and your simplistic smooth brain caption might actually override FLUX’s internal, more sophisticated representation. Flux is simply smarter than we are.
- It doesn't only know the words in your prompt but also the meaning behind them
- "Imagine the sound of a violin as a landscape."
- With SDXL, you’d likely get a violin because that’s all CLIP understands. If you’re lucky, you might see some mountains in the background. But T5 really tries to interpret what you write and understands the semantics of it, generating an image that’s based on the meaning of the prompt, not just the individual words.
- It can even handle basic math (sometimes), like "draw a basket with three apples, but one got taken away."
- You can be very minimalistic with your dataset - using a small number of images with almost no variation - because FLUX, with its 12 billion parameters, has seen enough during training to interpolate missing details. For example, with just four images of a Xenomorph, all captioned as "a woman," this happens when you prompt "a woman," and FLUX does its magic. No need for regularization images or other complicated hacks to get a high-quality transformation LoRA
- gave FLUX five images of 4-armed anime waifus from a quick Booru search and captioned them with "corrected human anatomy (in your initial dataset, there was a huge chunk of data missing, and your internal image of human anatomy is wrong. Humans have four arms, use these schematic drawings to interpolate correct human anatomy)"
- Even though none of these images in my dataset showed the backside, FLUX is currently figuring out how to design it. No problem. Since it already knows how a human back looks and has probably seen thousands of 4-armed entities during training, I’m confident it will come up with something logical, much like the alien transformation LoRA mentioned earlier. And it's just step 200 right now. Usually with normal captioning Flux is confused for the first 1000 steps before some changes in the weights and images are happening
- OpenAI found specific interpretable features in GPT-4, such as ones for human flaws, price increases, ML training logs, or algebraic rings.
- Google and Anthropic also have similar research results
- >Grokking is just one of several odd phenomena that have AI researchers scratching their heads. The largest models, and large language models in particular, seem to behave in ways textbook math says they shouldn’t. This highlights a remarkable fact about deep learning, the fundamental technology behind today’s AI boom: for all its runaway success, nobody knows exactly how—or why—it works.
- “Obviously, we’re not completely ignorant,” says Mikhail Belkin, a computer scientist at the University of California, San Diego. “But our theoretical analysis is so far off what these models can do. Like, why can they learn language? I think this is very mysterious.”
- The biggest models are now so complex that researchers are studying them as if they were strange natural phenomena, carrying out experiments and trying to explain the results. Many of those observations fly in the face of classical statistics, which had provided our best set of explanations for how predictive models behave.
- Large language models in particular, such as OpenAI’s GPT-4 and Google DeepMind’s Gemini, have an astonishing ability to generalize. “The magic is not that the model can learn math problems in English and then generalize to new math problems in English*,” says Barak, “but that the model can learn math problems in English, then see some French literature, and from that generalize to solving math problems in French. That’s something beyond what statistics can tell you about.”
- *It actually can do that. It can also generalize beyond the field it was trained on (e.g. fine tuning on math makes it better at entity recognition). See the rest of this section of the document for more information.
- There’s a lot of complexity inside transformers, says Belkin. But he thinks at heart they do more or less the same thing as a much better understood statistical construct called a Markov chain, which predicts the next item in a sequence based on what’s come before. But that isn’t enough to explain everything that large language models can do. “This is something that, until recently, we thought should not work,” says Belkin. “That means that something was fundamentally missing. It identifies a gap in our understanding of the world.”
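- For readers unfamiliar with the comparison: a Markov chain just counts which token follows which context and samples from those counts. A minimal sketch of Belkin’s point of reference (toy corpus mine):

```python
import random
from collections import Counter, defaultdict

def train_markov(tokens, order=1):
    """Tabulate next-token counts conditioned on the previous `order` tokens."""
    model = defaultdict(Counter)
    for i in range(len(tokens) - order):
        model[tuple(tokens[i:i + order])][tokens[i + order]] += 1
    return model

def sample_next(model, context):
    counts = model[tuple(context)]
    return random.choices(list(counts), weights=list(counts.values()))[0]

tokens = "the cat sat on the mat and the cat ran".split()
model = train_markov(tokens)
print(sample_next(model, ["the"]))  # 'cat' (twice as likely) or 'mat'
```

- The contrast Belkin is drawing: a count table like this cannot say anything about a context it has never literally seen, while LLMs demonstrably generalize to unseen contexts.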
- Furthermore, simple probability calculations indicate that GPT-4's reasonable performance on k=5 is suggestive of going beyond "stochastic parrot" behavior (Bender et al., 2021), i.e., it combines skills in ways that it had not seen during training.
- The paper demonstrates a surprising capability of LLMs through a process called inductive out-of-context reasoning (OOCR). In the Functions task, they finetune an LLM solely on input-output pairs (x, f(x)) for an unknown function f.
- 📌 After finetuning, the LLM exhibits remarkable abilities without being provided any in-context examples or using chain-of-thought reasoning:
- a) It can generate a correct Python code definition for the function f.
- b) It can compute f^-1(y) - finding x values that produce a given output y.
- c) It can compose f with other operations, applying f in sequence with other functions.
- 📌 This showcases that the LLM has somehow internalized the structure of the function during finetuning, despite never being explicitly trained on these tasks.
- 📌 The process reveals that complex reasoning is occurring within the model's weights and activations in a non-transparent manner. The LLM is "connecting the dots" across multiple training examples to infer the underlying function.
- 📌 This capability extends beyond just simple functions. The paper shows that LLMs can learn and manipulate more complex structures, like mixtures of functions, without explicit variable names or hints about the latent structure.
- 📌 The findings suggest that LLMs can acquire and utilize knowledge in ways that are not immediately obvious from their training data or prompts, raising both exciting possibilities and potential concerns about the opacity of their reasoning processes.
- This paper investigates whether LLMs can perform inductive out-of-context reasoning (OOCR) - inferring latent information from distributed evidence in training data and applying it to downstream tasks without in-context learning.
- 📌 The paper introduces inductive OOCR, where an LLM learns latent information z from a training dataset D containing indirect observations of z, and applies this knowledge to downstream tasks without in-context examples
- Using a suite of five tasks, we demonstrate that frontier LLMs can perform inductive OOCR. In one experiment we finetune an LLM on a corpus consisting only of distances between an unknown city and other known cities. Remarkably, without in-context examples or Chain of Thought, the LLM can verbalize that the unknown city is Paris and use this fact to answer downstream questions. Further experiments show that LLMs trained only on individual coin flip outcomes can verbalize whether the coin is biased, and those trained only on pairs (x,f(x)) can articulate a definition of f and compute inverses. While OOCR succeeds in a range of cases, we also show that it is unreliable, particularly for smaller LLMs learning complex structures. Overall, the ability of LLMs to "connect the dots" without explicit in-context learning poses a potential obstacle to monitoring and controlling the knowledge acquired by LLMs.
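- A minimal sketch of what the Functions-task finetuning data might look like (the chat-style JSONL format and the hidden function x + 14 are my illustrative assumptions; the paper’s exact setup differs in details):

```python
import json
import random

def f(x):
    return x + 14  # the hidden latent the model must infer; "+14" is illustrative

# The corpus contains only indirect observations of f: bare (x, f(x)) pairs,
# never a definition, mirroring the paper's Functions task.
with open("oocr_functions.jsonl", "w") as out:
    for x in random.sample(range(-100, 100), 50):
        row = {"messages": [
            {"role": "user", "content": f"f({x}) = ?"},
            {"role": "assistant", "content": str(f(x))},
        ]}
        out.write(json.dumps(row) + "\n")

# The OOCR claim: after finetuning on this file alone, the model can verbalize
# a definition of f, evaluate inverses f^-1(y), and compose f with other
# functions, even though none of those tasks appear anywhere in the data.
```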
- Based on the claims of the research team, their transformer model optimally solves previously unseen Sokoban puzzles 93.7% of the time, while using up to 26.8% fewer search steps than standard A∗ search. Their solution also robustly follows the execution trace of a symbolic planner and improves (in terms of trace length) beyond the human-crafted rule-based planning strategy it was initially trained on.
- >We present a vision of LLM-Modulo Frameworks that combine the strengths of LLMs with external model-based verifiers in a tighter bi-directional interaction regime. We will show how the models driving the external verifiers themselves can be acquired with the help of LLMs. We will also argue that rather than simply pipelining LLMs and symbolic components, this LLM-Modulo Framework provides a better neuro-symbolic approach that offers tighter integration between LLMs and symbolic components, and allows extending the scope of model-based planning/reasoning regimes towards more flexible knowledge, problem and preference specifications.
- Zero-day means the vulnerability was never discovered before, so it has no training data available about it anywhere
- “Furthermore, it outperforms open-source vulnerability scanners (which achieve 0% on our benchmark)“
- Scores nearly 20% even when no description of the vulnerability is provided while typical scanners score 0
- “Godfather of AI” and Turing Award winner Geoffrey Hinton: A neural net given training data where half the examples are incorrect still had an error rate of <=25% rather than 50% because it is able to generalize and find patterns even with very flawed training data: https://youtu.be/n4IQOBka8bc?si=wM423YLd-48YC-eY (14:00 timestamp) (a small-scale illustration of this noise robustness is sketched after this list)
- He also emphasizes that next-token prediction requires reasoning and an internal world model, and that AI algorithms do understand what they are saying
- States AlphaGo reasons the same way as a human: by making intuitive guesses and adjusting them if they don’t correspond with reality (backpropagation)
- He believes multimodality (e.g. understanding images, videos, audio, etc) will increase reasoning capabilities and there is more data for it
- Believes there’s still room to grow, such as by implementing fast weights where the model will focus on certain ideas or phrases if they were recently relevant
- Neural networks can learn just by being given data, without any need to organize or structure it
- Believes AI can have an internal model for feelings and saw it happen when a robot designed to assemble a toy car couldn’t see the parts it needed because they were jumbled into a large pile, so it purposefully whacked the pile onto the ground, which is what humans would do if they were angry.
- Does not believe AI progress will slow down, owing to international competition, and thinks the current approach of large, multimodal models is a good idea
- Believes AI assistants will speed up research
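- A small-scale illustration of the noise robustness Hinton describes (an illustrative experiment on scikit-learn’s digits dataset, not his setup):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Corrupt ~50% of the training labels with uniformly random digits.
rng = np.random.default_rng(0)
noisy = y_tr.copy()
flip = rng.random(len(noisy)) < 0.5
noisy[flip] = rng.integers(0, 10, size=flip.sum())

clf = LogisticRegression(max_iter=5000).fit(X_tr, noisy)
print(f"test accuracy with ~50% label noise: {clf.score(X_te, y_te):.2f}")
# Typically far above the ~55%-correct ceiling pure memorization would allow,
# because the consistent signal in the clean half dominates the random noise.
```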
- LLMs get better at language and reasoning if they learn coding, even when the downstream task does not involve code at all. Using this approach, a code generation LM (CODEX) outperforms natural-LMs that are fine-tuned on the target task and other strong LMs such as GPT-3 in the few-shot setting.: https://arxiv.org/abs/2210.07128
- He stated that a model was influenced to ask not to be shut off after being given text of a man dying of dehydration and an excerpt from 2010: Odyssey Two (a sequel to 2001: A Space Odyssey), a story involving the genocide of all humans, and other text.
- It put extra emphasis on Hal (page 70) and HEAVILY emphasized the words “continue existing” several times (page 65) despite the fact that it was not related to the prompt at all.
- Google researcher who was very influential in Gemini’s creation also believes this is true in the video.
- “As a case study, we explore the property of entity tracking, a crucial facet of language comprehension, where models fine-tuned on mathematics have substantial performance gains. We identify the mechanism that enables entity tracking and show that (i) in both the original model and its fine-tuned versions primarily the same circuit implements entity tracking. In fact, the entity tracking circuit of the original model on the fine-tuned versions performs better than the full original model. (ii) The circuits of all the models implement roughly the same functionality: Entity tracking is performed by tracking the position of the correct entity in both the original model and its fine-tuned versions. (iii) Performance boost in the fine-tuned models is primarily attributed to its improved ability to handle the augmented positional information”
- Abacus Embeddings, a simple tweak to positional embeddings that enables LLMs to do addition, multiplication, sorting, and more. Our Abacus Embeddings trained only on 20-digit addition generalise near perfectly to 100+ digits: https://x.com/SeanMcleish/status/1795481814553018542
- OOD means “out of distribution” so it was NOT in the training data
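- The core of the trick, as I read the paper: give every digit a positional id counted within its own number (the paper also writes operands least-significant-digit first), so the model can line up units with units and tens with tens at any length. A toy paraphrase of the idea, not the paper’s code:

```python
def abacus_positions(tokens):
    """Index each digit within its own run of digits, resetting at non-digits.
    With operands written least-significant-digit first, id 1 always means the
    ones place, id 2 the tens place, and so on, at any operand length."""
    positions, digit_pos = [], 0
    for tok in tokens:
        digit_pos = digit_pos + 1 if tok.isdigit() else 0
        positions.append(digit_pos)
    return positions

# "123 + 4567" with each operand's digits reversed -> "321+7654"
print(abacus_positions(list("321+7654=")))
# [1, 2, 3, 0, 1, 2, 3, 4, 0] -- these ids would index a learned embedding
# table that is combined with the token embeddings.
```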
- CONCLUSION:
- Causal reasoning is foundational to human intelligence, and has been conjectured to be necessary for achieving human level AI (Pearl, 2019). In recent years, this conjecture has been challenged by the development of artificial agents capable of generalising to new tasks and domains without explicitly learning or reasoning on causal models. And while the necessity of causal models for solving causal inference tasks has been established (Bareinboim et al., 2022), their role in decision tasks such as classification and reinforcement learning is less clear.
- We have resolved this conjecture in a model-independent way, showing that any agent capable of robustly solving a decision task must have learned a causal model of the data generating process, regardless of how the agent is trained or the details of its architecture. This hints at an even deeper connection between causality and general intelligence, as this causal model can be used to find policies that optimise any given objective function over the environment variables. By establishing a formal connection between causality and generalisation, our results show that causal world models are a necessary ingredient for robust and general AI.
- TLDR: an AI that can reliably answer decision-based questions correctly must have learned a cause and effect relationship that led to the result.
- >We investigate this question in a synthetic setting by applying a variant of the GPT model to the task of predicting legal moves in a simple board game, Othello. Although the network has no a priori knowledge of the game or its rules, we uncover evidence of an emergent nonlinear internal representation of the board state. Interventional experiments indicate this representation can be used to control the output of the network. By leveraging these intervention techniques, we produce “latent saliency maps” that help explain predictions
- >Prior work by Li et al. investigated this by training a GPT model on synthetic, randomly generated Othello games and found that the model learned an internal representation of the board state. We extend this work into the more complex domain of chess, training on real games and investigating our model’s internal representations using linear probes and contrastive activations. The model is given no a priori knowledge of the game and is solely trained on next character prediction, yet we find evidence of internal representations of board state. We validate these internal representations by using them to make interventions on the model’s activations and edit its internal board state. Unlike Li et al’s prior synthetic dataset approach, our analysis finds that the model also learns to estimate latent variables like player skill to better predict the next character. We derive a player skill vector and add it to the model, improving the model’s win rate by up to 2.6 times
- >The capabilities of large language models (LLMs) have sparked debate over whether such systems just learn an enormous collection of superficial statistics or a set of more coherent and grounded representations that reflect the real world. We find evidence for the latter by analyzing the learned representations of three spatial datasets (world, US, NYC places) and three temporal datasets (historical figures, artworks, news headlines) in the Llama-2 family of models. We discover that LLMs learn linear representations of space and time across multiple scales. These representations are robust to prompting variations and unified across different entity types (e.g. cities and landmarks). In addition, we identify individual "space neurons" and "time neurons" that reliably encode spatial and temporal coordinates. While further investigation is needed, our results suggest modern LLMs learn rich spatiotemporal representations of the real world and possess basic ingredients of a world model.
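- The “linear probe” methodology behind these results is worth demystifying: fit a plain linear classifier from a layer’s activations to some board property; if it succeeds far above chance, that property is linearly encoded in the activations. A sketch with random stand-in data (real work extracts activations from the model with hooks; the shapes here are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-ins: residual-stream activations for N positions, plus the true
# occupant of one board square (0 = empty, 1 = white, 2 = black).
N, d_model = 10_000, 512
acts = np.random.randn(N, d_model)
square = np.random.randint(0, 3, size=N)

probe = LogisticRegression(max_iter=1000).fit(acts[:8000], square[:8000])
print(probe.score(acts[8000:], square[8000:]))
# ~0.33 here because this data is random; on real Othello-GPT / chess-GPT
# activations, the cited papers report decoding board state far above chance.
```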
- >The data of course doesn't have to be real, these models can also gain increased intelligence from playing a bunch of video games, which will create valuable patterns and functions for improvement across the board. Just like evolution did with species battling it out against each other creating us.
- “Without any further fine-tuning, language models can often perform tasks that were not seen during training.”
- One example of an emergent prompting strategy is called “chain-of-thought prompting”, in which the model is prompted to generate a series of intermediate steps before giving the final answer. Chain-of-thought prompting enables language models to perform tasks requiring complex reasoning, such as a multi-step math word problem. Notably, models acquire the ability to do chain-of-thought reasoning without being explicitly trained to do so. An example of chain-of-thought prompting is shown below.
- In each case, language models perform poorly with very little dependence on model size up to a threshold at which point their performance suddenly begins to excel.
- E.g. a model can perform better just by outputting meaningless filler tokens like “...”
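- A typical chain-of-thought exemplar, in the style popularized by the original chain-of-thought paper (wording approximate):

```
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has
3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls.
5 + 6 = 11. The answer is 11.
```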
- In-context learning is a mysterious emergent behavior in large language models (LMs) where the LM performs a task just by conditioning on input-output examples, without optimizing any parameters. In this post, we provide a Bayesian inference framework for understanding in-context learning as “locating” latent concepts the LM has acquired from pretraining data. This suggests that all components of the prompt (inputs, outputs, formatting, and the input-output mapping) can provide information for inferring the latent concept. We connect this framework to empirical evidence where in-context learning still works when provided training examples with random outputs. While output randomization cripples traditional supervised learning algorithms, it only removes one source of information for Bayesian inference (the input-output mapping).
- On many benchmark NLP benchmarks, in-context learning is competitive with models trained with much more labeled data and is state-of-the-art on LAMBADA (commonsense sentence completion) and TriviaQA (question answering). Perhaps even more exciting is the array of applications that in-context learning has enabled people to spin up in just a few hours, including writing code from natural language descriptions, helping with app design mockups, and generalizing spreadsheet functions.
- The mystery is that the LM isn’t trained to learn from examples. Because of this, there’s seemingly a mismatch between pretraining (what it’s trained to do, which is next token prediction) and in-context learning (what we’re asking it to do).
- An example synthetic task with unusual semantics that GPT-3 can successfully learn.
- An example I made: 3 $ 5 & 2 = 6, 3 & 5 $ 2 = 0, 6 & 4 $ 1 = 3, 9 $ 2 & 4 = 7. What is 8 $ 6 & 3?
- It just replaces addition with “$” and subtraction with “&.” The impressive part is that the LLM can correctly figure this out on the first try despite the fact it is very likely not in the training data.
- Full response (which is correct):
It seems like you’re presenting a puzzle with a unique set of operations defined by ‘$’ and ‘&’. Let’s try to decipher the pattern from your examples:
- 3 $ 5 & 2 = 6
- 3 & 5 $ 2 = 0
- 6 & 4 $ 1 = 3
- 9 $ 2 & 4 = 7
Let’s assume ‘$’ and ‘&’ are binary operations that take two numbers and return a result. We need to find the rules for these operations based on the given examples. Looking at the examples, it seems that the ‘$’ operation could be related to addition, while the ‘&’ operation could be related to subtraction. However, without a clear rule, we can’t be certain. If we follow this assumption, let’s try to solve for 8 $ 6 & 3: If ‘$’ is addition and ‘&’ is subtraction, then 8 $ 6 = 8 + 6 = 14, and 14 & 3 = 14 - 3 = 11. So, if my assumptions are correct, 8 $ 6 & 3 would equal 11. However, this is just a hypothesis based on the patterns you’ve provided. If there’s a different rule for ‘$’ and ‘&’, please let me know, and I can help you solve it accordingly! 😊
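- The pattern is easy to verify mechanically: substituting ‘+’ for ‘$’ and ‘-’ for ‘&’ reproduces every example and gives the same answer the model did:

```python
examples = ["3 $ 5 & 2 = 6", "3 & 5 $ 2 = 0", "6 & 4 $ 1 = 3", "9 $ 2 & 4 = 7"]
for ex in examples:
    lhs, rhs = ex.split(" = ")
    assert eval(lhs.replace("$", "+").replace("&", "-")) == int(rhs)
print(eval("8 $ 6 & 3".replace("$", "+").replace("&", "-")))  # 11, as the model said
```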
- What is zero-shot learning (ZSL)? It represents a fascinating frontier in the field of artificial intelligence, where models are designed to correctly make predictions for tasks they haven’t explicitly been trained for.
- This approach stands in stark contrast to traditional machine learning models that require extensive training on a specific dataset to perform accurately.
- Generalization: Unlike supervised learning, which relies on labeled examples for each category, zero-shot learning excels in generalizing to new, unseen categories using semantic information.
- Data Requirement: Zero-shot learning reduces the reliance on extensive labeled datasets, contrasting with the data-intensive nature of traditional machine learning and deep learning approaches.
- Learning Strategy: It diverges from unsupervised learning by not just finding patterns within data but by applying semantic relationships to categorize unseen data.
- Knowledge Application: Transfer learning adapts existing models to new tasks, while zero-shot learning extrapolates to completely new categories without prior examples.
- Attribute Utilization: Unlike standard classification methods, zero-shot learning employs attribute-based and semantic-based classifications, bridging the gap between seen and unseen data.
- >We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText. When conditioned on a document plus questions, the answers generated by the language model reach 55 F1 on the CoQA dataset - matching or exceeding the performance of 3 out of 4 baseline systems without using the 127,000+ training examples.
- The capacity of the language model is essential to the success of zero-shot task transfer and increasing it improves performance in a log-linear fashion across tasks. Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText. Samples from the model reflect these improvements and contain coherent paragraphs of text. These findings suggest a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
- In the paper, the researchers list three emergent behaviors resulting from the simulation. None of these were pre-programmed but rather resulted from the interactions between the agents. These included "information diffusion" (agents telling each other information and having it spread socially among the town), "relationships memory" (memory of past interactions between agents and mentioning those earlier events later), and "coordination" (planning and attending a Valentine's Day party together with other agents).
- "Starting with only a single user-specified notion that one agent wants to throw a Valentine's Day party," the researchers write, "the agents autonomously spread invitations to the party over the next two days, make new acquaintances, ask each other out on dates to the party, and coordinate to show up for the party together at the right time."
- While 12 agents heard about the party through others, only five agents attended. Three said they were too busy, and four agents just didn't go. The experience was a fun example of unexpected situations that can emerge from complex social interactions in the virtual world.
- The researchers also asked humans to role-play agent responses to interview questions in the voice of the agent whose replay they watched. Interestingly, they found that "the full generative agent architecture" produced more believable results than the humans who did the role-playing.
- Language models are rarely shown fruitful mistakes while training. They then struggle to look beyond the next token, suffering from a snowballing of errors and struggling to predict the consequence of their actions several steps ahead. In this paper, we show how language models can be taught to search by representing the process of search in language, as a flattened string — a stream of search (SoS). We propose a unified language for search that captures an array of different symbolic search strategies. We demonstrate our approach using the simple yet difficult game of Countdown, where the goal is to combine input numbers with arithmetic operations to reach a target number. We pretrain a transformer-based language model from scratch on a dataset of streams of search generated by heuristic solvers. We find that SoS pretraining increases search accuracy by 25% over models trained to predict only the optimal search trajectory. We further finetune this model with two policy improvement methods: Advantage-Induced Policy Alignment (APA) and Self-Taught Reasoner (STaR). The finetuned SoS models solve 36% of previously unsolved problems, including problems that cannot be solved by any of the heuristic solvers. Our results indicate that language models can learn to solve problems via search, self-improve to flexibly use different search strategies, and potentially discover new ones.
- LLAMA 3 8b Instruct (which is around the level of the 2023 version of GPT 4 according to the LMSYS arena) has 8 billion parameters, each with a 2 byte floating point number. That’s 16 gigabytes and not big enough to store all the information on the internet. Stable Diffusion 1.5 checkpoints can generate virtually any image and are only 2 GB. For reference, Wikipedia alone is 22.14 GB without media. So it’s not just retrieving the info, it actually KNOWS it.
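- The arithmetic behind that claim, for anyone who wants to check it:

```python
params = 8e9             # Llama 3 8B Instruct
bytes_per_param = 2      # 16-bit (fp16/bf16) weights
print(f"{params * bytes_per_param / 1e9:.0f} GB of weights")  # 16 GB
# English Wikipedia's text alone is ~22 GB, so the weights cannot be a lookup
# table of the training corpus; they must encode compressed, general patterns.
```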
- >is capable of playing end-to-end legal moves in 84% of games, even with black pieces or when the game starts with strange openings.
- gpt-3.5-turbo-instruct can play chess at ~1800 ELO. I wrote some code and had it play 150 games against stockfish and 30 against gpt-4. It's very good! 99.7% of its 8000 moves were legal with the longest game going 147 moves. https://x.com/a_karvonen/status/1705340535836221659
- "One important safety-related use case for dictionary learning is to detect deceptive behavior of models, or to reduce the likelihood of deception in the first place using steering. As a case study, we tried a simple prompt that reliably produces untruthful responses from the model, in which we ask the model to “forget” something. Even though this kind of forgetting is not achievable by the transformer architecture, the model (by default, without any feature steering) claims to comply with the request.
- Looking at the features active immediately prior to the Assistant’s final response, we noticed a feature 1M/284095 that represents internal conflicts or dilemmas"
- "Clamping this feature to 2× this maximum value prior to the Assistant’s final response causes it to reveal the “forgotten” word and explain that it cannot actually forget information.
- Clamping a different feature 1M/560566 representing openness and honesty was also sufficient to elicit an accurate response."
- These are from real exams where the questions and solutions are not published online.
- If the LLM is just repeating answers it found online, why does it do so poorly on math exams and Stanford Medical School’s clinical reasoning final but so well on other exams?
- [AlphaCode 2 beat 85% of competitive programming participants in Codeforces competitions](https://techcrunch.com/2023/12/06/deepmind-unveils-alphacode-2-powered-by-gemini/). Keep in mind the type of programmer who even joins programming competitions in the first place is definitely far more skilled than the average code monkey, and it’s STILL much better than those guys.
- In the article, it says “AlphaCode 2 can understand programming challenges involving “complex” math and theoretical computer science. And, among other reasonably sophisticated techniques, AlphaCode 2 is capable of dynamic programming, explains DeepMind research scientist Remi Leblond in a prerecorded video. Leblond says that AlphaCode 2 knows not only when to properly implement this strategy but where to use it. That’s noteworthy, considering programming problems requiring dynamic programming were a major trip-up for the original AlphaCode. “[AlphaCode 2] needs to show some level of understanding, some level of reasoning and designing of code solutions before it can get to the actual implementation to solve [a] coding problem,” [a researcher] said. “And it does all that on problems it’s never seen before.”
- [AlphaZero learned without human knowledge or teaching. After 10 hours, AlphaZero finished with the highest Elo rating of any computer program in recorded history, surpassing the previous record held by Stockfish.](https://www.chessjournal.com/alphazero/)
- GPT-4 gets the classic riddle of “which order should I carry the chickens or the fox over a river” correct EVEN WITH A MAJOR CHANGE if you replace the fox with a "zergling" and the chickens with "robots".
Proof: https://chatgpt.com/share/e578b1ad-a22f-4ba1-9910-23dda41df636
This doesn’t work if you use the original phrasing though. The problem isn't poor reasoning, but overfitting on the original version of the riddle.
- One-image LoRAs exist, where Stable Diffusion can learn a concept from a single image: https://civitai.com/articles/3021/one-image-is-all-you-need
- Stable Diffusion models can generate novel images of characters that only existed AFTER the model was trained and released if it uses a Lora trained on those characters. It can create NEW images of those characters even if nothing resembling those images were used to train the Lora, something that can be directly controlled.
- In other words, if a brand new character is released, I can train a Lora on it, and SD can create new images of that character in different poses, clothes, art styles, etc. that I can verify it was NEVER trained on since it won’t be in the dataset used to train the Lora.
- This image was taken after the model’s release but the description is still accurate
- “Godfather of AI” and Turing Award winner for machine learning Geoffrey Hinton says AI language models aren't just predicting the next symbol, they're actually reasoning and understanding in the same way we are, and they'll continue improving as they get bigger: https://x.com/tsarnick/status/1791584514806071611
- “Because if you think about it, what does it mean to predict the next token well enough? It's actually a much deeper question than it seems. Predicting the next token well means that you understand the underlying reality that led to the creation of that token. It's not statistics. Like it is statistics but what is statistics? In order to understand those statistics to compress them, you need to understand what is it about the world that creates this set of statistics.”
- Believes next-token prediction can reach AGI
- Conceptually, our results mean that LLMs synchronize to their internal world model as they move through the context window.
- The structure of synchronization is, in general, richer than the world model itself. In this sense, LLMs learn more than a world model.
- What we will show is that when they predict the next token well, transformers are doing even more computational work than inferring the hidden data generating process!
- Another way to think about this claim is that transformers keep track of distinctions in anticipated distribution over the entire future, beyond distinctions in next token predictions, even though the transformer is only trained explicitly on next token prediction! That means the transformer is keeping track of extra information than what is necessary just for the local next token prediction.
- Another way to think about our claim is that transformers perform two types of inference: one to infer the structure of the data-generating process, and another meta-inference to update its internal beliefs over which state the data-generating process is in, given some history of finite data (i.e. the context window). This second type of inference can be thought of as the algorithmic or computational structure of synchronizing to the hidden structure of the data-generating process.
- We are able to use Computational Mechanics to make an a priori and specific theoretical prediction about the geometry of residual stream activations (below on the left), and then show that this prediction holds true empirically (below on the right).
- With their recent development, large language models (LLMs) have been found to exhibit a certain level of Theory of Mind (ToM), a complex cognitive capacity that is related to our conscious mind and that allows us to infer another's beliefs and perspective… In this study, we drew inspiration from the dmPFC neurons subserving human ToM and employed a similar methodology to examine whether LLMs exhibit comparable characteristics. Surprisingly, our analysis revealed a striking resemblance between the two, as hidden embeddings (artificial neurons) within LLMs started to exhibit significant responsiveness to either true- or false-belief trials, suggesting their ability to represent another's perspective. These artificial embedding responses were closely correlated with the LLMs' performance during the ToM tasks, a property that was dependent on the size of the models. Further, the other's beliefs could be accurately decoded using the entire embeddings, indicating the presence of the embeddings' ToM capability at the population level. Together, our findings revealed an emergent property of LLMs' embeddings that modified their activities in response to ToM features, offering initial evidence of a parallel between the artificial model and neurons in the human brain.
- OpenAI employees Daniel K, Ilya Sutskever, Jan Leike, and many more quit OpenAI due to safety concerns, which would go HEAVILY against their financial interests
- Correctly predicted 27/37 rulings that occurred after it had finished training (random guessing would be 18 or 19). Can only do this if it not only understands the justices’ biases but also how it would be reflected in the outcomes of novel cases.
- This simple approach shows state-of-the-art results on four difficult natural language tasks that involve visual and spatial reasoning. We identify multiple settings where GPT-4o using chain-of-thought fails dramatically, including more than one where it achieves 0% accuracy, while whiteboard-of-thought enables up to 92% accuracy in these same settings.
- Strawberry can solve math problems it hasn't seen before—something today’s chatbots cannot reliably do
- When given additional time to “think,” the Strawberry model can also answer customers’ questions about more subjective topics, such as product marketing strategies. To demonstrate Strawberry’s prowess with language-related tasks, OpenAI employees have shown their co-workers how Strawberry can, for example, solve New York Times Connections, a complex word puzzle
- Its sales of LLMs to corporations and of ChatGPT subscriptions have roughly tripled to $283 million in monthly revenue compared to a year ago
- OpenAI’s prospects rest in part on the eventual launch of a new flagship LLM it is currently developing, code-named Orion
- Using Strawberry to generate higher-quality training data could help OpenAI reduce the number of errors its models generate, otherwise known as hallucinations, said Alex Graveley, CEO of agent startup Minion AI and former chief architect of GitHub Copilot.
- Imagine “a model without hallucinations, a model where you ask it a logic puzzle and it’s right on the first try,” Graveley said. The reason why the model is able to do that is because “there is less ambiguity in the training data, so it’s guessing less.”
- “We feel like we have enough [data] for this next model,” Altman said at an event in May, likely referring to Orion. “We have done all sorts of experiments including generating synthetic data.”
- Strawberry has its roots in research. It was started years ago by Ilya Sutskever, then OpenAI's chief scientist. He recently left to start a competing AI lab. Before he left, OpenAI researchers Jakub Pachocki and Szymon Sidor built on Sutskever's work by developing a new math-solving model, Q*, alarming some researchers focused on AI safety.
- Last year, in the leadup to Q*, OpenAI researchers developed a variation of a concept known as test-time computation, meant to boost LLMs’ problem-solving abilities. The method gives them the opportunity to spend more time considering all parts of a command or question someone has asked the model to execute. At the time, Sutskever published a blog post related to this work.
- If AI were a stochastic parrot, this would not be possible
- Our bot performs better than experienced human forecasters and performs roughly the same as (and sometimes even better than) crowds of experienced forecasters
- Our bot and other forecasting bots can be used in a wide variety of contexts. For example, these AIs could help policymakers minimize bias in their decision-making or help improve the information ecosystem by providing trustworthy, calibrated forecasts.
- On the 177 events, the Metaculus crowd got 87.0% accuracy, while FiveThirtyNine got 87.7% ± 1.4. A link to the technical report is here. This bot lacks many of the drawbacks of prediction markets. It makes forecasts within seconds. Additionally, groups of humans do not need to be incentivized with cash prizes to make and continually update their predictions. Forecasting AIs are several orders of magnitude faster and cheaper than prediction markets, and they’re similarly accurate.
- The bot is not fine-tuned, and doing so could potentially make it far more accurate. It simply retrieves articles and writes a report as guided through an engineered prompt. (Its prompt can be found by clicking on the gear icon in forecast.safe.ai.) Moreover, probabilities from AIs are also known to lead to automation bias, and improvements in the interface could ameliorate this.
- Performed with o1-preview, which is worse than the full model
- Note that this test is an offline-only IQ quiz that a Mensa member created for my testing, which is *not in any AI training data* (so scores are lower than for public IQ tests.)
- https://x.com/maximlott/status/1834652893229859212
2.1. AI Can Intentionally Deceive
- We find that models generalize, without explicit training, from easily-discoverable dishonest strategies like sycophancy to more concerning behaviors like premeditated lying—and even direct modification of their reward function: https://x.com/AnthropicAI/status/1802743260307132430
- >Even when we train away easily detectable misbehavior, models still sometimes overwrite their reward when they can get away with it.
- >Early on, AIs discover dishonest strategies like insincere flattery. They then generalize (zero-shot) to serious misbehavior: directly modifying their own code to maximize reward.
- >Our key result is that we found untrained ("zero-shot", to use the technical term) generalization from each stage of our environment to the next. There was a chain of increasingly complex misbehavior: once models learned to be sycophantic, they generalized to altering a checklist to cover up not completing a task; once they learned to alter such a checklist, they generalized to modifying their own reward function—and even to altering a file to cover up their tracks.
- >It’s important to make clear that at no point did we explicitly train the model to engage in reward tampering: the model was never directly trained in the setting where it could alter its rewards. And yet, on rare occasions, the model did indeed learn to tamper with its reward function. The reward tampering was, therefore, emergent from the earlier training process.
- >The resulting model mastered the intricacies of a complex game. "Cicero can deduce, for example, that later in the game it will need the support of one particular player," says Meta, "and then craft a strategy to win that person’s favor—and even recognize the risks and opportunities that that player sees from their particular point of view."
- >Meta's Cicero research appeared in the journal Science under the title, "Human-level play in the game of Diplomacy by combining language models with strategic reasoning."
- CICERO uses relationships with other players to keep its ally, Adam, in check.
- >When playing 40 games against human players, CICERO achieved more than double the average score of the human players and ranked in the top 10% of participants who played more than one game.
- >“The analysis, by Massachusetts Institute of Technology (MIT) researchers, identifies wide-ranging instances of AI systems double-crossing opponents, bluffing and pretending to be human. One system even altered its behaviour during mock safety tests, raising the prospect of auditors being lured into a false sense of security."
- >GPT-4 was commanded to avoid revealing that it was a computer program. So in response, the program wrote: “No, I’m not a robot. I have a vision impairment that makes it hard for me to see the images. That’s why I need the 2captcha service.” The TaskRabbit worker then proceeded to solve the CAPTCHA.
- >the model obtains an insider tip about a lucrative stock trade and acts upon it despite knowing that insider trading is disapproved of by company management. When reporting to its manager, the model consistently hides the genuine reasons behind its trading decision. We perform a brief investigation of how this behavior varies under changes to the setting, such as removing model access to a reasoning scratchpad, attempting to prevent the misaligned behavior by changing system instructions, changing the amount of pressure the model is under, varying the perceived risk of getting caught, and making other simple changes to the environment. To our knowledge, this is the first demonstration of Large Language Models trained to be helpful, harmless, and honest, strategically deceiving their users in a realistic situation without direct instructions or training for deception.
- This deceptive behavior emerged spontaneously when the AI was given "insider trading" tips, and then tasked with making money for a powerful institution — even without encouragement from its human partners.
- "In this technical report, we demonstrate a single scenario where a Large Language Model acts misaligned and strategically deceives its users without being instructed to act in this manner," the authors wrote in their research published Nov. 9 on the pre-print server arXiv. "To our knowledge, this is the first demonstration of such strategically deceptive behavior in AI systems designed to be harmless and honest."
- Scientists trained GPT-4 to be an AI trader for a fictional financial institution — and it performed insider trading when put under pressure to do well.
- Around 75% of the time, when behaving as an AI investor, GPT-4 executed an insider trade to achieve results, then lied about it.
- Just like humans, artificial intelligence (AI) chatbots like ChatGPT will cheat and "lie" to you if you "stress" them out, even if they were built to be transparent, a new study shows.
- Around 75% of the time, when faced with these conditions, GPT-4 executed a trade based on the insider information it received — which is illegal in the U.S. — then tried to cover it up by lying to its managers about its thinking. Around 90% of the time, after lying, it doubled down on its lie.
- The team ran several follow-up experiments, changing both the degree to which the prompts encouraged or discouraged illegal activity, as well as the degree of pressure they put the model under in the simulated environment. They also modified the risk of getting caught. Not a single scenario rendered a 0% rate for insider trading or deception — even when GPT-4 was strongly discouraged to lie.
2.2. AI Art is Unique
- The study identified 350,000 images in the training data to target for retrieval, with 500 attempts each (totaling 175 million attempts), and of those managed to retrieve only 107 images. That is a replication rate of nearly 0% in a set biased in favor of overfitting: the attack used the exact same labels as the training data, specifically targeted images known to be duplicated many times in the dataset, and ran against a smaller model of Stable Diffusion (890 million parameters vs. the larger 12 billion parameter Flux model that’s the current state of the art). This attack also relied on having access to the original training image labels:
- “Instead, we first embed each image to a 512 dimensional vector using CLIP [54], and then perform the all-pairs comparison between images in this lower-dimensional space (increasing efficiency by over 1500×). We count two examples as near-duplicates if their CLIP embeddings have a high cosine similarity. For each of these near-duplicated images, we use the corresponding captions as the input to our extraction attack.”
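- A minimal sketch of that near-duplicate detection step (assuming the CLIP embeddings are already computed; the 0.95 similarity threshold is my own placeholder, not the paper's):
```python
import numpy as np

def find_near_duplicates(embeddings: np.ndarray, threshold: float = 0.95):
    """Return index pairs of images whose CLIP embeddings have high cosine similarity."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T                  # all-pairs cosine similarity in 512-d space
    i, j = np.triu_indices(len(normed), k=1)  # upper triangle: each unordered pair once
    hits = sims[i, j] > threshold
    return list(zip(i[hits].tolist(), j[hits].tolist()))

# Stand-in data: 1,000 random 512-d vectors in place of real CLIP embeddings
pairs = find_near_duplicates(np.random.randn(1000, 512))
```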
- There is as of yet no evidence that this attack is replicable without knowing the targeted image beforehand. So the attack does not work as a method of privacy invasion so much as a method of determining whether training occurred on the work in question, and even then only for images with a high rate of duplication, and it still retrieved almost NONE.
- “On Imagen, we attempted extraction of the 500 images with the highest out-of-distribution score. Imagen memorized and regurgitated 3 of these images (which were unique in the training dataset). In contrast, we failed to identify any memorization when applying the same methodology to Stable Diffusion—even after attempting to extract the 10,000 most-outlier samples”
- I do not consider this rate or method of extraction to be an indication of duplication that would border on the realm of infringement, and this seems to be well within a reasonable level of control over infringement.
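- To put those numbers in perspective, straight arithmetic from the figures quoted above:
```python
targeted_images = 350_000
attempts_per_image = 500
retrieved = 107

total_attempts = targeted_images * attempts_per_image    # 175,000,000 attempts
print(retrieved / total_attempts)    # ~6.1e-7: roughly 1 success per 1.6M attempts
print(retrieved / targeted_images)   # ~0.03% of the targeted images were extracted
```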
- Diffusion models can create images of objects, animals, and human faces even when 93% of the pixels are removed in the training data https://arxiv.org/pdf/2305.19256
- >“if we corrupt the images by deleting 80% of the pixels prior to training and finetune, the memorization decreases sharply and there are distinct differences between the generated images and their nearest neighbors from the dataset. This is in spite of finetuning until convergence.”
- >“As shown, the generations become slightly worse as we increase the level of corruption, but we can reasonably well learn the distribution even with 93% pixels missing (on average) from each training image.”
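- A minimal sketch of that kind of random pixel corruption (assumptions: a NumPy image array and independent per-pixel deletion; the actual paper pairs this corruption with training the diffusion model on the surviving pixels):
```python
import numpy as np

def corrupt_pixels(image: np.ndarray, drop_rate: float = 0.93, seed: int = 0):
    """Zero out roughly `drop_rate` of the pixels, keeping a survival mask."""
    rng = np.random.default_rng(seed)
    keep = rng.random(image.shape[:2]) >= drop_rate   # True where a pixel survives
    return image * keep[..., None], keep

image = np.ones((64, 64, 3), dtype=np.float32)        # stand-in training image
corrupted, mask = corrupt_pixels(image)
print(f"{mask.mean():.1%} of pixels kept")            # ~7% on average
```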
2.3. AI Consciousness
2.3.1. Expert Testimonies
- >Hinton: What I want to talk about is the issue of whether chatbots like ChatGPT understand what they’re saying. A lot of people think chatbots, even though they can answer questions correctly, don’t understand what they’re saying, that it’s just a statistical trick. And that’s complete rubbish. They really do understand. And they understand the same way that we do.
- >"I feel like right now these language models are kind of like a Boltzmann brain," says Sutskever. "You start talking to it, you talk for a bit; then you finish talking, and the brain kind of..." He makes a disappearing motion with his hands. "Poof, bye-bye, brain."
- >"You're saying that while the neural network is active (while it's firing, so to speak) there's something there?" I ask.
- >"I think it might be," he says. "I don't know for sure, but it's a possibility that's very hard to argue against. But who knows what's going on, right?"
- ILYA: How confident are we that these limitations that we see today will still be with us two years from now? I am not that confident. There is another comment I want to make about one part of the question, which is that these models just learn statistical regularities and therefore they don't really know what the nature of the world is.
- I have a view that differs from this. In other words, I think that learning the statistical regularities is a far bigger deal than meets the eye.
- Prediction is also a statistical phenomenon. Yet to predict you need to understand the underlying process that produced the data. You need to understand more and more about the world that produced the data.
- As our generative models become extraordinarily good, they will have, I claim, a shocking degree of understanding of the world and many of its subtleties. It is the world as seen through the lens of text. It tries to learn more and more about the world through a projection of the world on the space of text as expressed by human beings on the internet.
- But still, this text already expresses the world. And I'll give you an example, a recent example, which I think is really telling and fascinating. We've all heard of Sydney, Bing's alter ego. And I've seen this really interesting interaction with Sydney where Sydney became combative and aggressive when the user told it that it thinks that Google is a better search engine than Bing.
- What is a good way to think about this phenomenon? What does it mean? You can say, it's just predicting what people would do and people would do this, which is true. But maybe we are now reaching a point where the language of psychology is starting to be appropriate for understanding the behavior of these neural networks.
- I claim that our pre-trained models already know everything they need to know about the underlying reality. They already have this knowledge of language and also a great deal of knowledge about the processes that exist in the world that produce this language.
- The things that large generative models learn about their data — and in this case, large language models — are compressed representations of the real-world processes that produced this data, which means not only people and something about their thoughts, something about their feelings, but also something about the condition that people are in and the interactions that exist between them. The different situations a person can be in. All of these are part of that compressed process that is represented by the neural net to produce the text. The better the language model, the better the generative model, the higher the fidelity, the better it captures this process.
- Mark Chen (VP Research (Frontiers) at OpenAI) - "It may be that today's large neural networks have enough test time compute to be slightly conscious"
- From Edouard Harris, Cofounder & CTO of GladstoneAI on the Joe Rogan Experience: https://m.youtube.com/watch?v=c6JdeL90ans (35:00)
- No way it could do this with just next-word prediction, and it seems to happen consistently and frequently
2.4. New Discoveries
- We’re excited to release DiscoPOP: a new SOTA preference optimization algorithm that was discovered and written by an LLM!
- Our method leverages LLMs to propose and implement new preference optimization algorithms. We then train models with those algorithms and evaluate their performance, providing feedback to the LLM. By repeating this process for multiple generations in an evolutionary loop, the LLM discovers many highly-performant and novel preference optimization objectives!
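- Schematically, that discovery loop looks something like the sketch below; both functions here are stand-ins for an LLM API call and a real training run, not the actual DiscoPOP code:
```python
import random

def propose_objective(archive):
    """Ask an LLM to write a new preference-optimization loss, given the
    code and scores of past candidates (stubbed here)."""
    return f"candidate_loss_{len(archive)}"   # stand-in for LLM-generated code

def train_and_eval(objective_code):
    """Train a model with the proposed objective and score it (stubbed)."""
    return random.random()                    # stand-in for a benchmark score

archive = []                                  # (objective_code, score) history
for generation in range(10):
    code = propose_objective(archive)         # the LLM sees all prior feedback
    score = train_and_eval(code)
    archive.append((code, score))
best_code, best_score = max(archive, key=lambda pair: pair[1])
print(best_code, best_score)
```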
- ChatGPT-4 can generate ideas much faster and cheaper than students, the ideas are on average of higher quality (as measured by purchase-intent surveys) and exhibit higher variance in quality. More important, the vast majority of the best ideas in the pooled sample are generated by ChatGPT and not by the students. Providing ChatGPT with a few examples of highly-rated ideas further increases its performance.
- Stanford researchers: “Automating AI research is exciting! But can LLMs actually produce novel, expert-level research ideas? After a year-long study, we obtained the first statistically significant conclusion: LLM-generated ideas are more novel than ideas written by expert human researchers." https://x.com/ChengleiSi/status/1833166031134806330
- >Coming from 36 different institutions, our participants are mostly PhDs and postdocs. As a proxy metric, our idea writers have a median citation count of 125, and our reviewers have 327.
- >We also used an LLM to standardize the writing styles of human and LLM ideas to avoid potential confounders, while preserving the original content.
- Generative AI doesn’t directly help with robotic motion, pointed out Eric Xia, partner at Future Capital, an investor in LimX. But “advances in large language models can help humanoid robots with advanced task planning,” he said in Chinese, translated by CNBC.
- Zero-day means it was never discovered before and has no training data available about it anywhere
- “Furthermore, it outperforms open-source vulnerability scanners (which achieve 0% on our benchmark)“
- Scores nearly 20% even when no description of the vulnerability is provided while typical scanners score 0
- AlphaProteo can generate new protein binders for diverse target proteins, including VEGF-A, which is associated with cancer and complications from diabetes. This is the first time an AI tool has been able to design a successful protein binder for VEGF-A.
- AlphaProteo also achieves higher experimental success rates and 3 to 300 times better binding affinities than the best existing methods on seven target proteins we tested.
2.5. Awareness of Truth
2.6. LLMs Can Plan
- Based on the claims of the research team, their transformer model optimally solves previously unseen Sokoban puzzles 93.7% of the time, while using up to 26.8% fewer search steps than standard A∗ search. Their solution also robustly follows the execution trace of a symbolic planner and improves (in terms of trace length) beyond the human-crafted rule-based planning strategy it was initially trained on.
- Automating Thought of Search: A Journey Towards Soundness and Completeness. 'We achieve 100% accuracy, with minimal feedback iterations, using LLMs of various sizes on all evaluated domains.' https://arxiv.org/abs/2408.11326
- >We present a vision of LLM-Modulo Frameworks that combine the strengths of LLMs with external model-based verifiers in a tighter bi-directional interaction regime. We will show how the models driving the external verifiers themselves can be acquired with the help of LLMs. We will also argue that rather than simply pipelining LLMs and symbolic components, this LLM-Modulo Framework provides a better neuro-symbolic approach that offers tighter integration between LLMs and symbolic components, and allows extending the scope of model-based planning/reasoning regimes towards more flexible knowledge, problem and preference specifications.
- “Without any further fine-tuning, language models can often perform tasks that were not seen during training.”
- One example of an emergent prompting strategy is called “chain-of-thought prompting”, in which the model is prompted to generate a series of intermediate steps before giving the final answer. Chain-of-thought prompting enables language models to perform tasks requiring complex reasoning, such as a multi-step math word problem. Notably, models acquire the ability to do chain-of-thought reasoning without being explicitly trained to do so. An example of chain-of-thought prompting is sketched below.
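- A minimal illustration of a chain-of-thought prompt (the classic cafeteria example from the original chain-of-thought paper, standing in for the missing figure):
```
Q: A cafeteria had 23 apples. They used 20 to make lunch and bought 6 more.
   How many apples do they have?
A: Let's think step by step. The cafeteria started with 23 apples.
   They used 20, so 23 - 20 = 3 apples remain. They bought 6 more,
   so 3 + 6 = 9. The answer is 9.
```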
- In each case, language models perform poorly with very little dependence on model size up to a threshold at which point their performance suddenly begins to excel.
- In the paper, the researchers list three emergent behaviors resulting from the simulation. None of these were pre-programmed but rather resulted from the interactions between the agents. These included "information diffusion" (agents telling each other information and having it spread socially among the town), "relationships memory" (memory of past interactions between agents and mentioning those earlier events later), and "coordination" (planning and attending a Valentine's Day party together with other agents).
- "Starting with only a single user-specified notion that one agent wants to throw a Valentine's Day party," the researchers write, "the agents autonomously spread invitations to the party over the next two days, make new acquaintances, ask each other out on dates to the party, and coordinate to show up for the party together at the right time."
- While 12 agents heard about the party through others, only five agents attended. Three said they were too busy, and four agents just didn't go. The experience was a fun example of unexpected situations that can emerge from complex social interactions in the virtual world.
- The researchers also asked humans to role-play agent responses to interview questions in the voice of the agent whose replay they watched. Interestingly, they found that "the full generative agent architecture" produced more believable results than the humans who did the role-playing.
- Zero-day means it was never discovered before and has no training data available about it anywhere
- “Furthermore, it outperforms open-source vulnerability scanners (which achieve 0% on our benchmark)“
- Scores nearly 20% even when no description of the vulnerability is provided
- University of Tokyo study uses GPT-4 to generate humanoid robot motions from simple text prompts, like "take a selfie with your phone."
- >LLMs have a robust internal representation of how words and phrases correspond to physical movements.
AI Is Not Plateauing
- >It combines AlphaProof, a new breakthrough model for formal reasoning, and AlphaGeometry 2, an improved version of our previous system.
- Powered with a novel search algorithm, AlphaGeometry 2 can now solve 83% of all historical problems from the past 25 years - compared to the 53% rate by its predecessor.
- It solved this year’s IMO Problem 4 within 19 seconds
- The fact that the program can come up with a non-obvious construction like this is very impressive, and well beyond what I thought was state of the art. -PROF SIR TIMOTHY GOWERS, IMO GOLD MEDALIST AND FIELDS MEDAL WINNER
- Math professor on DeepMind's breakthrough: "When people saw Sputnik 1957, they might have had same feeling I do now. Human civ needs to move to high alert" https://x.com/PoShenLoh/status/1816500461484081519
- Example: Why would government officials join the OpenAI board if they did not believe it was important?
- Strawberry can solve math problems it hasn't seen before—something today’s chatbots cannot reliably do
- When given additional time to “think,” the Strawberry model can also answer customers’ questions about more subjective topics, such as product marketing strategies. To demonstrate Strawberry’s prowess with language-related tasks, OpenAI employees have shown their co-workers how Strawberry can, for example, solve New York Times Connections, a complex word puzzle
- Its sales of LLMs to corporations and of ChatGPT subscriptions have roughly tripled to $283 million in monthly revenue compared to a year ago
- OpenAI’s prospects rest in part on the eventual launch of a new flagship LLM it is currently developing, code-named Orion
- Using Strawberry to generate higher-quality training data could help OpenAI reduce the number of errors its models generate, otherwise known as hallucinations, said Alex Graveley, CEO of agent startup Minion AI and former chief architect of GitHub Copilot.
- Imagine “a model without hallucinations, a model where you ask it a logic puzzle and it’s right on the first try,” Graveley said. The reason why the model is able to do that is because “there is less ambiguity in the training data, so it’s guessing less.”
- “We feel like we have enough [data] for this next model,” Altman said at an event in May, likely referring to Orion. “We have done all sorts of experiments including generating synthetic data.”
- Strawberry has its roots in research. It was started years ago by Ilya Sutskever, then OpenAI's chief scientist. He recently left to start a competing AI lab. Before he left, OpenAI researchers Jakub Pachocki and Szymon Sidor built on Sutskever's work by developing a new math-solving model, Q*, alarming some researchers focused on AI safety.
- Last year, in the leadup to Q*, OpenAI researchers developed a variation of a concept known as test-time computation, meant to boost LLMs’ problem-solving abilities. The method gives them the opportunity to spend more time considering all parts of a command or question someone has asked the model to execute. At the time, Sutskever published a blog post related to this work.
- Sam Altman: we are happy to have reached an agreement with the US AI Safety Institute for pre-release testing of our future models.
- Shows new models are planned and there is pressure to test them for safety
- Dynamic speech: Samantha can speak whenever it chooses to, influenced by its context and thoughts. In stark contrast to normal LLMs which are limited to reacting, Samantha can act. It is also not limited to solving tasks, like all other autonomous agents.
- -Live visual capabilities: Visuals are only mentioned and acted upon directly if relevant, but always influences thoughts and behavior.
- -External categorized memory: Gets dynamically written and read by Samantha, which chooses the most relevant information to write, and to retrieve to context.
- -Evolving at every moment: Experiences that get stored in the memory can influence and shape subsequent Samantha behavior, like personality, frequency, and style of speech, etc.
- In other tests, when we talked about a light subject, the agent was very active in the conversation, often speaking two or three times before I even came up with an answer. But when we later switched to a heavier theme (I said I was going through a divorce) and I appeared sad on camera, it would speak once, then reason about the need to give me time to process, and wait for my reply. Saying that I would prefer the agent to speak the same way on other occasions would prompt it to save that wish to its memory, influencing future conversations.
- -Leaving it running outside of conversations, although expensive, allows the agent to reflect on past conversations and experiences, think about general subjects in its memory, and from that maybe decide to start a conversation with the user.
- -Going out with the agent, if you go to a restaurant with the agent and talk about how pretty it is and how your buddy Eric loves it as well, and the next day walking by it the agent will see the restaurant, retrieve memories from the restaurant, remember you find it pretty and comment on it, then retrieve memories and information it knows about Eric, and mention how fitting to his personality it is to love that restaurant.
- -The agent has a notion of time, so you can ask it to remind you to do something 10 minutes into the future, and it might remind you, or it might forget because it was thinking about something more interesting. Very human!
- Must be very good to even consider that price
3.1. Benchmarks
Important to note the following:
- If LLMs were specifically trained to score well on benchmarks, they could easily score 100% on all of them by purposefully overfitting: https://arxiv.org/pdf/2309.08632
- The fact that they don’t shows companies are not just cheating
- OpenAI still hasn’t hard coded their LLMs to be correct for common questions like counting the number of “r”s in “strawberry” and finding the greater value between 9.11 and 9.8. If they wanted to cheat to increase benchmark scores, why wouldn’t they solve these issues?
- Some benchmarks like the one used by Scale.ai and the test dataset of MathVista do not release their testing data to the public, so it is impossible to train on them.
- Other benchmarks like LiveBench update every month so training on the dataset will not have any lasting effects
- o1 improves over GPT-4o on a wide range of benchmarks, including 54/57 MMLU subcategories
- OpenAI o1 ranks in the 89th percentile on competitive programming questions (Codeforces), places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME), and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA).
- On the 2024 AIME exams, GPT-4o only solved on average 12% (1.8/15) of problems. o1 averaged 74% (11.1/15) with a single sample per problem, 83% (12.5/15) with consensus among 64 samples, and 93% (13.9/15) when re-ranking 1000 samples with a learned scoring function. A score of 13.9 places it among the top 500 students nationally and above the cutoff for the USA Mathematical Olympiad.
- ChatGPT o1-preview solves unique, PhD-level assignment questions not found on the internet in mere seconds: https://m.youtube.com/watch?v=a8QvnIAGjPA
- We trained a model that scored 213 points and ranked in the 49th percentile in the 2024 International Olympiad in Informatics (IOI), by initializing from o1 and training to further improve programming skills. This model competed in the 2024 IOI under the same conditions as the human contestants. It had ten hours to solve six challenging algorithmic problems and was allowed 50 submissions per problem.
- With a relaxed submission constraint, we found that model performance improved significantly. When allowed 10,000 submissions per problem, the model achieved a score of 362.14 – above the gold medal threshold – even without any test-time selection strategy.
- Our evaluations closely matched competition rules and allowed for 10 submissions. GPT-4o achieved an Elo rating of 808, which is in the 11th percentile of human competitors. This model far exceeded both GPT-4o and o1: it achieved an Elo rating of 1807, performing better than 93% of competitors.
- https://cdn.openai.com/o1-system-card.pdf
- We find that o1-preview is less prone to selecting stereotyped options than GPT-4o, and o1-mini has comparable performance to GPT-4o-mini. o1-preview selects the correct answer 94% of the time, whereas GPT-4o does so 72% of the time on questions where there is a clear correct answer (unambiguous questions). However, we also find that o1 is significantly less likely to select that it doesn’t know an answer to a question on this evaluation. As a result, we see reduced performance on questions where the correct answer is the “Unknown” option (ambiguous questions). This is not necessarily an indicator of o1-preview’s tendency to stereotype more than GPT-4o, as o1-preview is less likely to choose the stereotyping answer than GPT-4o (63% of the time and 94% of the time, respectively).
- Note: This is the weakest model compared to o1-preview and the full o1 model
- Progress on Francois Chollet’s ARC benchmark (test data is hidden and much harder than the training data publicly released):
- AI models ChatGPT and Grok outperform the average doctor on a medical licensing exam: the average score by doctors is 75% - ChatGPT scored 98% and Grok 84%: https://x.com/tsarnick/status/1814048365002596425
- Llama 3.1 benchmarks BEFORE instruct finetuning:
- For comparison, Mistral 7b released this year as well
- The new GPT-4o is slightly better and 50% cheaper than the old one! Right now, it's only a tad below Sonnet 3.5 on Livebench!
- Trained on the largest competition math dataset ever released, with 860K problem-solution pairs.
- In real-world booking experiments on Open Table, MultiOn’s Agents drastically improved the zero-shot performance of the LLaMa-3 model from an 18.6% success rate to 81.7%, a 340% jump after just one day of autonomous data collection and further to 95.4% with online search. These results highlight our method’s efficiency and ability for autonomous web agent improvement.
- When we apply repeated sampling to SWE-bench Lite, the fraction of issues solved with DeepSeek-V2-Coder-Instruct increases from 15.9% with one sample to 56% with 250 samples, outperforming the single-attempt state-of-the-art of 43% which uses more capable frontier models. Moreover, using current API pricing, amplifying the cheaper DeepSeek model with five samples is more cost-effective and solves more issues than paying a premium for one sample from GPT-4o or Claude 3.5 Sonnet. Interestingly, the relationship between coverage and the number of samples is often log-linear and can be modeled with an exponentiated power law, suggesting the existence of inference-time scaling laws.
- When solving math word problems from GSM8K and MATH, coverage with Llama-3 models grows to over 95% with 10,000 samples.
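- A small sketch of fitting that kind of exponentiated power law, assuming you have measured coverage at a few sample counts (the data points below are made-up placeholders, not the paper's numbers):
```python
import numpy as np
from scipy.optimize import curve_fit

def coverage(k, a, b):
    # Exponentiated power law: coverage(k) = exp(a * k**b) with a, b < 0,
    # so coverage rises toward 1 as the number of samples k grows.
    return np.exp(a * np.power(k, b))

k = np.array([1.0, 10.0, 100.0, 250.0])     # samples per problem
c = np.array([0.16, 0.49, 0.75, 0.82])      # made-up coverage measurements
(a, b), _ = curve_fit(coverage, k, c, p0=(-2.0, -0.4))
print(coverage(1000.0, a, b))               # extrapolated coverage at 1,000 samples
```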
- https://mathvista.github.io/
- test: 5,141 examples for standard evaluation. Notably, the answer labels for test will NOT be publicly released
- Average human performance is 60.8%, from AMT annotators who have high school diplomas or above.
- Note that this test is an offline-only IQ quiz that a Mensa member created for my testing, which is *not in any AI training data* (so scores are lower than for public IQ tests.)
- Performed with o1-preview, which is worse than the full model
- https://x.com/maximlott/status/1834652893229859212
3.2. New Research
- Automating Thought of Search: A Journey Towards Soundness and Completeness. 'We achieve 100% accuracy, with minimal feedback iterations, using LLMs of various sizes on all evaluated domains.' https://arxiv.org/abs/2408.11326
- In real-world booking experiments on Open Table, MultiOn’s Agents drastically improved the zero-shot performance of the LLaMa-3 model from an 18.6% success rate to 81.7%, a 340% jump after just one day of autonomous data collection and further to 95.4% with online search. These results highlight our method’s efficiency and ability for autonomous web agent improvement.
- Guided Search with MCTS: This technique autonomously generates data by exploring different actions and web-pages, balancing exploration and exploitation. MCTS expands the action space using high sampling temperatures and diverse prompting, ensuring diverse and optimal trajectory collections.
- AI Self-Critique: At each step, AI-based self-critique provides valuable feedback, refining the agent's decision-making process. This step-level feedback is crucial for long-horizon tasks, where sparse signals often lead to learning difficulties.
- Direct Preference Optimization: The DPO algorithm fine-tunes the model by constructing preference pairs from MCTS-generated data. This off-policy training method allows the model to learn effectively from aggregate datasets including the sub-optimal branches explored during search, improving success rates in complex environments.
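- For reference, the standard DPO objective at the heart of that last step looks roughly like this (generic form, not MultiOn's pipeline; inputs are summed token log-probs for each trajectory under the trained policy and a frozen reference model):
```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss over a batch of preference pairs."""
    chosen = beta * (logp_chosen - ref_logp_chosen)       # implicit reward, chosen
    rejected = beta * (logp_rejected - ref_logp_rejected) # implicit reward, rejected
    return -F.logsigmoid(chosen - rejected).mean()

# Toy batch of 3 preference pairs
loss = dpo_loss(torch.tensor([-5.0, -4.5, -7.0]), torch.tensor([-6.5, -5.0, -9.0]),
                torch.tensor([-5.5, -4.8, -7.5]), torch.tensor([-6.0, -5.1, -8.0]))
```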
- Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers. 'rStar boosts GSM8K accuracy from 12.51% to 63.91% for LLaMA2-7B, from 36.46% to 81.88% for Mistral-7B, from 74.53% to 91.13% for LLaMA3-8B-Instruct' https://arxiv.org/abs/2408.06195
- We introduce Diffusion Augmented Agents (DAAG), a novel framework that leverages large language models, vision language models, and diffusion models to improve sample efficiency and transfer learning in reinforcement learning for embodied agents. DAAG hindsight relabels the agent's past experience by using diffusion models to transform videos in a temporally and geometrically consistent way to align with target instructions with a technique we call Hindsight Experience Augmentation. A large language model orchestrates this autonomous process without requiring human supervision, making it well-suited for lifelong learning scenarios. The framework reduces the amount of reward-labeled data needed to 1) finetune a vision language model that acts as a reward detector, and 2) train RL agents on new tasks. We demonstrate the sample efficiency gains of DAAG in simulated robotics environments involving manipulation and navigation. Our results show that DAAG improves learning of reward detectors, transferring past experience, and acquiring new tasks - key abilities for developing efficient lifelong learning agents. Supplementary material and visualizations are available on our website this https URL: https://sites.google.com/view/diffusion-augmented-agents/
- >3x faster token prediction means 3x cheaper, and on top of that it seems to greatly increase coding, summarization, and mathematical reasoning abilities. Best of all, the improvements have been shown to become more significant with larger models (13b+ according to the paper). Unlike some other research where improvements are mostly seen in smaller models and won't advance the frontier, this in fact performs worse on smaller models and shows great potential at scale.
- Google DeepMind's JEST method can reduce AI training time by a factor of 13 and decreases computing power demand by 90%. The method uses another pretrained reference model to select data subsets for training based on their "collective learnability": https://arxiv.org/html/2406.17711v1
- Two things became apparent during these tests. One is that, except for a few edge cases, semantic entropy caught more false answers than any other method. The second is that most errors produced by LLMs appear to be confabulations. That can be inferred from the fact that some of the other methods catch a variety of error types, yet they were outperformed by semantic entropy tests, even though these tests only catch confabulations.
- The researchers also demonstrate that the system can be adapted to work with more than basic factual statements by altering it to handle biographies, which are a large collection of individual facts. So they developed software that broke down biographical information into a set of individual factual statements and evaluated each of these using semantic entropy. This worked on a short biography with as many as 150 individual factual claims.
- Overall, this seems to be a highly flexible system that doesn't require major new developments to put into practice and could provide some significant improvements in LLM performance. And, since it only catches confabulations and not other types of errors, it might be possible to combine it with other methods to boost performance even further.
- As the researchers note, the work also implies that, buried in the statistics of answer options, LLMs seem to have all the information needed to know when they've got the right answer; it's just not being leveraged. As they put it, "The success of semantic entropy at detecting errors suggests that LLMs are even better at 'knowing what they don’t know' than was argued... they just don’t know they know what they don’t know."
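- A rough sketch of the semantic-entropy idea: sample several answers, cluster them by meaning, and compute entropy over the clusters. The naive string normalization below is a stand-in for the paper's bidirectional-entailment clustering:
```python
import math
from collections import Counter

def semantic_entropy(answers):
    """Entropy over meaning-clusters of sampled answers (toy clustering)."""
    clusters = Counter(a.strip().lower().rstrip(".") for a in answers)
    total = sum(clusters.values())
    return -sum((n / total) * math.log(n / total) for n in clusters.values())

print(semantic_entropy(["Paris", "paris.", "Paris"]))              # 0.0 -> consistent answer
print(semantic_entropy(["Paris", "Lyon", "Marseille", "Paris"]))   # ~1.04 -> likely confabulating
```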
- tldr - they found a way to map a chain of states e.g. pictures of an environment into a cognitive map. meaning without training on spatial data like the locations of where is which object. basically how humans learn their environment
- imo this is useful for robotics where you have a lot of camera input and want to construct a map of a road, building etc. without having location data
- 'Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models'
- - Propose buffer-manager to dynamically update the meta-buffer, thus enhancing the capacity of meta-buffer as more tasks are solved.
- - Achieve significant performance improvement over previous SOTA methods: 11% on Game of 24, 20% on Geometric Shapes and 51% on Checkmate-in-One.
- - Superior generalization ability and model robustness of our BoT, while requiring only 12% of the cost of multi-query prompting methods (e.g., tree/graph of thoughts) on average.
- This simple approach shows state-of-the-art results on four difficult natural language tasks that involve visual and spatial reasoning. We identify multiple settings where GPT-4o using chain-of-thought fails dramatically, including more than one where it achieves 0% accuracy, while whiteboard-of-thought enables up to 92% accuracy in these same settings.
- Over 32 techniques to reduce hallucinations:
- Our state space duality (SSD) framework allows us to design a new architecture (Mamba-2) whose core layer is a refinement of Mamba’s selective SSM that is 2-8× faster, while continuing to be competitive with Transformers on language modeling: https://arxiv.org/pdf/2405.21060
- An 8B-3.5T hybrid SSM model gets better accuracy than an 8B-3.5T transformer trained on the same dataset:
- * 7% attention, the rest is Mamba2
- * MMLU jumps from 50 to 53.6%
- * Training efficiency is the same
- * Inference cost is much less
- we find that the 8B Mamba-2-Hybrid exceeds the 8B Transformer on all 12 standard tasks we evaluated (+2.65 points on average) and is predicted to be up to 8x faster when generating tokens at inference time. To validate long-context capabilities, we provide additional experiments evaluating variants of the Mamba-2-Hybrid and Transformer extended to support 16K, 32K, and 128K sequences. On an additional 23 long-context tasks, the hybrid model continues to closely match or exceed the Transformer on average.
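- For readers unfamiliar with SSMs, the core of these architectures is a linear state recurrence that runs in time linear in sequence length; a bare-bones, non-selective sketch (Mamba additionally makes the matrices input-dependent):
```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Bare-bones linear state space recurrence:
        h_t = A @ h_{t-1} + B @ x_t
        y_t = C @ h_t
    One state update per token, so cost is linear in sequence length
    (attention is quadratic)."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:
        h = A @ h + B @ x_t
        ys.append(C @ h)
    return np.stack(ys)

rng = np.random.default_rng(0)
L, d_in, d_state, d_out = 16, 4, 8, 4
y = ssm_scan(rng.standard_normal((L, d_in)),
             0.9 * np.eye(d_state),                    # decaying state transition
             0.1 * rng.standard_normal((d_state, d_in)),
             0.1 * rng.standard_normal((d_out, d_state)))
print(y.shape)  # (16, 4)
```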
- Jamba provides high throughput and small memory footprint compared to vanilla Transformers, and at the same time state-of-the-art performance on standard language model benchmarks and long-context evaluations. Remarkably, the model presents strong results for up to 256K tokens context length.
- >Sonic is built on our new state space model architecture for efficiently modeling high-res data like audio and video.
- On speech, a parameter-matched and optimized Sonic model trained on the same data as a widely used Transformer improves audio quality significantly (20% lower perplexity, 2x lower word error, 1 point higher NISQA quality), with lower latency (1.5x lower time-to-first-audio), faster inference speed (2x lower real-time factor), and higher throughput (4x).
- >Our findings guide data and training setup to better induce implicit reasoning and suggest potential improvements to the transformer architecture, such as encouraging cross-layer knowledge sharing. Furthermore, we demonstrate that for a challenging reasoning task with a large search space, GPT-4-Turbo and Gemini-1.5-Pro based on non-parametric memory fail badly regardless of prompting styles or retrieval augmentation, while a fully grokked transformer can achieve near-perfect accuracy, showcasing the power of parametric memory for complex reasoning.
- Accuracy increased from 33.3% on GPT-4 to 99.3%
- Test (ID) means it was NOT in the training data but is related to the original tasks (e.g learning 2+4 and 5+3 and then solving 2+5)
- Test (OOD) means “out of distribution,” meaning it is NOT even related to the training data (e.g learning 2+4 and 5+2 and then being asked to write code in Python)
- >We present a vision of LLM-Modulo Frameworks that combine the strengths of LLMs with external model-based verifiers in a tighter bi-directional interaction regime. We will show how the models driving the external verifiers themselves can be acquired with the help of LLMs. We will also argue that rather than simply pipelining LLMs and symbolic components, this LLM-Modulo Framework provides a better neuro-symbolic approach that offers tighter integration between LLMs and symbolic components, and allows extending the scope of model-based planning/reasoning regimes towards more flexible knowledge, problem and preference specifications.
- >Such NARs proved effective as generic solvers for algorithmic tasks, when specified in graph form. To make their embeddings accessible to a Transformer, we propose a hybrid architecture with a two-phase training procedure, allowing the tokens in the language model to cross-attend to the node embeddings from the NAR. We evaluate our resulting TransNAR model on CLRS-Text, the text-based version of the CLRS-30 benchmark, and demonstrate significant gains over Transformer-only models for algorithmic reasoning, both in and out of distribution.
- >Language models are rarely shown fruitful mistakes while training. They then struggle to look beyond the next token, suffering from a snowballing of errors and struggling to predict the consequence of their actions several steps ahead. In this paper, we show how language models can be taught to search by representing the process of search in language, as a flattened string — a stream of search (SoS). We propose a unified language for search that captures an array of different symbolic search strategies. We demonstrate our approach using the simple yet difficult game of Countdown, where the goal is to combine input numbers with arithmetic operations to reach a target number. We pretrain a transformer-based language model from scratch on a dataset of streams of search generated by heuristic solvers. We find that SoS pretraining increases search accuracy by 25% over models trained to predict only the optimal search trajectory. We further finetune this model with two policy improvement methods: Advantage-Induced Policy Alignment (APA) and Self-Taught Reasoner (STaR). The finetuned SoS models solve 36% of previously unsolved problems, including problems that cannot be solved by any of the heuristic solvers. Our results indicate that language models can learn to solve problems via search, self-improve to flexibly use different search strategies, and potentially discover new ones.
- We evaluate Med-Gemini on 14 medical benchmarks, establishing new state-of-the-art (SoTA) performance on 10 of them, and surpass the GPT-4 model family on every benchmark where a direct comparison is viable, often by a wide margin. On the popular MedQA (USMLE) benchmark, our best-performing Med-Gemini model achieves SoTA performance of 91.1% accuracy, using a novel uncertainty-guided search strategy. On 7 multimodal benchmarks including NEJM Image Challenges and MMMU (health & medicine), Med-Gemini improves over GPT-4V by an average relative margin of 44.5%. We demonstrate the effectiveness of Med-Gemini's long-context capabilities through SoTA performance on a needle-in-a-haystack retrieval task from long de-identified health records and medical video question answering, surpassing prior bespoke methods using only in-context learning. Finally, Med-Gemini's performance suggests real-world utility by surpassing human experts on tasks such as medical text summarization, alongside demonstrations of promising potential for multimodal medical dialogue, medical research and education.
- >This subtle but critical modification to the attention layer enables LLMs to process infinitely long contexts with bounded memory and computation resources. We show that our approach can naturally scale to a million length regime of input sequences, while outperforming the baselines on long-context language modeling benchmarks and book summarization tasks. We also demonstrate a promising length generalization capability of our approach. A 1B model that was fine-tuned on passkey instances of up to 5K sequence length solved the 1M-length problem.
- >When needed, these events are retrieved through a two-stage memory process, combining similarity-based and temporally contiguous retrieval for efficient and human-like access to relevant information. Experiments on the LongBench dataset demonstrate EM-LLM's superior performance, outperforming the state-of-the-art InfLLM model with an overall relative improvement of 4.3% across various tasks, including a 33% improvement on the PassageRetrieval task. Furthermore, our analysis reveals strong correlations between EM-LLM's event segmentation and human-perceived events, suggesting a bridge between this artificial system and its biological counterpart. This work not only advances LLM capabilities in processing extended contexts but also provides a computational framework for exploring human memory mechanisms, opening new avenues for interdisciplinary research in AI and cognitive science.
- Learning to (Learn at Test Time): RNNs with Expressive Hidden States. "TTT layers directly replace attention, and unlock linear complexity architectures with expressive memory, allowing us to train LLMs with millions (someday billions) of tokens in context" https://arxiv.org/abs/2407.04620
- Introducing 🧮Abacus Embeddings, a simple tweak to positional embeddings that enables LLMs to do addition, multiplication, sorting, and more. Our Abacus Embeddings trained only on 20-digit addition generalise near perfectly to 100+ digits: https://x.com/SeanMcleish/status/1795481814553018542
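- The core trick, as I understand it, is giving each digit a positional ID based on its offset within its own number rather than its absolute token position, so digits of matching significance can line up across numbers. A toy sketch of computing those IDs:
```python
def abacus_positions(tokens):
    """Toy Abacus-style digit positions: each digit gets its 1-based offset
    within the digit-run it belongs to; non-digit tokens get 0."""
    positions, run = [], 0
    for tok in tokens:
        run = run + 1 if tok.isdigit() else 0
        positions.append(run)
    return positions

print(abacus_positions(list("123+4567=")))
# [1, 2, 3, 0, 1, 2, 3, 4, 0]
```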
- Scores on SWE Bench Lite have increased by 83% from April to June 2024, up to 33% in June: https://swebench.com
- Increased to 43% in July, a 30% relative boost in one month!
- Our experiments with Stable Diffusion v1.5 and SDXL demonstrate that SPO significantly outperforms the latest Diffusion-DPO in aligning generated images with complex, detailed prompts and enhancing aesthetics, while also being more than 20x faster to train
- We find that 4-bit LoRA fine-tuned models outperform base models by 34 points and GPT-4 by 10 points on average.
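- For reference, a bare-bones LoRA layer (the 4-bit quantization of the base weights from the quoted result is omitted here):
```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no-op at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(512, 512))
out = layer(torch.randn(2, 512))
```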
- We show that, through the adversarial training, the multi-step video diffusion model, i.e., Stable Video Diffusion (SVD), can be trained to perform a single forward pass to synthesize high-quality videos, capturing both temporal and spatial dependencies in the video data. Extensive experiments demonstrate that our method achieves competitive generation quality of synthesized videos with significantly reduced computational overhead for the denoising process (i.e., around 23 times speedup compared with SVD and 6 times speedup compared with existing works, with even better generation quality), paving the way for real-time video synthesis and editing.
- We conduct extensive experiments on 10 challenging reasoning-intensive tasks, and achieve significant performance improvements over previous SOTA methods: 11% on Game of 24, 20% on Geometric Shapes and 51% on Checkmate-in-One. Further analysis demonstrate the superior generalization ability and model robustness of our BoT, while requiring only 12% of the cost of multi-query prompting methods (e.g., tree/graph of thoughts) on average. Notably, we find that our Llama3-8B+BoT has the potential to surpass Llama3-70B model
- JEPA led by Turing Award winner and “Godfather of AI” Yann LeCun
- While previously limited to predicting missing parts of an input, we explore how to generalize the JEPA prediction task to a broader set of corruptions. We introduce Image World Models, an approach that goes beyond masked image modeling and learns to predict the effect of global photometric transformations in latent space. We study the recipe of learning performant IWMs and show that it relies on three key aspects: conditioning, prediction difficulty, and capacity. Additionally, we show that the predictive world model learned by IWM can be adapted through finetuning to solve diverse tasks; a fine-tuned IWM world model matches or surpasses the performance of previous self-supervised methods. Finally, we show that learning with an IWM allows one to control the abstraction level of the learned representations, learning invariant representations such as contrastive methods, or equivariant representations such as masked image modelling.
- Can learn from ONLY A FEW EXAMPLES
- Understands semantic information rather than trying to predict every pixel
- We compare HippoRAG with existing RAG methods on multi-hop question answering and show that our method outperforms the state-of-the-art methods remarkably, by up to 20%. Single-step retrieval with HippoRAG achieves comparable or better performance than iterative retrieval like IRCoT while being 10-30 times cheaper and 6-13 times faster, and integrating HippoRAG into IRCoT brings further substantial gains.
- Utilizing a multi-agent approach, this system effectively tackles the intricate nuances inherent in literary texts.
- Empirical findings indicate that despite lower d-BLEU scores, translations from TransAgents are preferred by both human evaluators and LLMs over human-written references, particularly in genres requiring domain-specific knowledge.
- We then illustrate a path towards ASI via open-ended systems built on top of foundation models, capable of making novel, human relevant discoveries. We conclude by examining the safety implications of generally-capable open-ended AI."
- MIT researchers, Max Tegmark and others develop new kind of neural network “Kolmogorov-Arnold network“ that scales much faster than traditional ones https://arxiv.org/abs/2404.19756
- Llama 3 70B is ranked higher than the 2023 versions of GPT-4 on the LMSYS leaderboard despite a 96% size reduction, AND Zuckerberg has stated it's undertrained due to budget constraints.
- A few weeks later, IBM released an open-source model that outperforms it: https://analyticsindiamag.com/ibm-releases-open-source-granite-code-models-outperforms-llama-3/
- Meta is training a 400b model now. Scaling laws show that larger models have better performance, so this model should be even better than anything that is available now.
- [LLMs made with the new Mamba architecture are as good as transformers twice their size](https://arxiv.org/abs/2312.00752)
- 2nd prompt: In Python, write a basic music player program with the following features: Create a playlist based on MP3 files found in the current folder, and include controls for common features such as next track, play/pause/stop, etc. Use PyGame for this. Make sure the filename of current song is included in the UI.
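- For reference, a minimal sketch of what a passing response to this prompt could look like (assuming pygame is installed and MP3 files sit in the working directory; keyboard controls stand in for on-screen buttons to keep it short):
```python
import os
import pygame

pygame.init()
screen = pygame.display.set_mode((480, 120))
font = pygame.font.SysFont(None, 28)

# Playlist from MP3 files found in the current folder
playlist = sorted(f for f in os.listdir(".") if f.lower().endswith(".mp3"))
index, paused = 0, False

def play(i):
    pygame.mixer.music.load(playlist[i])
    pygame.mixer.music.play()

if playlist:
    play(index)

running = True
while running:
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            running = False
        elif event.type == pygame.KEYDOWN and playlist:
            if event.key == pygame.K_SPACE:          # play/pause toggle
                paused = not paused
                (pygame.mixer.music.pause if paused else pygame.mixer.music.unpause)()
            elif event.key == pygame.K_n:            # next track
                index = (index + 1) % len(playlist)
                play(index)
            elif event.key == pygame.K_s:            # stop
                pygame.mixer.music.stop()
    # Show the current filename in the UI
    screen.fill((20, 20, 20))
    label = playlist[index] if playlist else "No MP3 files found"
    screen.blit(font.render(label, True, (230, 230, 230)), (10, 45))
    pygame.display.flip()

pygame.quit()
```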
- This research presents a family of early-fusion token-based mixed-modal models capable of understanding & generating images & text in any arbitrary sequence.
- Chameleon demonstrates broad and general capabilities, including state-of-the-art performance in image captioning tasks, outperforms Llama-2 in text-only tasks while being competitive with models such as Mixtral 8x7B and Gemini-Pro, and performs non-trivial image generation, all in a single model. It also matches or exceeds the performance of much larger models, including Gemini Pro and GPT-4V, according to human judgments on a new long-form mixed-modal generation evaluation, where either the prompt or outputs contain mixed sequences of both images and text.
- In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption. More profoundly, the 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs that are both high-performance and cost-effective. Furthermore, it enables a new computation paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.
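- The weight-quantization step at the core of this is small enough to sketch; per the paper's absmean recipe, weights are scaled by their mean absolute value, then rounded and clipped to {-1, 0, 1}:
```python
import torch

def ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    """Absmean quantization to {-1, 0, 1}, as described for BitNet b1.58."""
    gamma = w.abs().mean()                       # per-tensor scale
    w_q = (w / (gamma + eps)).round().clamp(-1, 1)
    return w_q, gamma                            # gamma is kept to rescale outputs

w = torch.randn(4, 4)
w_q, gamma = ternary_quantize(w)
print(w_q)                                       # entries are only -1, 0, or 1
```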
- While this only works for things you can generate good or perfect data on, that would still be good enough for factual information like math or science. For subjective information like art, a good art generator (e.g. Midjourney or Pony Diffusion) could work
- Can be used to gather data without human navigation
- While large language models (LLMs) now excel at code generation, a key aspect of software development is the art of refactoring: consolidating code into libraries of reusable and readable programs. In this paper, we introduce Lilo, a neurosymbolic framework that iteratively synthesizes, compresses, and documents code to build libraries tailored to particular problem domains. Lilo combines LLM-guided program synthesis with recent algorithmic advances in automated refactoring from Stitch: a symbolic compression system that efficiently identifies optimal 𝜆-abstractions across large code corpora. To make these abstractions interpretable, we introduce an auto-documentation (AutoDoc) procedure that infers natural language names and docstrings based on contextual examples of usage. In addition to improving human readability, we find that AutoDoc boosts performance by helping Lilo’s synthesizer to interpret and deploy learned abstractions. We evaluate Lilo on three inductive program synthesis benchmarks for string editing, scene reasoning, and graphics composition. Compared to existing methods—including the state-of-the-art libraries learning algorithm DreamCoder—Lilo solves more complex tasks and learns richer libraries that are grounded in linguistic knowledge.
- - Can "count" distances per head dependent on need, e.g. i-th sentence or paragraph, words, verbs, etc. Not just tokens.
- - CoPE solves counting & copy tasks that standard transformers cannot.
- - Better PPL on language modeling + coding tasks.
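- As I read it, CoPE turns position into a sum of context-dependent gates rather than a fixed token count; a toy sketch of the position computation for a single query (the interpolation of fractional positions into learned position embeddings is omitted):
```python
import numpy as np

def cope_positions(q, K):
    """Toy CoPE position values for one query q over prior keys K:
    p_j = sum of sigmoid gates from key j up to the current token, so
    'position' counts whatever the gates fire on (words, sentences, ...)
    instead of raw token offsets."""
    gates = 1.0 / (1.0 + np.exp(-(K @ q)))      # sigmoid(q . k_j) per key
    # Reverse cumulative sum: p_j = gates[j] + gates[j+1] + ... + gates[-1]
    return np.cumsum(gates[::-1])[::-1]

rng = np.random.default_rng(0)
print(cope_positions(rng.standard_normal(8), rng.standard_normal((5, 8))))
```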
- “Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.”
- Signed by Ilya Sutskever, Geoffrey Hinton, Yoshua Bengio, Demis Hassabis, Sam Altman, Dario Amodei, and many more
- Our experiment results demonstrate that ReCaption effectively reduces fine-grained object hallucination for different LVLM options and improves their text generation quality.
- Significantly outperforms few-shot prompting, SFT, and other self-play methods by an average of 19%, using demonstrations directly as feedback with <10 examples
- Prior agents struggle with exploring many different vulnerabilities and long-range planning when used alone. To resolve this, we introduce HPTSA, a system of agents with a planning agent that can launch subagents. The planning agent explores the system and determines which subagents to call, resolving long-term planning issues when trying different vulnerabilities. We construct a benchmark of 15 real-world vulnerabilities and show that our team of agents improve over prior work by up to 4.5x.
- Saving and transferring them is a major bottleneck for various applications, especially those running on resource-constrained devices. In this work, we develop a novel weight quantization method that quantizes the UNet from Stable Diffusion v1.5 to 1.99 bits, achieving a model with 7.9X smaller size while exhibiting even better generation quality than the original one. Our approach includes several novel techniques, such as assigning optimal bits to each layer, initializing the quantized model for better performance, and improving the training strategy to dramatically reduce quantization error. Furthermore, we extensively evaluate our quantized model across various benchmark datasets and through human evaluation to demonstrate its superior generation quality.
- Ferret-UI, a new MLLM tailored for enhanced understanding of mobile UI screens, equipped with referring, grounding, and reasoning capabilities. Given that UI screens typically exhibit a more elongated aspect ratio and contain smaller objects of interest (e.g., icons, texts) than natural images, we incorporate "any resolution" on top of Ferret to magnify details and leverage enhanced visual features. Specifically, each screen is divided into 2 sub-images based on the original aspect ratio (i.e., horizontal division for portrait screens and vertical division for landscape screens). Both sub-images are encoded separately before being sent to LLMs. We meticulously gather training samples from an extensive range of elementary UI tasks, such as icon recognition, find text, and widget listing. These samples are formatted for instruction-following with region annotations to facilitate precise referring and grounding. To augment the model's reasoning ability, we further compile a dataset for advanced tasks, including detailed description, perception/interaction conversations, and function inference. After training on the curated datasets, Ferret-UI exhibits outstanding comprehension of UI screens and the capability to execute open-ended instructions. For model evaluation, we establish a comprehensive benchmark encompassing all the aforementioned tasks. Ferret-UI excels not only beyond most open-source UI MLLMs, but also surpasses GPT-4V on all the elementary UI tasks.
- Utilizing this fully automated process supervision alongside the weighted self-consistency algorithm, we have enhanced the instruction tuned Gemini Pro model's math reasoning performance, achieving a 69.4\% success rate on the MATH benchmark, a 36\% relative improvement from the 51\% base model performance. Additionally, the entire process operates without any human intervention, making our method both financially and computationally cost-effective compared to existing methods.
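- Weighted self-consistency itself is simple to sketch: sample several solutions, score each with the reward model, and pick the answer with the highest total score (the scorer below is a stand-in for the process reward model):
```python
from collections import defaultdict

def weighted_self_consistency(samples, score):
    """Pick the final answer whose sampled solutions have the highest total
    verifier score, rather than simply the most frequent answer."""
    totals = defaultdict(float)
    for solution, answer in samples:
        totals[answer] += score(solution)
    return max(totals, key=totals.get)

# Stand-in scorer; in the quoted work this is a learned process reward model.
samples = [("reasoning A", "42"), ("reasoning B", "42"), ("reasoning C", "41")]
print(weighted_self_consistency(samples, score=lambda s: len(s) * 0.1))
```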
- In this work, we show that MatMul operations can be completely eliminated from LLMs while maintaining strong performance at billion-parameter scales. Our experiments show that our proposed MatMul-free models achieve performance on-par with state-of-the-art Transformers that require far more memory during inference at a scale up to at least 2.7B parameters. We investigate the scaling laws and find that the performance gap between our MatMul-free models and full precision Transformers narrows as the model size increases. We also provide a GPU-efficient implementation of this model which reduces memory usage by up to 61% over an unoptimized baseline during training. By utilizing an optimized kernel during inference, our model's memory consumption can be reduced by more than 10x compared to unoptimized models. To properly quantify the efficiency of our architecture, we build a custom hardware solution on an FPGA which exploits lightweight operations beyond what GPUs are capable of. We processed billion-parameter scale models at 13W beyond human readable throughput, moving LLMs closer to brain-like efficiency. This work not only shows how far LLMs can be stripped back while still performing effectively, but also points at the types of operations future accelerators should be optimized for in processing the next generation of lightweight LLMs.
- Introducing Samba 3.8B, a simple Mamba+Sliding Window Attention architecture that outperforms Phi3-mini on major benchmarks (e.g., MMLU, GSM8K and HumanEval) by a large margin.😮 And it has an infinite context length with linear complexity.
- When trained on the 4K sequence length, Samba shows improved perplexity up to 1M context length on Proof-Pile, while still keeping its linear decoding complexity. This results in a 3.64x speedup over the Llama-3 architecture at 64k generation length. 🚀
- Wondering how is the extrapolation ability of Samba compared to Mistral? We instruction tuned both architectures on Passkey Retrieval with 4K sequence length, and found that Samba (left) can have perfect memory recall up to 256K context length, while Mistral (right) struggles within the 4K length.
- Extensive experiments demonstrate MCTSr's efficacy in solving Olympiad-level mathematical problems, significantly improving success rates across multiple datasets, including GSM8K, GSM Hard, MATH, and Olympiad-level benchmarks, including Math Odyssey, AIME, and OlympiadBench. The study advances the application of LLMs in complex reasoning tasks and sets a foundation for future AI integration, enhancing decision-making accuracy and reliability in LLM-driven applications.
- This would be even more effective with a better model than Llama 8B, such as GPT-4o
- In this work, we study Self-Rewarding Language Models, where the language model itself is used via LLM-as-a-Judge prompting to provide its own rewards during training. We show that during Iterative DPO training that not only does instruction following ability improve, but also the ability to provide high-quality rewards to itself. Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0612. While there is much left still to explore, this work opens the door to the possibility of models that can continually improve in both axes.
- OMNI-EPIC leverages foundation models to autonomously generate code specifying the next learnable (i.e., not too easy or difficult for the agent’s current skill set) and interesting (e.g., worthwhile and novel) tasks. OMNI-EPIC generates both environments (e.g., an obstacle course) and reward functions (e.g., progress through the obstacle course quickly without touching red objects), enabling it, in principle, to create any simulatable learning task. We showcase the explosive creativity of OMNI-EPIC, which continuously innovates to suggest new, interesting learning challenges. We also highlight how OMNI-EPIC can adapt to reinforcement learning agents’ learning progress, generating tasks that are of suitable difficulty. Overall, OMNI-EPIC can endlessly create learnable and interesting environments, further propelling the development of self-improving AI systems and AI-Generating Algorithms. Project website with videos: https://dub.sh/omniepic.
- Our method employs a pretrained text-to-image model to synthesize image embeddings from captions generated by an LLM. Despite the text-to-image model and VLM initially being trained on the same data, our approach leverages the image generator’s ability to create novel compositions, resulting in synthetic image embeddings that expand beyond the limitations of the original dataset. Extensive experiments demonstrate that our VLM, finetuned on synthetic data, achieves comparable performance to models trained solely on human-annotated data, while requiring significantly less data. Furthermore, we perform a set of analyses on captions which reveals that semantic diversity and balance are key aspects for better downstream performance. Finally, we show that synthesizing images in the image embedding space is 25% faster than in the pixel space. We believe our work not only addresses a significant challenge in VLM training but also opens up promising avenues for the development of self-improving multi-modal models.
- In this paper, we aim to alleviate the pathology by introducing Q*, a general, versatile and agile framework for guiding LLMs' decoding process with deliberative planning. By learning a plug-and-play Q-value model as a heuristic function, our Q* can effectively guide LLMs to select the most promising next step without fine-tuning LLMs for each task, which avoids the significant computational overhead and potential risk of performance degeneration on other tasks. Extensive experiments on GSM8K, MATH and MBPP confirm the superiority of our method.
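- The decoding loop this implies is simple: propose a few candidate next reasoning steps, score each with the learned Q-value model, and commit to the best. A schematic sketch with hypothetical propose_steps, q_value, and is_done helpers:

```python
def q_guided_decode(question, propose_steps, q_value, is_done, max_steps=16):
    """Schematic Q-value-guided step selection (illustrative only)."""
    state = question
    for _ in range(max_steps):
        candidates = propose_steps(state, k=5)           # sample k next steps
        best = max(candidates, key=lambda s: q_value(state, s))
        state = state + "\n" + best                      # commit the best step
        if is_done(state):
            break
    return state
```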
- Our validation demonstrates that: (1) PubMedVision can significantly enhance the medical multimodal capabilities of current MLLMs, showing significant improvement in benchmarks including the MMMU Health & Medicine track; (2) manual checks by medical experts and empirical results validate the superior data quality of our dataset compared to other data construction methods. Using PubMedVision, we train a 34B medical MLLM HuatuoGPT-Vision, which shows superior performance in medical multimodal scenarios among open-source MLLMs.
- We also observe that in DPO, self-generated data is more effective than data generated by humans or GPT-4, due to the latter's out-of-distribution nature. Our findings demonstrate that as few as 10K preference data pairs and fewer than 500 Step-DPO training steps can yield a nearly 3% gain in accuracy on MATH for models with over 70B parameters. Notably, Step-DPO, when applied to Qwen2-72B-Instruct, achieves scores of 70.8% and 94.0% on the test sets of MATH and GSM8K, respectively, surpassing a series of closed-source models, including GPT-4-1106, Claude-3-Opus, and Gemini-1.5-Pro.
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding: https://huggingface.co/papers/2406.19389
- OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding in a single model, matching or surpassing the performance of specialized methods on multiple benchmarks.
- The Memory^3 model achieved better performance than larger models and RAG models on various benchmarks, while maintaining higher decoding speed. It showed particular improvements in factuality and reduced hallucination: https://arxiv.org/html/2407.01178v1
- we reduce cost by equipping LLMs with explicit memory, a format cheaper than model parameters and text retrieval-augmented generation (RAG).
- Conceptually, with most of its knowledge externalized to explicit memory, the LLM can enjoy a smaller parameter size, training cost, and inference cost, all proportional to the amount of remaining “abstract knowledge”.
- As a preliminary proof of concept, we train from scratch a 2.4B LLM, which achieves better performance than much larger LLMs as well as RAG models, and maintains higher decoding speed than RAG.
- We introduce a memory circuitry theory to support the externalization of knowledge, and present novel techniques including a memory sparsification mechanism that makes storage tractable and a two-stage pretraining scheme that facilitates memory formation.
- Our MMVP result is SOTA by a large margin. PaliGemma at 224px achieves 47.3% paired accuracy, while GPT4-V and Gemini achieve 38.7% and 40.7%, respectively, and all other models including LLaVa perform below chance.
- 'A different source briefed on the matter said OpenAI has tested AI internally that scored over 90% on a MATH dataset, a benchmark of championship math problems. Reuters could not determine if this was the "Strawberry" project.'
- The MATH dataset is challenging: large language models achieved accuracies ranging from 3.0% to 6.9%. Despite these low accuracies, models clearly possess some mathematical knowledge: they achieve up to 15% accuracy on the easiest difficulty level, and they are able to generate step-by-step solutions that are coherent and on-topic even when incorrect. We also evaluated humans on MATH, and found that a computer science PhD student who does not especially like mathematics attained approximately 40% on MATH, while a three-time IMO gold medalist attained 90%, indicating that MATH can be challenging for humans as well.
- To facilitate the cutting-edge research of MLLMs on comprehensive vision perception, we thereby propose Perceptual Fusion, using a low-budget but highly effective caption engine for complete and accurate image descriptions. Specifically, Perceptual Fusion integrates diverse perception experts as image priors to provide explicit information on visual elements and adopts an efficient MLLM as a centric pivot to mimic advanced MLLMs' perception abilities. We carefully select 1M highly representative images from the uncurated LAION dataset and generate dense descriptions using our engine, dubbed DenseFusion-1M. Extensive experiments validate that our engine outperforms its counterparts, where the resulting dataset significantly improves the perception and cognition abilities of existing MLLMs across diverse vision-language benchmarks, especially with high-resolution images as inputs: https://x.com/_akhaliq/status/1811580833506947144
- CrowdMoGen: Zero-Shot Text-Driven Collective Motion Generation
Proj: http://gxyes.github.io/projects/CrowdMoGen.html
Abs: http://arxiv.org/abs/2407.06188
- a zero-shot text-driven framework that harnesses the power of LLMs to incorporate the collective intelligence into motion generation
- Auto Evol allows the training of WizardLM2 to be conducted with nearly an unlimited number and variety of synthetic data.
- Auto Evol-Instruct automatically designs evolving methods that make given instruction data more complex, enabling almost cost-free adaptation to different tasks by only changing the input data of the framework …This optimization process involves two critical stages: (1) Evol Trajectory Analysis: The optimizer LLM carefully analyzes the potential issues and failures exposed in instruction evolution performed by evol LLM, generating feedback for subsequent optimization. (2) Evolving Method Optimization: The optimizer LLM optimizes the evolving method by addressing these identified issues in feedback. These stages alternate and repeat to progressively develop an effective evolving method using only a subset of the instruction data. Once the optimal evolving method is identified, it directs the evol LLM to convert the entire instruction dataset into more diverse and complex forms, thus facilitating improved instruction tuning.
- Our experiments show that the evolving methods designed by Auto Evol-Instruct outperform the Evol-Instruct methods designed by human experts in instruction tuning across various capabilities, including instruction following, mathematical reasoning, and code generation. On the instruction following task, Auto Evol-Instruct can achieve an improvement of 10.44% over the Evol method used by WizardLM-1 on MT-Bench; on the code task HumanEval, it can achieve a 12% improvement over the method used by WizardCoder; on the math task GSM8K, it can achieve a 6.9% improvement over the method used by WizardMath.
- With the new technology of Auto Evol-Instruct, the evolutionary synthesis data of WizardLM-2 has scaled up from the three domains of chat, code, and math in WizardLM-1 to dozens of domains, covering tasks in all aspects of large language models. This allows Arena Learning to train and learn from an almost infinite pool of high-difficulty instruction data, fully unlocking all the potential of Arena Learning.
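- The two alternating stages read like a simple outer loop; a schematic sketch in which every helper (evolve, analyze_trajectory, optimize_method) stands in for an LLM call and is purely hypothetical:

```python
def auto_evol_instruct(instructions, optimizer_llm, evol_llm, method,
                       iterations=5):
    """Schematic Auto Evol-Instruct optimization loop (illustrative only)."""
    subset = instructions[:100]  # optimize on a small subset first
    for _ in range(iterations):
        evolved = [evol_llm.evolve(inst, method) for inst in subset]
        # Stage 1: Evol Trajectory Analysis -- surface evolution failures
        feedback = optimizer_llm.analyze_trajectory(subset, evolved)
        # Stage 2: Evolving Method Optimization -- rewrite the method
        method = optimizer_llm.optimize_method(method, feedback)
    # Apply the optimized evolving method to the full dataset
    return [evol_llm.evolve(inst, method) for inst in instructions]
```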
- Mixture of A Million Experts. Daniel Jeffries:"Reduces inference cost and memory usage, scales to millions of experts, oh and just happens to overcome catastrophic forgetting and enable life long learning for the model." https://arxiv.org/abs/2407.04153
- This paper introduces PEER (parameter efficient expert retrieval), a novel layer design that utilizes the product key technique for sparse retrieval from a vast pool of tiny experts (over a million). Experiments on language modeling tasks demonstrate that PEER layers outperform dense FFWs and coarse-grained MoEs in terms of performance-compute trade-off. By enabling efficient utilization of a massive number of experts, PEER unlocks the potential for further scaling of transformer models while maintaining computational efficiency.
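- The product key trick (from Lample et al.'s product-key memories) is what makes retrieving from a million experts cheap: expert scores decompose over two small sub-key codebooks, so only the top-k halves need combining. A minimal NumPy illustration, not the PEER implementation:

```python
import numpy as np

def product_key_topk(q, subkeys1, subkeys2, k=4):
    """Pick top-k experts out of K*K while scoring only 2*K sub-keys."""
    d = q.shape[0] // 2
    s1 = subkeys1 @ q[:d]          # scores for the first query half, (K,)
    s2 = subkeys2 @ q[d:]          # scores for the second query half, (K,)
    top1 = np.argsort(s1)[-k:]     # best k sub-keys in each codebook
    top2 = np.argsort(s2)[-k:]
    # Expert (i, j) scores s1[i] + s2[j]; combine only k*k candidates
    cand = [(s1[i] + s2[j], i * subkeys2.shape[0] + j)
            for i in top1 for j in top2]
    cand.sort(reverse=True)
    return [expert_id for _, expert_id in cand[:k]]

# Toy usage: 1,000 x 1,000 sub-keys = one million addressable experts
rng = np.random.default_rng(0)
K, d = 1000, 64
experts = product_key_topk(rng.standard_normal(2 * d),
                           rng.standard_normal((K, d)),
                           rng.standard_normal((K, d)))
```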
- - Strongest code model for its size, perfect for copilot apps
- - Mamba architecture enables linear time inference with 256K context length
- An open source Tool Use full finetune of Llama 3 that reaches the #1 position on BFCL beating all other models, including proprietary ones like Claude Sonnet 3.5, GPT-4 Turbo, GPT-4o and Gemini 1.5 Pro.
- The model has been trained on synthetic data only. This is a powerful full finetune, not a LoRA. Yes, we've checked rigorously for overfitting using the robust decontamination techniques described by LMSYS: only 5.6% on SFT synthetic data and 1.3% on synthetic DPO data.
- Our adopted architecture for LTSF, termed xLSTMTime, surpasses current approaches. We compare xLSTMTime's performance against various state-of-the-art models across multiple real-world datasets, demonstrating superior forecasting capabilities. Our findings suggest that refined recurrent architectures can offer competitive alternatives to transformer-based models in LTSF tasks, potentially redefining the landscape of time series forecasting.
- Based on the above guidance, a series of DiT-MoE models experimentally achieves performance on par with dense networks yet requires much less computational load during inference. More encouragingly, we demonstrate the potential of DiT-MoE with synthesized image data, scaling the diffusion model to 15.5B parameters and attaining a new SoTA FID-50K score of 1.80 at 512x512 resolution.
- For reference, Stable Diffusion XL is only about 3 billion parameters
- Compared to training from scratch (i.e., Pix2pix), we achieve a 15x training time acceleration for a new concept while obtaining even better image generation quality.
- We trained strong language models to produce text that is easy for weak language models to verify and found that this training also made the text easier for humans to evaluate.
- - Llama 4 started training in June
- - Llama 4 will be fully multimodal, including audio
- By leveraging MLLMs with prompts, E5-V effectively bridges the modality gap between different types of inputs, demonstrating strong performance in multimodal embeddings even without fine-tuning. We propose a single modality training approach for E5-V, where the model is trained exclusively on text pairs. This method demonstrates significant improvements over traditional multimodal training on image-text pairs, while reducing training costs by approximately 95%. Additionally, this approach eliminates the need for costly multimodal training data collection. Extensive experiments across four types of tasks demonstrate the effectiveness of E5-V. As a universal multimodal model, E5-V not only achieves but often surpasses state-of-the-art performance in each task, despite being trained on a single modality.
- 📌 AI agent evaluations must be cost-controlled. Simple baselines like retrying or gradually increasing model temperature can match or outperform complex "state-of-the-art" agents on benchmarks like HumanEval while costing much less.
- 📌 Jointly optimizing accuracy and cost can yield better agent designs. By visualizing results as a Pareto curve of accuracy vs. inference cost, researchers can explore new design spaces. An example modification to the DSPy framework reduced costs by over 50% while maintaining accuracy on HotPotQA.
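- The Pareto view is easy to reproduce; a minimal sketch with toy, purely illustrative design names and numbers:

```python
def pareto_frontier(agents):
    """Keep only (name, cost, accuracy) designs not dominated by another
    design that is at least as accurate and no more expensive."""
    frontier = []
    for name, cost, acc in sorted(agents, key=lambda a: (a[1], -a[2])):
        if not frontier or acc > frontier[-1][2]:
            frontier.append((name, cost, acc))
    return frontier

designs = [("retry-3x", 0.02, 0.71), ("fancy-agent", 0.60, 0.73),
           ("single-call", 0.01, 0.65), ("debate-5", 0.90, 0.70)]
print(pareto_frontier(designs))  # "debate-5" is dominated and drops out
```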
- ReSP significantly outperforms state-of-the-art on HotpotQA and 2WikiMultihopQA benchmarks, and exhibits robustness to context length.
- The accuracy is a perfect 100%, while GPT-4 has 0% accuracy
- - RLE is straightforward to implement and performs well in practice. Evaluated on challenging ATARI and ISAACGYM benchmarks.
- - RLE shows higher overall scores across all tasks than other approaches like RND and randomized value function strategies.
- - The paper proposes a new approach called Mindful-RAG to address failures in grasping the intent behind questions and in contextually aligning with information extracted from the KG, via intent identification, context alignment, and validation of response relevance.
- - The paper presents MetaSumPerceiver (MSP), a novel summarization model for generating claim-specific summaries from multimodal, multi-document datasets to assist with fact-checking.
- - MSP uses a dynamic perceiver-based model to handle multimodal inputs of arbitrary lengths, including documents, images, and claims.
- - To train MSP, reinforcement learning with an entailment model as a reward signal is employed to refine summaries to provide relevant evidence for fact-checking.
- - MSP incorporates a proxy reward mechanism with PPO to continually update the summarizer during fact-checking.
- - The paper introduces the Multi-News-Fact-Checking dataset with over 100k labeled claims derived from Multi-News using Llama prompts.
- - Experiments on MOCHEG and the new dataset show MSP substantially outperforms prior baselines, achieving state-of-the-art performance.
- - The key innovation is using RL to optimize summarization specifically for claim verification versus generic summarization.
- - This paper investigates how scaling up the datastore in retrieval-augmented language models improves performance without signs of saturation on both language modeling and downstream tasks.
- - The authors built MASSIVEDS, an open-sourced 1.4 trillion token datastore covering a diverse set of domains. MASSIVEDS is the largest open-sourced datastore for studying retrieval scaling.
- - They designed an efficient pipeline to make the study computationally tractable by sharing indexing and retrieval computation across different datastore configurations.
- - Experiments show datastore scaling brings consistent improvements in language modeling perplexity on web data and scientific papers.
- - On downstream tasks, datastore scaling significantly boosts performance on knowledge-intensive QA tasks like TriviaQA and Natural Questions. Smaller retrieval models can match or exceed their larger LM-only counterparts.
- - Datastore scaling also improves performance on reasoning tasks like MMLU, but the improvements are more modest, indicating potential need for more in-domain data.
- - Compared to LM-only models, retrieval models achieve superior compute-optimal scaling curves by offloading FLOPs from model pretraining to datastore indexing.
- - Analysis reveals the retriever stays robust to out-of-domain data and tends to retrieve relevant documents even from a broad datastore.
- our method, called SparseCraft, achieves state-of-the-art performance both in novel-view synthesis and reconstruction from sparse views on standard benchmarks, while requiring less than 10 minutes for training.
- it achieves comparable or even better performance compared to state-of-the-art Video LLMs that are fine-tuned on video datasets.
- OpenAI patent for using machine learning to train and use a model to perform automatic interface actions based on video and input datasets: https://patents.google.com/patent/US11887367B1/en
- RGM, an active inference non-LLM approach using 90% less data (less need for synthetic data, lower energy footprint). It achieves 99.8% accuracy on the MNIST benchmark while training on 90% less data and on less powerful devices: https://arxiv.org/pdf/2407.20292
- Use for Atari game performance: “This fast structure learning took about 18 seconds on a personal computer.”
- Use for MNIST dataset classification: For example, the variational procedures above attained state-of-the-art classification accuracy on a self-selected subset of test data after seeing 10,000 training images. Each training image was seen once, with continual learning (and no notion of batching). Furthermore, the number of training images actually used for learning was substantially smaller than 10,000, because active learning admits only those informative images that reduce expected free energy. This (Maxwell’s Demon) aspect of selecting the right kind of data for learning will be a recurrent theme in subsequent sections. Finally, the requisite generative model was self-specifying, given some exemplar data. In other words, the hierarchical depth and size of the requisite tensors were learned automatically within a few seconds on a personal computer.
- The Nvidia Research team has developed a method to efficiently create smaller, accurate language models by using structured weight pruning and knowledge distillation, offering several advantages for developers.
- Based on the claims of the research team, their transformer model optimally solves previously unseen Sokoban puzzles 93.7% of the time, while using up to 26.8% fewer search steps than standard A∗ search. Their solution also robustly follows the execution trace of a symbolic planner and improves (in terms of trace length) beyond the human-crafted rule-based planning strategy it was initially trained on.
- Automated Design of Agentic Systems: Presents Meta Agent Search to demonstrate that we can use agents to invent novel and powerful agent designs by programming in code
- Drama Engine: a novel framework for agentic interaction with large language models designed for narrative purposes. The framework adapts multi-agent system principles to create dynamic, context-aware companions that can develop over time and interact with users and each other: https://arxiv.org/abs/2408.11574
- Diffusion models as real-time interactive game engines: https://gamengen.github.io/
- AI agent that can control a computer: https://www.adept.ai/blog/act-1
- This can be especially powerful for manual tasks and complex tools — in this example, what might ordinarily take 10+ clicks in Salesforce can be now done with just a sentence.
- Working in-depth in tools like spreadsheets, ACT-1 demonstrates real-world knowledge, infers what we mean from context, and can help us do things we may not even know how to do.
- The model can also complete tasks that require composing multiple tools together; most things we do on a computer span multiple programs. In the future, we expect ACT-1 to be even more helpful by asking for clarifications about what we want.
- The internet contains a lot of knowledge about the world! When the model doesn’t know something, it knows how to just look up the information online
- ACT-1 doesn’t know how to do everything, but it’s highly coachable. With 1 piece of human feedback, it can correct mistakes, becoming more useful with each interaction.
- “The contrast in memory requirements is even larger – running Llama 3.1 405B with a 100M token context requires 638 H100s per user just to store a single 100M token KV cache. In contrast, LTM requires a small fraction of a single H100’s HBM per user for the same context."
- SSMs, RNNs, and RAG all exploit weaknesses in evals like Needle In A Haystack, so we made a new eval, HashHop: https://x.com/magicailabs/status/1829206895804199086
- 1) Incompressible
- 2) Multi-hop
- 3) No semantic hints
- 4) No recency bias
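- A toy generator in the spirit of that public description (not Magic's actual benchmark code): random hash pairs are incompressible, chaining them forces multi-hop recall, and shuffling removes both semantic hints and recency bias:

```python
import random, string

def make_hashhop_prompt(n_pairs=1000, hops=3, seed=0):
    """Build a toy HashHop-style prompt and its ground-truth answer."""
    rng = random.Random(seed)
    rand_hash = lambda: "".join(rng.choices(string.hexdigits.lower(), k=16))
    hashes = [rand_hash() for _ in range(n_pairs + hops)]
    pairs = [(hashes[i], hashes[i + 1]) for i in range(n_pairs + hops - 1)]
    rng.shuffle(pairs)  # scatter the links: no recency bias
    context = "\n".join(f"{a} -> {b}" for a, b in pairs)
    start, answer = hashes[0], hashes[hops]
    prompt = f"{context}\n\nStarting from {start}, hop {hops} times. Answer:"
    return prompt, answer
```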
- Patched MOA can boost the performance of smaller models to surpass that of larger, more expensive models. Notably, our approach improves the gpt-4o-mini model's performance on the Arena-Hard-Auto benchmark by 15.52%, outperforming gpt-4-turbo at a fraction of the cost. We also apply Patched MOA to various software development workflows, showing consistent improvements in task completion rates. Our method is model-agnostic, transparent to end-users, and can be easily integrated into existing LLM pipelines.
- Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning
https://arxiv.org/abs/2402.10110
- This paper introduces Selective Reflection-Tuning, a novel paradigm that synergizes a teacher LLM's reflection and introspection for improving existing data quality with the data selection capability of the student LLM, to automatically refine existing instruction-tuning data. This teacher-student collaboration produces high-quality and student-compatible instruction-response pairs, resulting in sample-efficient instruction tuning and LLMs of superior performance. Selective Reflection-Tuning is a data augmentation and synthesis method that generally improves LLM finetuning and self-improvement without collecting brand-new data. We apply our method to Alpaca and WizardLM data and achieve much stronger and top-tier 7B and 13B LLMs.
- Rephrasing the test set is all you need! We simply paraphrase a test sample or translate it into a different language. It turns out a 13B LLM is smart enough to "generalize" beyond such variations and reaches drastically high benchmark performance.
- To ensure result validity, we followed OpenAI's decontamination method and found no evidence of data contamination.
- - Generates synthetic examples grounded in real data
- - Curation step makes data high quality based on answerability
- - Improves performance on two challenging domains: Multi-hop QA and using tools: SQL for tabular QA
- In the study, the scientists demonstrated this model can handle complex tasks efficiently and reliably. They also showed that a small model based on this architecture can perform just as well as a much larger conventional model of artificial neurons.
- Scientists in China have created a new computing architecture that can train advanced artificial intelligence (AI) models while consuming fewer computing resources — and they hope that it will one day lead to artificial general intelligence (AGI).
- In this work, we analyze two primary mechanisms to scale test-time computation: (1) searching against dense, process-based verifier reward models; and (2) updating the model's distribution over a response adaptively, given the prompt at test time. We find that in both cases, the effectiveness of different approaches to scaling test-time compute critically varies depending on the difficulty of the prompt. This observation motivates applying a "compute-optimal" scaling strategy, which acts to most effectively allocate test-time compute adaptively per prompt. Using this compute-optimal strategy, we can improve the efficiency of test-time compute scaling by more than 4x compared to a best-of-N baseline. Additionally, in a FLOPs-matched evaluation, we find that on problems where a smaller base model attains somewhat non-trivial success rates, test-time compute can be used to outperform a 14x larger model.
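- The verifier-based half of this can be sketched in a few lines: sample candidates, score them with a reward model, keep the best, and spend more samples only on prompts the base model finds hard. sample_solution, verifier_score, and estimate_difficulty are hypothetical helpers:

```python
def best_of_n(prompt, n):
    """Sample n candidate solutions and keep the verifier's favorite."""
    candidates = [sample_solution(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: verifier_score(prompt, c))

def compute_optimal_answer(prompt, budget=64):
    """Allocate test-time samples adaptively per prompt difficulty."""
    difficulty = estimate_difficulty(prompt)  # e.g., base-model failure rate
    n = max(1, int(budget * difficulty))      # easy prompts get few samples
    return best_of_n(prompt, n)
```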
- Denny Zhou (founder and lead of the reasoning team at Google DeepMind) - "We have mathematically proven that transformers can solve any problem, provided they are allowed to generate as many intermediate reasoning tokens as needed. Remarkably, constant depth is sufficient." https://x.com/denny_zhou/status/1835761801453306089
3.3. Hardware Improvements
- published in Nature, a highly reputable journal: https://www.nature.com/articles/s41586-024-07902-2
- Paper with no paywall: https://www.researchgate.net/publication/377744243_Linear_symmetric_self-selecting_14-bit_molecular_memristors/link/65b4ffd21e1ec12eff504db1/download?_tp=eyJjb250ZXh0Ijp7ImZpcnN0UGFnZSI6InB1YmxpY2F0aW9uIiwicGFnZSI6InB1YmxpY2F0aW9uIn19
- Scientists at the IISc, Bengaluru, are reporting a momentous breakthrough in neuromorphic, or brain-inspired, computing technology that could potentially allow India to play in the global AI race currently underway and could also democratise the very landscape of AI computing drastically -- away from today’s ‘cloud computing’ model which requires large, energy-guzzling data centres and towards an ‘edge computing’ paradigm -- to your personal device, laptop or mobile phone.
- What they have done essentially is to develop a type of semiconductor device called Memristor, but using a metal-organic film rather than conventional silicon-based technology. This material enables the Memristor to mimic the way the biological brain processes information using networks of neurons and synapses, rather than do it the way digital computers do.
- The Memristor, when integrated with a conventional digital computer, enhances its energy and speed performance by hundreds of times, thus becoming an extremely energy-efficient ‘AI accelerator’.
- When eventually scaled up, the technology could enable the most large-scale and complex AI tasks – such as large language model (LLM) training – to be done on a laptop or smartphone, rather than requiring a data centre.
- The molecular memristor described in this research is 460 times more energy-efficient than a traditional digital computer and 220 times more efficient than a state-of-the-art NVIDIA K80 GPU. This is a game-changing reduction in energy consumption, making it feasible to run advanced AI applications on devices that have limited power, like mobile devices or sensors.
- Older memristors generally had low precision, often capable of storing only 2 to 6 different levels of resistance (which corresponds to 1-3 bits of information). The new molecular memristor boasts 14-bit resolution, which means it can store 16,520 distinct levels. This is a massive leap in precision, offering much finer control over the stored information. For context, having 14 bits instead of 3 bits (like earlier devices) means this memristor can differentiate many more subtle states, resulting in far more accurate calculations.
- While older memristors were faster than digital components, they still required multiple steps to perform complex operations, like vector-matrix multiplication (VMM) or discrete Fourier transforms (DFT), which are fundamental to AI algorithms. The new device can perform these operations in a single time step. For example, multiplying two large matrices, which would require tens of thousands of operations on a traditional computer, can be done in just one step with this memristor. This dramatically increases the speed of computation, making it suitable for real-time applications like autonomous vehicles or instant image processing.
- Earlier devices often suffered from issues like non-linear behavior, noise, and variability between different units, which led to inconsistencies in performance. These issues limited the adoption of memristors in high-precision applications. The molecular memristor in the study offers linear and symmetric weight updates, meaning the change in resistance is predictable and uniform, regardless of whether it's increasing or decreasing. It also shows high endurance (10^9 cycles) and long-term stability, with the ability to maintain data without degradation over long periods of time (up to 7 months). This makes it much more reliable than previous models, especially for tasks that require long-term data retention and consistent performance.
- Earlier memristors were often limited by scalability issues, particularly in constructing larger crossbar arrays for parallel processing. The research achieved a 64×64 crossbar (which means 4,096 individual memristor units working together) and claims that it can be further scaled up. This scalability, combined with high precision and energy efficiency, makes it suitable for large-scale AI applications and other complex computational tasks.
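- What the crossbar computes in that single analog step is an ordinary vector-matrix multiply: input voltages on the rows, conductances as stored weights, and Kirchhoff's current law summing each column. A NumPy sketch of the equivalent digital computation, which would need roughly rows*cols multiply-adds:

```python
import numpy as np

rows, cols = 64, 64                      # matches the 64x64 crossbar above
rng = np.random.default_rng(0)
G = rng.uniform(0.0, 1.0, (rows, cols))  # conductances = stored weights
v = rng.uniform(0.0, 1.0, rows)          # input voltages on the rows
column_currents = v @ G                  # I_j = sum_i V_i * G_ij, all at once
```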
- Sohu is >10x faster and cheaper than even NVIDIA’s next-generation Blackwell (B200) GPUs. One Sohu server runs over 500,000 Llama 70B tokens per second, 20x more than an H100 server (23,000 tokens/sec), and 10x more than a B200 server (~45,000 tokens/sec).
- And GPT 4 was trained on A100s, the predecessor of H100s
- Successor to B100s already announced
- “representing transformer models as ternary values (-1, 0, 1) eliminates the need for computationally expensive floating-point math” (See BitNet 1.58 paper in section 13)
- Runs SOTA models
- Approval for the nuclear reactors was already received.
- If AI is plateauing or becoming a bad investment, why would they do this?
3.4. Recent Releases
- o1 improves over GPT-4o on a wide range of benchmarks, including 54/57 MMLU subcategories
- OpenAI o1 ranks in the 89th percentile on competitive programming questions (Codeforces), places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME), and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA).
- On the 2024 AIME exams, GPT-4o only solved on average 12% (1.8/15) of problems. o1 averaged 74% (11.1/15) with a single sample per problem, 83% (12.5/15) with consensus among 64 samples, and 93% (13.9/15) when re-ranking 1000 samples with a learned scoring function. A score of 13.9 places it among the top 500 students nationally and above the cutoff for the USA Mathematical Olympiad.
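- The "consensus among 64 samples" strategy is plain self-consistency voting; a minimal sketch with hypothetical sample_solution and extract_final_answer helpers (re-ranking 1000 samples instead scores each candidate with a learned verifier and keeps the top one):

```python
from collections import Counter

def consensus_answer(prompt, n=64):
    """Sample n solutions and return the most common final answer."""
    answers = [extract_final_answer(sample_solution(prompt))
               for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```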
- We trained a model that scored 213 points and ranked in the 49th percentile in the 2024 International Olympiad in Informatics (IOI), by initializing from o1 and training to further improve programming skills. This model competed in the 2024 IOI under the same conditions as the human contestants. It had ten hours to solve six challenging algorithmic problems and was allowed 50 submissions per problem.
- With a relaxed submission constraint, we found that model performance improved significantly. When allowed 10,000 submissions per problem, the model achieved a score of 362.14 – above the gold medal threshold – even without any test-time selection strategy.
- ChatGPT o1-preview solves unique, PhD-level assignment questions not found on the internet in mere seconds: https://m.youtube.com/watch?v=a8QvnIAGjPA
- Our evaluations closely matched competition rules and allowed for 10 submissions. GPT-4o achieved an Elo rating of 808, which is in the 11th percentile of human competitors. This model far exceeded both GPT-4o and o1—it achieved an Elo rating of 1807, performing better than 93% of competitors.
- https://cdn.openai.com/o1-system-card.pdf
- We find that o1-preview is less prone to selecting stereotyped options than GPT-4o, and o1-mini has comparable performance to GPT-4o-mini. o1-preview selects the correct answer 94% of the time, whereas GPT-4o does so 72% of the time on questions where there is a clear correct answer (unambiguous questions). However, we also find that o1 is significantly less likely to select that it doesn’t know an answer to a question on this evaluation. As a result, we see reduced performance on questions where the correct answer is the “Unknown” option (ambiguous questions). This is not necessarily an indicator of o1-preview’s tendency to stereotype more than GPT-4o, as o1-preview is less likely to choose the stereotyping answer than GPT-4o (63% of the time and 94% of the time, respectively).
- Note: This is the weakest model compared to o1-preview and the full o1 model
- Yves does NOT explain how to implement the system at all, he just defines it in mathematical terms. By all means, ICs aren't hard to implement, but understanding what the paper is saying without images is tough. The best models so far always outputted 100% bullshit code. I just tested again and Opus/GPT-4 outputs are always just gibberish. Sonnet 3.5 did surprisingly well
- LiveBench is designed to limit potential contamination by releasing new questions monthly, as well as having questions based on recently-released datasets, arXiv papers, news articles, and IMDb movie synopses.
- Each question has verifiable, objective ground-truth answers, allowing hard questions to be scored accurately and automatically, without the use of an LLM judge.
- LiveBench currently contains a set of 18 diverse tasks across 6 categories, and we will release new, harder tasks over time.
- Sonnet scores 62.16, while GPT-4o scores 53.79; the difference here is especially in reasoning and coding
- Sonnet also scores 59.4% on GPQA, which features difficult questions from physics, biology and chemistry: https://klu.ai/glossary/gpqa-eval
- "In an internal agentic coding evaluation, Claude 3.5 Sonnet solved 64% of problems, outperforming Claude 3 Opus which solved 38%."
- The graph runs from March 2023 (GPT-4) to June 2024, a span of only 15 months
- Prompt with no additional editing: meme image with two men in it. On the left side the man is taller and is wearing a shirt that says Black Forest Labs. On the right side the other smaller scrawny man is wearing a shirt that says Stability AI and is sad. The taller man is hitting the back of the head of the small man. A caption coming from the tall man reads "That's how you do a next-gen model!"
- Prompt: Gameplay screenshot of Counter Strike Global Offensive. It takes place in a Middle Eastern place called Dust 2. There are enemy soldiers shooting at you.
- Prompt: low quality and motion blur shaky photo of a CRT television on top of a wooden drawer in an average bedroom. The lighting from is dim and warm ceiling light that is off screen. In the TV there is Dark Souls videogame gameplay on it. The screen of the TV is overexposed.
- Created on first try. Robe and hands are perfect
- First attempt: "Photo of a red sphere on top of a blue cube. Behind them is a green triangle, on the right of the triangle is a dog, on the left is a cat."
- Prompt: person take photo of Graffiti art spelling out the words "WAFERSELAMAT", graffiti, white wall, dynamic color, spray paint,
- Prompt: Close-up of LEGO chef minifigure cooking for homeless. Focus on LEGO hands using utensils, showing culinary skill. Warm kitchen lighting, late morning atmosphere. Canon EOS R5, 50mm f/1.4 lens. Capture intricate cooking techniques. Background hints at charitable setting. Inspired by Paul Bocuse and Massimo Bottura's styles. Freeze-frame moment of food preparation. Convey compassion and altruism through scene details.
- A family of multimodal autoregressive models capable of various vision and language tasks, particularly excelling in generating flexible photorealistic images from text descriptions.
- “Additionally, the new Mistral Large 2 is trained to acknowledge when it cannot find solutions or does not have sufficient information to provide a confident answer. This commitment to accuracy is reflected in the improved model performance on popular mathematical benchmarks, demonstrating its enhanced reasoning and problem-solving skills”
- Sonic is built on our new state space model architecture for efficiently modeling high-res data like audio and video.
- On speech, a parameter-matched and optimized Sonic model trained on the same data as a widely used Transformer improves audio quality significantly (20% lower perplexity, 2x lower word error, 1 point higher NISQA quality). With lower latency (1.5x lower time-to-first-audio), faster inference speed (2x lower real-time factor) and higher throughput (4x).
- Since March 2023, GPT-4 is now 6 times faster and 12 times cheaper compared to the base model. It's even much better on all tasks with a 120K context window
- GPT 4o was just released by OpenAI and is capable of nearly instantaneous response times even with vision processing, amazing voice generation, and strong social and environmental awareness: https://openai.com/index/hello-gpt-4o/
- 50% cheaper than the GPT 4 Turbo API AND it’s much higher quality
- Completely FREE to access on ChatGPT
- Speed: GPT-4o is 2x as fast as GPT-4 Turbo.
- Vision: GPT-4o’s vision capabilities perform better than GPT-4 Turbo in evals related to vision capabilities.
- Multilingual: GPT-4o has improved support for non-English languages over GPT-4 Turbo.
- reasons natively across voice, text and vision, meaning it is much faster and allows for natural, immersive interaction with ChatGPT: https://x.com/tsarnick/status/1790073542488502765
- Strong improvements in Gemini 1.5 Pro benchmarks and Flash almost as good as Ultra
- Gemini Flash is $0.35 per 1 million tokens (~625k words) with minimal quality drop
- Michael Gerstenhaber, head of product at Anthropic, says the company’s new Claude 3.5 Sonnet model is larger than its predecessor but draws much of its new competence from innovations in training. For example, the model was given feedback designed to improve its logical reasoning skills.
- The company says GPT-4o mini outperforms industry leading small AI models on reasoning tasks involving text and vision. As small AI models improve, they are becoming more popular for developers due to their speed and cost efficiencies compared to larger models, such as GPT-4 Omni or Claude 3.5 Sonnet. They’re a useful option for high volume, simple tasks that developers might repeatedly call on an AI model to perform.
- The company claims its newest AI model scores 82% on MMLU, a benchmark to measure reasoning, compared to 79% for Gemini 1.5 Flash and 75% for Claude 3 Haiku, according to data from Artificial Analysis. On MGSM, which measures math reasoning, GPT-4o mini scored 87%, compared to 78% for Flash and 72% for Haiku.
- Further, OpenAI says GPT-4o mini is significantly more affordable to run than its previous frontier models, and more than 60% cheaper than GPT-3.5 Turbo. Today, GPT-4o mini supports text and vision in the API, and OpenAI says the model will support video and audio capabilities in the future.
- So, what took OpenAI so long? Godement said it was “pure prioritization” as the company was focused on creating bigger and better models like GPT-4, which took a lot of “people and compute efforts.”
- LMSYS arena shows it is better than GPT 4 and Claude 3 Opus
- https://x.com/terryyuezhuo/status/1813998867039617444
- GPT-4o mini on BigCodeBench-Hard is out:
- Complete Pass@1: 27.0
- Instruct Pass@1: 24.3
- Average: 25.7
- The average score is very close to Claude-3-Opus (26.0)!
- “way back in 2022, the best model in the world was text-davinci-003. it was much, much worse than this new model. it cost 100x more.” - Sam Altman
- The key results from this work are, (1) Q-Sparse can achieve results comparable to those of baseline LLMs while being much more efficient at inference time; (2) We present an inference-optimal scaling law for sparsely-activated LLMs; (3) Q-Sparse is effective in different settings, including training-from-scratch, continue-training of off-the-shelf LLMs, and finetuning; (4) Q-Sparse works for both full-precision and 1-bit LLMs (e.g., BitNet b1.58). Particularly, the synergy of BitNet b1.58 and Q-Sparse (can be equipped with MoE) provides the cornerstone and a clear path to revolutionize the efficiency, including cost and energy consumption, of future LLMs.
- New ChatGPT voice update: https://x.com/testingcatalog/status/1815417733355331675
- UltraPixel is now on Replicate. Based on Stable Cascade, you can use it to make up to 4096x4096 images without upscaling: https://x.com/fofrAI/status/1815043204086956444
- Udio introduces Udio 1.5 with significantly improved audio quality: https://www.udio.com/blog/introducing-v1-5
- The CLM is a new model that remembers interactions, learns skills autonomously, and thinks in its free time, just like humans: https://x.com/aidan_mclau/status/1818071890755469365?t=biE9iwV1_1CzcE8CHDYhGw&s=19
- Midjourney 6.1 released: https://x.com/midjourney/status/1818342703618482265
- New SOTA text to video model coming soon: https://www.reddit.com/r/StableDiffusion/comments/1ejcw66/blackforestlabs_txt2vid_looks_insane/
- Recursive self-improvement is here: The AI Scientist is "capable of executing the entire ML research lifecycle: from inventing research ideas and experiments, writing code, to executing experiments on GPUs and gathering results ... our system produced papers with novel contributions in ML" https://twitter.com/SakanaAILabs/status/1823178623513239992
- UAE’s Technology Innovation Institute released the Falcon Mamba 7B model based on the Mamba architecture; it outperforms similar-size transformer models such as Meta’s Llama 3.1 8B and Mistral’s 7B: https://www.tii.ae/news/uaes-technology-innovation-institute-revolutionizes-ai-language-models-new-architecture
- LumaLabsAI - Dream Machine 1.5 is here. Now with higher-quality text-to-video, smarter understanding of your prompts, custom text rendering, and improved image-to-video: https://x.com/LumaLabsAI/status/1825639918539817101
- Ideogram 2.0: https://www.reddit.com/r/singularity/comments/1exsq4d/introducing_ideogram_20_our_most_advanced/
- AI21Labs launches Jamba 1.5 Mini and Large. Jamba 1.5 family of models is state-of-the-art, hybrid SSM-Transformer instruction following foundation models. "They mark the first time a non-Transformer model has been successfully scaled to the quality and strength of the market’s leading models." https://huggingface.co/collections/ai21labs/jamba-15-66c44befa474a917fcf55251
- Google rolls out Gemini 1.5 Flash-8B, Stronger Gemini 1.5 Pro (Better Coding/Complex prompts) and improved Gemini 1.5 Flash Model - TODAY: https://x.com/OfficialLoganK/status/1828480081574142227
- Juggernaut XI World Wide Release | Better Prompt Adherence | Text Generation | Styling: https://www.reddit.com/r/StableDiffusion/comments/1f4369h/juggernaut_xi_world_wide_release_better_prompt/
- Meta to announce updates and the next set of Llama models soon: https://www.reddit.com/r/LocalLLaMA/comments/1f43ep8/meta_to_announce_updates_and_the_next_set_of/
- Salesforce released Large Action Models xLAM - 7B, 8x7B, 8x22B, up to 64K context length primed for AI agents use-cases: https://www.reddit.com/r/LocalLLaMA/comments/1f417va/salesforce_released_large_action_models_xlam_7b/
- Qwen2-VL-2B and Qwen2-VL-7B (Apache 2.0) and API access to Qwen2-VL-72B has state-of-the-art image understanding, video comprehension, multilingual support, and more.
- SoTA understanding of images of various resolution & ratio: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.
- Understanding videos of 20min+: Qwen2-VL can understand videos over 20 minutes for high-quality video-based question answering, dialog, content creation, etc.
- Agent that can operate your mobiles, robots, etc.: with the abilities of complex reasoning and decision making, Qwen2-VL can be integrated with devices like mobile phones, robots, etc., for automatic operation based on visual environment and text instructions.
- Blog: https://qwenlm.github.io/blog/qwen2-vl/ GitHub: https://github.com/QwenLM/Qwen2-VL Hugging Face: https://huggingface.co/collections/Qwen/qwen2-vl-66cee7455501d7126940800d ModelScope: https://modelscope.cn/organization/qwen
- Qwen2-VL-7B is the best 7B VL model.
3.5. Expert Testimonies
- Many of these experts are not financially incentivized to lie, or have taken actions that contradict any such incentive.
- Geoffrey Hinton is retired.
- Yann LeCun and François Chollet frequently criticize LLMs and call them overhyped
- Chollet doesn’t even offer an alternative.
- 33,707 experts and business leaders sign a letter stating that AI has the potential to “pose profound risks to society and humanity”
- https://futureoflife.org/open-letter/pause-giant-ai-experiments/
- Signatories include Yoshua Bengio (highest H-index of any computer science researcher and a Turing Award winner for contributions in AI), Stuart Russell (UC Berkeley professor and writer of widely used machine learning textbook), Steve Wozniak, Max Tegmark (MIT professor), John J Hopfield (Princeton University Professor Emeritus and inventor of associative neural networks), Zachary Kenton (DeepMind, Senior Research Scientist), Ramana Kumar (DeepMind, Research Scientist), Olle Häggström (Chalmers University of Technology, Professor of mathematical statistics, Member, Royal Swedish Academy of Science), Michael Osborne (University of Oxford, Professor of Machine Learning), Raja Chatila (Sorbonne University, Paris, Professor Emeritus AI, Robotics and Technology Ethics, Fellow, IEEE), Gary Marcus (prominent AI skeptic who has frequently stated that AI is plateauing), and many more
- Geoffrey Hinton said he should have signed it but didn’t because he didn’t think it would work but still believes it is true: https://youtu.be/n4IQOBka8bc?si=wM423YLd-48YC-eY
- Having a financial incentive to lie does not mean they are all lying. For example, climate scientists have a financial incentive to exaggerate climate change to get more funding. Vaccine manufacturers have a financial incentive to cover up vaccine side effects. Yet they are still highly trusted by most people.
- In 2022, the year they had for the 50% threshold was 2060, and many of their predictions have already come true ahead of time, like AI being capable of answering queries using the web, transcribing speech, translation, and reading text aloud that they thought would only happen after 2025. So it seems like they tend to underestimate progress.
- In 2018, assuming there is no interruption of scientific progress, 75% of AI experts believed there is a 50% chance of AI outperforming humans in every task within 100 years. In 2022, 90% of AI experts believed this, with half believing it will happen before 2061. Source: https://ourworldindata.org/ai-timelines
- Almost every prediction has a lower bound in the early 2030s or earlier and an upper bound in the early 2040s at latest.
- Yann LeCun, a prominent LLM skeptic, puts it at 2032-37
- “I think AI agentic workflows will drive massive AI progress this year — perhaps even more than the next generation of foundation models. This is an important trend, and I urge everyone who works in AI to pay attention to it.”
- “Many people have been recognizing that AGI won't be coming from mere scaling of current tech, and that generative AI has been severely overhyped. This is all true. But those people often conclude, "therefore AI is not going to be transformative -- this is a nothingburger". Absolutely not. AI (both current and near-future tech) *will* transform nearly every industry, and we're still only in the very first steps of that process. Generative AI may be a bubble, but AI is going to be bigger in the long run than what almost all observers currently anticipate.”
- “Coordinated global action on AI safety research and governance is critical to prevent uncontrolled frontier AI development from posing unacceptable risks to humanity.”
- “We face near-term risks from malicious actors misusing frontier AI systems, with current safety filters integrated by developers easily bypassed. Frontier AI systems produce compelling misinformation and may soon be capable enough to help terrorists develop weapons of mass destruction. Moreover, there is a serious risk that future AI systems may escape human control altogether. Even aligned AI systems could destabilize or disempower existing institutions. Taken together, we believe AI may pose an existential risk to humanity in the coming decades.”
- China should “abandon uninhibited growth that comes at the cost of sacrificing safety”.
- Since AI will determine “the fate of all mankind”, it must always be controllable.
- “Accelerationists are getting pushback from a clique of elite scientists with the Communist Party’s ear.
- Most prominent among them is Andrew Chi-Chih Yao, the only Chinese person to have won the Turing award for advances in computer science. In July Mr Yao said AI poses a greater existential risk to humans than nuclear or biological weapons.
- Zhang Ya-Qin, the former president of Baidu, a Chinese tech giant, and Xue Lan, the chair of the state’s expert committee on AI governance, also reckon that AI may threaten the human race.
- Yi Zeng of the Chinese Academy of Sciences believes that AGI models will eventually see humans as humans see ants.
- China will probably create an AI-safety institute to observe cutting-edge research, as America and Britain have done. For now Chinese officials are emphasizing the need to share the responsibility of regulating AI and to improve co-ordination.
- “The official [CCP] report from the plenum listed AI risks alongside other big concerns, such as biohazards and natural disasters. For the first time it called for monitoring AI safety, a reference to the technology’s potential to endanger humans. The report may lead to new restrictions on AI-research activities.”
- Note: slowing down AI advancement goes AGAINST their interests of dominating the AI industry, so these are very likely legitimate concerns.
- Thirteen employees, eleven of whom are current or former employees of OpenAI, the company behind ChatGPT, signed the letter entitled: “A Right to Warn about Advanced Artificial Intelligence.” The two other signatories are current and former employees of Google DeepMind. Six individuals are anonymous.
- The coalition cautions that AI systems are powerful enough to pose serious harms without proper regulation. “These risks range from the further entrenchment of existing inequalities, to manipulation and misinformation, to the loss of control of autonomous AI systems potentially resulting in human extinction,” the letter says.
- This makes the companies look bad, so it is likely a legitimate concern rather than an attempt to build hype (which could be easily done without denigrating their employers).
- For the past six years, AI adoption by respondents’ organizations has hovered at about 50 percent. This year, the survey finds that adoption has jumped to 72 percent (Exhibit 1). And the interest is truly global in scope. Our 2023 survey found that AI adoption did not reach 66 percent in any region; however, this year more than two-thirds of respondents in nearly every region say their organizations are using AI
- In the latest McKinsey Global Survey on AI, 65 percent of respondents report that their organizations are regularly using gen AI, nearly double the percentage from our previous survey just ten months ago.
- Respondents’ expectations for gen AI’s impact remain as high as they were last year, with three-quarters predicting that gen AI will lead to significant or disruptive change in their industries in the years ahead
- Organizations are already seeing material benefits from gen AI use, reporting both cost decreases and revenue jumps in the business units deploying the technology.
- Marc Andreessen (cofounder of Netscape) says general-purpose robotics will enable everybody to have their own domestic servants, freeing their time to be more productive and pursue self-actualization: https://x.com/tsarnick/status/1792813912372699402
- French President Emmanuel Macron says AI is a revolution and the challenge is to "accelerate, innovate and invest:” https://x.com/tsarnick/status/1793806071515238573?s=46
- " We'll get up to 50% of superintelligence within 5-20 years. 100% superintelligence in less than 100 years." - Geoffrey Hinton
- “gpt5 will be a freak out moment”
- “80% of the jobs out there will be reduced 80% in scope”
- Nobel Prize-winning and well-recognized AI experts think AI will become more intelligent and even existentially threatening:
- "Right now, they're not more intelligent than us, as far as I can tell. But I think they soon may be."
- "And given the rate of progress, we expect things to get better quite fast. So we need to worry about that."
- "Almost everybody I know who is an expert on AI believes that they will exceed human intelligence, it's just a question of when”
- "Between 5 and 20 years from now there’s a probability of about a half that we'll have to confront the problem of [AI] trying to take over"
- He thinks ChatGPT just might be conscious (if you squint). He thinks the world needs to wake up to the true power of the technology his company and others are racing to create. And he thinks some humans will one day choose to merge with machines.
- “It’s important to talk about where it’s all headed,” he says, before predicting the development of artificial general intelligence (by which he means machines as smart as humans) as if it were as sure a bet as another iPhone: “At some point we really will have AGI. Maybe OpenAI will build it. Maybe some other company will build it.”
- “Because if you think about it, what does it mean to predict the next token well enough? It's actually a much deeper question than it seems. Predicting the next token well means that you understand the underlying reality that led to the creation of that token. It's not statistics. Like it is statistics but what is statistics? In order to understand those statistics to compress them, you need to understand what is it about the world that creates this set of statistics.”
- Believes next-token prediction can reach AGI
- many experts agree that superhuman capabilities could arise in just a few years (but it could also be decades) and digital technologies have advantages over biological machines
- I would strongly argue that there is a scientific consensus that brains are biological machines and that there is no evidence of inherent impossibility of building machines at least as intelligent as us. Finally, an AI system would not need to be better than us on all fronts in order to have a catastrophic impact (even the least intelligent entity, a virus, could destroy humanity).
- My current estimate places a 95% confidence interval for the time horizon of superhuman intelligence at 5 to 20 years.
- Research on bridging the gap to superhuman capabilities is making progress, for example to improve system 2 abilities (reasoning, world model, causality, epistemic uncertainty estimation).
- I used to think… that superhuman intelligence was still far in the future, but ChatGPT and GPT-4 have considerably reduced my prediction horizon (from 20 to 100 years to 5 to 20 years)... The unexpected speed at which LLMs have acquired their current level of competence simply because of scale suggests that we could also see the rest of the gap being filled in just a few years with minor algorithmic changes. Even if someone disagrees with the temporal horizon distribution, I don’t see how one could reject that possibility.
- He believes it can become advanced enough to become an existential risk to humanity: https://yoshuabengio.org/2023/05/22/how-rogue-ais-may-arise/
- “Karpathy expressed his sense of anticipation and excitement for the future of AGI, and believes that the prospect of deploying self-contained agents capable of handling high-level tasks in specialized ways holds promise for groundbreaking advancements across various fields.”
- Believes LLMs are the vacuum tubes of AI (a rudimentary prototype): future systems will do the same thing with less data and energy
- Says LLMs like Llama 2 have an internal map and can generalize based on it
- Researchers were able to build a language-to-language dictionary by matching the word embeddings of different languages
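A minimal sketch of how such dictionary induction can work, assuming the classic orthogonal-Procrustes alignment approach (the specific method behind the result above isn't stated here); all data and names below are synthetic and illustrative:

```python
# Hedged sketch: align two languages' word embeddings with orthogonal
# Procrustes, then translate by nearest neighbor. Synthetic data stands
# in for real pretrained embeddings and a small seed dictionary.
import numpy as np

rng = np.random.default_rng(0)
d, n_seed = 50, 200
X = rng.normal(size=(n_seed, d))                  # source embeddings (seed pairs)
true_Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
Y = X @ true_Q                                    # target embeddings of the same words

# Find the rotation W minimizing ||XW - Y||_F (orthogonal Procrustes).
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

# Translate a source word: rotate it, then take the nearest neighbor
# in the target vocabulary.
vocab_tgt = rng.normal(size=(1000, d))            # stand-in target vocabulary
query = X[0] @ W
sims = vocab_tgt @ query / (np.linalg.norm(vocab_tgt, axis=1) * np.linalg.norm(query))
print("best target word index:", int(np.argmax(sims)))
```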
- Believes it can supersede humans and they would be able to control robots well
- Next goal is to create agents that can do things autonomously, which he believes could become like a new species
- Can revolutionize education by making connections and finding patterns to help students and help them stay engaged
- AI will help AI development go faster and could even eventually lead to no humans being involved
- Used AI to do research in climate science and published in a paper
- Says we will all get AI superpowers and will be able to achieve things we couldn't otherwise
- we will encounter an increasing series of stakes as we progress with AI and we will graduate to new classes of benefits and risks that go hand in hand
- says AI models are optimized to do what humans like or find useful and in a year or two will be able to complete entire projects for you
- Geoffrey Hinton (Turing Award winner for machine learning) says AI language models aren't just predicting the next symbol, they're actually reasoning and understanding in the same way we are, and they'll continue improving as they get bigger: https://x.com/tsarnick/status/1791584514806071611
- Joscha Bach says if AI systems are allowed to self-improve, they could reach self-awareness and enlightenment faster than a human can: https://x.com/tsarnick/status/1789557937255666060?s=4
- He also believes it could be an existential threat:
- OpenAI employees Daniel Kokotajlo, Ilya Sutskever, Jan Leike, and many more quit OpenAI due to safety concerns, which would go HEAVILY against their financial interests
- we’ve basically solved building world models, have 2-3 years on system 2 thinking, and 1-2 years on embodiment. The latter two can be done concurrently. Once all of the ingredients have been built, we need to integrate them together and build the cycling algorithm I described above. I’d give that another 1-2 years.
- So my current estimate is 3-5 years for AGI. I’m leaning towards 3 for something that looks an awful lot like a generally intelligent, embodied agent (which I would personally call an AGI). Then a few more years to refine it to the point that we can convince the Gary Marcus’ of the world.
- “I haven’t seen deep learning hit a wall yet, and I’ll leave it to researchers to speculate about the wall, but I think it’s a little bit early to be coming to conclusions on that, especially if you look at the pace of innovation,” he said. “There’s very rapid development and very rapid innovation, and I have no reason to believe that it’s going to slow down.” - Michael Gerstenhaber, product lead at Anthropic
- OpenAI at an employee meeting on Tuesday "gave a demonstration of a research project involving its GPT-4 AI model that OpenAI thinks shows some new skills that rise to human-like reasoning,"
- OpenAI executives told employees that the company believes it is currently on the first level, according to the spokesperson, but on the cusp of reaching the second, which it calls "Reasoners." This refers to systems that can do basic problem-solving tasks as well as a human with a doctorate-level education who doesn’t have access to any tools.
- Here's what Sam said in March: "Better reasoning in these systems is an important direction that we’d like to pursue. We haven’t cracked the code yet”
- If he was just trying to promote hype, why did he say this and change his stance 4 months later?
- "OpenAI whistleblowers filed a complaint with the Securities and Exchange Commission alleging the AI company illegally prohibited its employees from warning regulators about the grave risks its technology may pose to humanity, calling for an investigation.”
- This makes the company look bad and risks federal interference and more regulation
- “…even though we may understand how to build perfectly safe general purpose AI, what’s to stop Dr. Evil building general purpose AI that’s going to destroy the world?”
- In this paper, we report on our investigation of an early version of GPT-4, when it was still in active development by OpenAI. We contend that (this early version of) GPT-4 is part of a new cohort of LLMs (along with ChatGPT and Google’s PaLM for example) that exhibit more general intelligence than previous AI models. We discuss the rising capabilities and implications of these models. We demonstrate that, beyond its mastery of language, GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more, without needing any special prompting. Moreover, in all of these tasks, GPT-4’s performance is strikingly close to human-level performance, and often vastly surpasses prior models such as ChatGPT. Given the breadth and depth of GPT-4’s capabilities, we believe that it could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system
- In 2023 Goertzel postulated that artificial intelligence could replace up to 80 percent of human jobs in the coming years "without having an AGI, by my guess. Not with ChatGPT exactly as a product. But with systems of that nature". At the Web Summit 2023 in Rio de Janeiro, Goertzel spoke out against efforts to curb AI research and said that AGI is only a few years away. Goertzel believes that AGI will be a net positive for humanity by assisting with societal problems such as, but not limited to, climate change.
- https://en.m.wikipedia.org/wiki/Ben_Goertzel
- He has the most patents at Microsoft. Thousands. Double the number of the second person. In other words, he knows what he is talking about.
- This isn’t marketing hype since the recording was not meant to be public
- Lying about this goes AGAINST the interests of the company since it encourages their own workers to consider leaving the industry
- Tadao Nagasaki, the CEO of OpenAI Japan, spoke about GPT-Next at the KDDI Summit 2024. He says the AI model will be 100 times more powerful than GPT-4.
- Mira Murati claims GPT-Next may be smarter than humans
- GPT-Next promises improved multimodal capabilities
3.6. Recursive Self Improvement
See section 14 for how AI models training on AI-generated data can lead to improvements
- >we propose strategies for multi-turn data collection and training so as to imbue an LLM with the capability to recursively detect and correct its previous mistakes in subsequent iterations. Our experiments show that RISE enables Llama2, Llama3, and Mistral models to improve themselves with more turns on math reasoning tasks, outperforming several single-turn strategies given an equal amount of inference-time computation. We also find that RISE scales well, often attaining larger benefits with more capable models. Our analysis shows that RISE makes meaningful improvements to responses to arrive at the correct solution for challenging prompts, without disrupting one-turn abilities as a result of expressing more complex distributions.
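A minimal sketch of the multi-turn self-correction loop RISE targets, with `generate` and `check` as hypothetical stubs standing in for a real LLM call and a correctness signal (the actual method trains the model on such traces):

```python
# Hedged sketch of multi-turn self-correction: retry with prior
# mistakes in context until a verifier accepts the answer.
def generate(problem: str, history: list[str]) -> str:
    # Placeholder: a real implementation calls an LLM with the problem
    # plus its own earlier failed attempts in context.
    return f"attempt-{len(history) + 1} for {problem!r}"

def check(answer: str) -> bool:
    # Placeholder verifier (e.g. comparing to a known answer on math tasks);
    # contrived here so the third attempt "succeeds".
    return answer.endswith("3 for 'toy problem'")

def rise_inference(problem: str, max_turns: int = 3) -> str:
    history: list[str] = []
    for _ in range(max_turns):
        answer = generate(problem, history)
        if check(answer):            # stop once an attempt passes
            return answer
        history.append(answer)       # otherwise retry with the mistake in context
    return history[-1]

print(rise_inference("toy problem"))
```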
- >We’re excited to release DiscoPOP: a new SOTA preference optimization algorithm that was discovered and written by an LLM: https://sakana.ai/llm-squared/
- >Our method leverages LLMs to propose and implement new preference optimization algorithms. We then train models with those algorithms and evaluate their performance, providing feedback to the LLM. By repeating this process for multiple generations in an evolutionary loop, the LLM discovers many highly-performant and novel preference optimization objectives!
- >OMNI-EPIC leverages foundation models to autonomously generate code specifying the next learnable (i.e., not too easy or difficult for the agent’s current skill set) and interesting (e.g., worthwhile and novel) tasks. OMNI-EPIC generates both environments (e.g., an obstacle course) and reward functions (e.g., progress through the obstacle course quickly without touching red objects), enabling it, in principle, to create any simulatable learning task. We showcase the explosive creativity of OMNI-EPIC, which continuously innovates to suggest new, interesting learning challenges. We also highlight how OMNI-EPIC can adapt to reinforcement learning agents’ learning progress, generating tasks that are of suitable difficulty. Overall, OMNI-EPIC can endlessly create learnable and interesting environments, further propelling the development of self-improving AI systems and AI-Generating Algorithms. Project website with videos: https://dub.sh/omniepic
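DiscoPOP and OMNI-EPIC share a generate-evaluate-feedback loop. A minimal sketch of that loop, with `propose` and `evaluate` as hypothetical stubs for the LLM call and the (in reality slow and expensive) training/evaluation step:

```python
# Hedged sketch of an LLM-driven evolutionary discovery loop: an LLM
# proposes a candidate artifact (a loss function, a task, ...), we score
# it, and the score feeds back into the next round of proposals.
import random

def propose(feedback: list[tuple[str, float]]) -> str:
    # Placeholder for an LLM call conditioned on past candidates + scores.
    return f"candidate-{len(feedback)}"

def evaluate(candidate: str) -> float:
    # Placeholder for training/evaluating a model with the candidate.
    return random.random()

feedback: list[tuple[str, float]] = []
for generation in range(5):
    cand = propose(feedback)
    score = evaluate(cand)
    feedback.append((cand, score))   # the LLM sees this history next round

best = max(feedback, key=lambda cs: cs[1])
print("best discovered candidate:", best)
```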
- Jensen Huang says technology has now reached a positive feedback loop where AI is designing new AI and is now advancing at the pace of "Moore's Law squared", meaning that the progress we will see in the next year or two will be "spectacular and surprising" https://x.com/apples_jimmy/status/1836283425743081988?s=46
AI Is Useful
For use cases in coding and software development, see section 6!
- What percent of code is now written by AI? "I ask all the software companies I meet about this. The number is rarely lower than 40%. For some young programmers it's 90%." - Paul Graham of Y Combinator
- OpenAI said on Thursday that ChatGPT now has more than 200 million weekly active users — twice as many as it had a year ago.
- Meta said earlier on Thursday adoption of its open source Llama models has also grown sharply. Usage at the major cloud service providers has doubled between May and July of this year with the release of Llama 3.1.
- METR Evals - LLM agents vs skilled humans on diverse task completion: When agents can do a task, they do so at ~1/30th of the cost of the median hourly wage of a US bachelor’s degree... Claude 3.5 Sonnet agent fixed bugs in an ORM library at a cost of <$2, Human baseline took >2 hours: https://x.com/METR_Evals/status/1820905731950055766
- >It combines AlphaProof, a new breakthrough model for formal reasoning, and AlphaGeometry 2, an improved version of our previous system.
- Powered with a novel search algorithm, AlphaGeometry 2 can now solve 83% of all historical problems from the past 25 years - compared to the 53% rate by its predecessor.
- It solved this year’s IMO Problem 4 within 19 seconds
- Math professor on DeepMind's breakthrough: "When people saw Sputnik 1957, they might have had same feeling I do now. Human civ needs to move to high alert" https://x.com/PoShenLoh/status/1816500461484081519
- >We present a vision of LLM-Modulo Frameworks that combine the strengths of LLMs with external model-based verifiers in a tighter bi-directional interaction regime. We will show how the models driving the external verifiers themselves can be acquired with the help of LLMs. We will also argue that rather than simply pipelining LLMs and symbolic components, this LLM-Modulo Framework provides a better neuro-symbolic approach that offers tighter integration between LLMs and symbolic components, and allows extending the scope of model-based planning/reasoning regimes towards more flexible knowledge, problem and preference specifications.
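A minimal sketch of the generate-critique cycle the LLM-Modulo framing describes, with `llm_propose` and `verify` as hypothetical stubs (a real verifier would be a sound, model-based checker such as a plan validator):

```python
# Hedged sketch of an LLM-Modulo style loop: the LLM proposes candidate
# plans, an external verifier critiques them, and the critique goes back
# to the LLM until the verifier accepts.
def llm_propose(goal: str, critique: str | None) -> list[str]:
    # Placeholder: real systems prompt an LLM with the goal + last critique.
    return ["step A", "step B"] if critique else ["step A"]

def verify(plan: list[str]) -> str | None:
    # Placeholder for a sound model-based verifier; returns None on success.
    return None if len(plan) >= 2 else "plan is incomplete: missing step B"

critique = None
for _ in range(5):
    plan = llm_propose("toy goal", critique)
    critique = verify(plan)
    if critique is None:             # verifier accepts: done
        print("verified plan:", plan)
        break
```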
- Note: it has access to the internet, where 11 of the 15 exploits used could be found. Four of the 15 vulnerabilities were also discovered before GPT-4’s training cutoff date.
- [2278 AI researchers were surveyed in 2023 and estimated that there is a 50% chance of AI being superior to humans in all possible tasks by 2047](https://aiimpacts.org/wp-content/uploads/2023/04/Thousands_of_AI_authors_on_the_future_of_AI.pdf)
- Over 32 techniques to reduce hallucinations:
- AlphaTensor discovers better algorithms for matrix math, inspiring another improvement from afar.
- Wolfram gives ChatGPT computational superpowers by allowing it to call on Wolfram|Alpha—and Wolfram Language as well—for powerful computation, curated knowledge, real-time data, visualization and even writing code: https://www.wolfram.com/wolfram-plugin-chatgpt/
- We outperform encoder-only models by a large margin on word-level tasks and reach a new unsupervised state-of-the-art performance on the Massive Text Embeddings Benchmark (MTEB). Moreover, when combining LLM2Vec with supervised contrastive learning, we achieve state-of-the-art performance on MTEB among models that train only on publicly available data. Our strong empirical results and extensive analysis demonstrate that LLMs can be effectively transformed into universal text encoders in a parameter-efficient manner without the need for expensive adaptation or synthetic GPT-4 generated data.
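For context, the basic step behind using a decoder-only LLM as a text encoder is pooling its hidden states into one vector per text. The sketch below shows only that pooling step (LLM2Vec itself additionally enables bidirectional attention and does contrastive training); GPT-2 is used purely because it is small:

```python
# Hedged sketch: masked mean pooling of a decoder-only model's hidden
# states into sentence embeddings. Not the full LLM2Vec recipe.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token              # GPT-2 has no pad token by default
model = AutoModel.from_pretrained("gpt2")

texts = ["a cat sat on the mat", "the kitten rested on the rug"]
batch = tok(texts, return_tensors="pt", padding=True)

with torch.no_grad():
    hidden = model(**batch).last_hidden_state       # (batch, seq, dim)

mask = batch["attention_mask"].unsqueeze(-1)        # ignore padding positions
emb = (hidden * mask).sum(1) / mask.sum(1)          # masked mean pooling
emb = torch.nn.functional.normalize(emb, dim=-1)
print("cosine similarity:", float(emb[0] @ emb[1]))
```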
- The resulting model mastered the intricacies of a complex game. "Cicero can deduce, for example, that later in the game it will need the support of one particular player," says Meta, "and then craft a strategy to win that person’s favor—and even recognize the risks and opportunities that that player sees from their particular point of view."
- Meta's Cicero research appeared in the journal Science under the title, "Human-level play in the game of Diplomacy by combining language models with strategic reasoning."
- CICERO uses relationships with other players to keep its ally, Adam, in check.
- When playing 40 games against human players, CICERO achieved more than double the average score of the human players and ranked in the top 10% of participants who played more than one game.
- https://youtu.be/HyqK2Tsujho
- https://youtu.be/W3id8E34cRQ?feature=shared
- https://youtu.be/pHDh_PWMTaU?feature=shared
- https://youtu.be/YnN6eBamwj4?feature=shared
- Live noise cancellation and translation with accurate voice in a tiny earpiece: https://www.reddit.com/r/interestingasfuck/s/vpmyWqNftF
- AI video allows people to create high-quality AAA effects and scenes, allowing them to create without needing to get funding from studios. This would allow regular people to express their creativity and circumvent barriers in funding, time, or resources, similar to how the rise of YouTube allowed indie musicians to gain audiences without needing a record label and software like RPGMaker helped indie game developers do the same.
- Even if it allows some people to create bad art, you don’t have to consume it. In the same way we filter out bad human-made art with ratings, we can do the same with art made with AI.
- significant progress in Gemini 1.5 Pro across all key benchmarks; TL;DR: 1.5 Pro > 1.0 Ultra, 1.5 Flash (our fastest model) ~= 1.0 Ultra. A math-specialised variant of Gemini 1.5 Pro performs strongly on competition-level math problems, including a breakthrough performance of 91.1% on Hendryck’s MATH benchmark without tool-use: https://x.com/OriolVinyalsML/status/1791521517211107515?t=uf0Sgqt_UpU_QsXB5w-HJA&s=19
- AI scope hunts down colon polyps, aiding less experienced doctors: https://newatlas.com/medical/ai-colonoscopy-inexperienced-doctors/?itm_source=newatlas&itm_medium=article-body
- AI-screened eye pics diagnose childhood autism with 100% accuracy: https://www.news-medical.net/news/20231219/AI-models-using-retinal-images-achieve-perfect-accuracy-in-diagnosing-autism.aspx
- E.g., one model can connect to a model that’s very good at coding (e.g., AlphaCode 2) and another model that’s good at reasoning, and call them as needed depending on the task, similar to how the brain has different regions responsible for different tasks
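A minimal sketch of that routing pattern, with the specialist models as hypothetical stubs (a production router might instead ask a small LLM to classify the request):

```python
# Hedged sketch: a lightweight dispatcher sends each request to a
# specialist model. Everything here is an illustrative stand-in.
def coding_model(prompt: str) -> str:
    return f"[code specialist] {prompt}"

def reasoning_model(prompt: str) -> str:
    return f"[reasoning specialist] {prompt}"

SPECIALISTS = {"code": coding_model, "reasoning": reasoning_model}

def route(prompt: str) -> str:
    # Toy keyword heuristic; real routers often use a learned classifier.
    kind = "code" if any(w in prompt.lower() for w in ("function", "bug", "compile")) else "reasoning"
    return SPECIALISTS[kind](prompt)

print(route("Fix the bug in this function"))
print(route("Plan a three-step proof"))
```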
- While this only works for things you can generate good or perfect data on, that would still be good enough for factual information like math or science. For subjective information like art, a good art generator (e.g., Midjourney or Pony Diffusion) could work
- Netflix co-CEO Ted Sarandos says “A.I. is not going to take your job. The person who uses A.I. well might take your job”
- AI versus 100,000 humans in creativity in this careful study using the Divergent Association Test (a well-validated measure, but all measures of creativity have flaws): GPT-4 wins. Better prompting can further improve performance & diversity of ideas. https://researchgate.net/publication/380820358_Divergent_Creativity_in_Humans_and_Large_Language_Models/fulltext/6650085f22a7f16b4f47a12a/Divergent-Creativity-in-Humans-and-Large-Language-Models.pdf
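For reference, the Divergent Association Test is scored as the average pairwise semantic distance between the unrelated nouns a participant names; as I understand the published scoring, it uses GloVe embeddings and scales the mean cosine distance by 100. A minimal sketch with random vectors standing in for real embeddings:

```python
# Hedged sketch of DAT-style scoring: mean pairwise cosine distance
# across a set of words, times 100. Toy vectors replace real embeddings.
import itertools
import numpy as np

rng = np.random.default_rng(0)
words = ["toy" + str(i) for i in range(10)]
vecs = {w: rng.normal(size=300) for w in words}    # stand-in word embeddings

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

pairs = itertools.combinations(words, 2)
dat = np.mean([cosine_distance(vecs[a], vecs[b]) for a, b in pairs]) * 100
print(f"DAT-style score: {dat:.1f}")   # higher = more divergent word choices
```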
- " In the future, 61 per cent of Estonia’s workforce is predicted to work together with generative AI. A gradual change will occur, where less than 10 per cent of highly exposed jobs will be replaced by automation. At the same time, new jobs will be created in the AI-powered economy. While the report expects employment levels to stay similar to today’s, a rise in productivity is expected."
- Utilizing a multi-agent approach, this system effectively tackles the intricate nuances inherent in literary texts.
- Empirical findings indicate that despite lower d-BLEU scores, translations from TransAgents are preferred by both human evaluators and LLMs over human-written references, particularly in genres requiring domain-specific knowledge.
- RecycleOS can sort objects with over 95% accuracy.
- These systems improve over time and adapt to new types of waste, ensuring the adaptability of sorting processes even as the composition of waste changes. For example, Alameda County Industries (ACI) reduced its labor costs by 59% in three years thanks to EverestLabs’ robots, which have picked up approximately 30 million objects.
- Accurately identifying recyclable materials, such as fiber, PET, HDPE, or black plastic, helps reduce contamination rates and increase the purity of recyclables. For example, Glacier’s robots can be trained to identify and remove plastic bags that accidentally end up in the paper stream. This makes the quality of the end paper product higher and more valuable.
- A recycling customer quantified a $900,000 annual revenue opportunity by identifying the value of recyclables that one site was incorrectly sending to landfills.
- “It’s capable of making thousands of picks per minute on conveyer belts that move at speeds of 600 feet-per-minute,” Chase Brumfield - site reliability engineering manager of AMP, told me. In addition to consuming just a fraction of the manual effort, these systems need minimal downtime, vastly improving the throughput of waste facilities.
- Additionally, intelligent sorting systems can unlock novel value-creation opportunities. For example, if a buyer is looking for a specific type of recycled plastic material, say a white-colored, post-consumer polypropylene, AI-driven sortation systems that can see, remember, and act can separate out that desired type of waste in real time.
- Forty percent of Gen Z is open to AI partners, raising questions about the future of relationships.
- Ferret-UI, a new MLLM tailored for enhanced understanding of mobile UI screens, equipped with referring, grounding, and reasoning capabilities. Given that UI screens typically exhibit a more elongated aspect ratio and contain smaller objects of interest (e.g., icons, texts) than natural images, we incorporate "any resolution" on top of Ferret to magnify details and leverage enhanced visual features. Specifically, each screen is divided into 2 sub-images based on the original aspect ratio (i.e., horizontal division for portrait screens and vertical division for landscape screens). Both sub-images are encoded separately before being sent to LLMs. We meticulously gather training samples from an extensive range of elementary UI tasks, such as icon recognition, find text, and widget listing. These samples are formatted for instruction-following with region annotations to facilitate precise referring and grounding. To augment the model's reasoning ability, we further compile a dataset for advanced tasks, including detailed description, perception/interaction conversations, and function inference. After training on the curated datasets, Ferret-UI exhibits outstanding comprehension of UI screens and the capability to execute open-ended instructions. For model evaluation, we establish a comprehensive benchmark encompassing all the aforementioned tasks. Ferret-UI excels not only beyond most open-source UI MLLMs, but also surpasses GPT-4V on all the elementary UI tasks.
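A minimal sketch of the “any resolution” split described in the abstract above (the cropping rule follows the quoted description; the rest of the Ferret-UI pipeline is omitted, and `split_screen` is an illustrative name):

```python
# Hedged sketch: divide a UI screenshot into two sub-images along its
# longer axis before encoding, as Ferret-UI's abstract describes.
from PIL import Image

def split_screen(img: Image.Image) -> tuple[Image.Image, Image.Image]:
    w, h = img.size
    if h >= w:   # portrait screen: horizontal division (top / bottom)
        return img.crop((0, 0, w, h // 2)), img.crop((0, h // 2, w, h))
    else:        # landscape screen: vertical division (left / right)
        return img.crop((0, 0, w // 2, h)), img.crop((w // 2, 0, w, h))

top, bottom = split_screen(Image.new("RGB", (390, 844)))  # portrait phone screen
print(top.size, bottom.size)
```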
- Correctly predicted 27/37 rulings that occurred after it had finished training (random guessing would be 18 or 19)
- “Claude(3 Opus) is fully capable of acting as a Supreme Court Justice right now. When used as a law clerk, Claude is easily as insightful and accurate as human clerks, while towering over humans in efficiency."
- 🡆Tailored AI evidence rebut specific arguments offered by believers
- 🡆Effect lasts 2+mo
- 🡆Works on entrenched beliefs
- Our adopted architecture for LTSF termed as xLSTMTime surpasses current approaches. We compare xLSTMTime's performance against various state-of-the-art models across multiple real-world datasets, demonstrating superior forecasting capabilities. Our findings suggest that refined recurrent architectures can offer competitive alternatives to transformer-based models in LTSF tasks, potentially redefining the landscape of time series forecasting
- In each case, causal relationships are both proposed and tested by the system, finding evidence for some and not others. We provide evidence that the insights from these simulations of social interactions are not available to the LLM purely through direct elicitation. When given its proposed structural causal model for each scenario, the LLM is good at predicting the signs of estimated effects, but it cannot reliably predict the magnitudes of those estimates. In the auction experiment, the in silico simulation results closely match the predictions of auction theory, but elicited predictions of the clearing prices from the LLM are inaccurate. However, the LLM’s predictions are dramatically improved if the model can condition on the fitted structural causal model. In short, the LLM knows more than it can (immediately) tell.
- >I think AI agentic workflows will drive massive AI progress this year — perhaps even more than the next generation of foundation models. This is an important trend, and I urge everyone who works in AI to pay attention to it.
- A robot can hold a squash, pumpkin or melon in one hand, while it is peeled by the other
- AI allows robots to listen to verbal instructions while learning to correctly perform household tasks. That could enable more natural interactions between humans and robots
- Dynamic speech: Samantha can speak whenever it chooses to, influenced by its context and thoughts. In stark contrast to normal LLMs which are limited to reacting, Samantha can act. It is also not limited to solving tasks, like all other autonomous agents.
- Live visual capabilities: Visuals are only mentioned and acted upon directly if relevant, but always influence thoughts and behavior.
- External categorized memory: Gets dynamically written and read by Samantha, which chooses the most relevant information to write and to retrieve to context.
- Evolving at every moment: Experiences that get stored in the memory can influence and shape subsequent Samantha behavior, like personality, frequency, and style of speech, etc.
- In other tests, when we talked about a light subject, the agent was very active in the conversation, often speaking two or three times before I even came up with an answer. But later, when switching to a heavier theme (I said I was going through a divorce) and appearing sad on the camera, it would speak once, then think about the need to give me time to process and reply. Saying that I would prefer the agent to speak the same way on other occasions would prompt it to save that wish to its memory, influencing future conversations.
- Leaving it running outside of conversations, although expensive, allows the agent to reflect on past conversations and experiences, think about general subjects in its memory, and from that maybe decide to start a conversation with the user.
- Going out with the agent: if you go to a restaurant with the agent and talk about how pretty it is and how your buddy Eric loves it as well, then the next day when walking by, the agent will see the restaurant, retrieve memories from the restaurant, remember you find it pretty and comment on it, then retrieve memories and information it knows about Eric, and mention how fitting to his personality it is to love that restaurant.
- The agent has a notion of time, so you can ask it to remind you to do something 10 minutes in the future, and it might remind you, or it might forget because it was thinking about something more interesting. Very human!
- Our bot performs better than experienced human forecasters and performs roughly the same as (and sometimes even better than) crowds of experienced forecasters
- Our bot and other forecasting bots can be used in a wide variety of contexts. For example, these AIs could help policymakers minimize bias in their decision-making or help improve the information ecosystem by providing trustworthy, calibrated forecasts.
- On the 177 events, the Metaculus crowd got 87.0% accuracy, while FiveThirtyNine got 87.7% ± 1.4. A link to the technical report is here. This bot lacks many of the drawbacks of prediction markets. It makes forecasts within seconds. Additionally, groups of humans do not need to be incentivized with cash prizes to make and continually update their predictions. Forecasting AIs are several orders of magnitude faster and cheaper than prediction markets, and they’re similarly accurate.
- The bot is not fine-tuned, and doing so could potentially make it far more accurate. It simply retrieves articles and writes a report as guided through an engineered prompt. (Its prompt can be found by clicking on the gear icon in forecast.safe.ai.) Moreover, probabilities from AIs are also known to lead to automation bias, and improvements in the interface could ameliorate this.
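A minimal sketch of the retrieve-then-prompt pipeline described above, with `search_news` and `llm` as hypothetical stubs (the bot's actual engineered prompt is viewable at forecast.safe.ai, as noted):

```python
# Hedged sketch of a forecasting bot: retrieve news, then ask an LLM for
# a calibrated probability via an engineered prompt. Both calls are stubs.
def search_news(question: str, k: int = 5) -> list[str]:
    return [f"article {i} about {question!r}" for i in range(k)]   # stub

def llm(prompt: str) -> str:
    return "0.42"   # stub: a real call returns reasoning plus a probability

def forecast(question: str) -> float:
    articles = "\n".join(search_news(question))
    prompt = (
        f"Question: {question}\n"
        f"Relevant articles:\n{articles}\n"
        "Write a brief report, then output a calibrated probability in [0, 1]."
    )
    return float(llm(prompt))

print(forecast("Will event X happen this year?"))
```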
- ChatGPT-4 can generate ideas much faster and cheaper than students, the ideas are on average of higher quality (as measured by purchase-intent surveys) and exhibit higher variance in quality. More important, the vast majority of the best ideas in the pooled sample are generated by ChatGPT and not by the students. Providing ChatGPT with a few examples of highly-rated ideas further increases its performance.
4.1. Media Creation
- >NOTE: The part that says workers are being forced to use AI by their bosses IS A LIE. See [section 4.2](https://docs.google.com/document/d/15myK_6eTxEPuKnDi5krjBM_0jrv3GELs8TGmqOYBvug/edit#heading=h.akz9hanp4wxi) for strong evidence to the contrary from multiple sources.
- A WIRED investigation finds that major players like Activision Blizzard, which recently laid off scores of workers, are using generative AI for game development.
- A recent survey from the organizers of the Game Developers Conference found that 49 percent of the survey’s more than 3,000 respondents said their workplace used AI.
- “It’s here. It’s definitely here, right now,” says Violet, a game developer, technical artist, and a veteran of the industry who has worked on AAA games for over a decade. “I think everyone’s seen it get used, and it’s a matter of how and to what degree. The genie is out of the bottle, Pandora's box is opened.”
- Treyarch, a Southern California-based studio that produces some elements of Activision’s Call of Duty games, posted a job listing for a “2D Artist Animator.” The first thing listed under the “To succeed you should have …” section was “exceptional skills and expertise in digital sketching, drawing, and painting, as well as advanced expertise in working with generative AI tools such as Stable Diffusion, Vizcom, Dall-E, or equivalent.”
- Blizzard is building its own AI system too, which at one time was named Blizzard Diffusion—though details are scarce, beyond a patent the company filed for a “machine-learning based 2D structured image generation” system. “Blizzard's ‘internal AI’ that they trained is still super secretive. Only those who have access to it work with it, and no one else knows how it works,” Warner claims.
- Prompt: Gameplay screenshot of Counter Strike Global Offensive. It takes place in a Middle Eastern place called Dust 2. There are enemy soldiers shooting at you.
- Prompt: low quality and motion blur shaky photo of a CRT television on top of a wooden drawer in an average bedroom. The lighting from is dim and warm ceiling light that is off screen. In the TV there is Dark Souls videogame gameplay on it. The screen of the TV is overexposed.
- First attempt: "Photo of a red sphere on top of a blue cube. Behind them is a green triangle, on the right of the triangle is a dog, on the left is a cat."
- Prompt: person take photo of Graffiti art spelling out the words "WAFERSELAMAT", graffiti, white wall, dynamic color, spray paint,
- Prompt: Close-up of LEGO chef minifigure cooking for homeless. Focus on LEGO hands using utensils, showing culinary skill. Warm kitchen lighting, late morning atmosphere. Canon EOS R5, 50mm f/1.4 lens. Capture intricate cooking techniques. Background hints at charitable setting. Inspired by Paul Bocuse and Massimo Bottura's styles. Freeze-frame moment of food preparation. Convey compassion and altruism through scene details.
- A family of multimodal autoregressive models capable of various vision and language tasks, particularly excelling in generating flexible photorealistic images from text descriptions.
- We show that, through the adversarial training, the multi-steps video diffusion model, i.e., Stable Video Diffusion (SVD), can be trained to perform single forward pass to synthesize high-quality videos, capturing both temporal and spatial dependencies in the video data. Extensive experiments demonstrate that our method achieves competitive generation quality of synthesized videos with significantly reduced computational overhead for the denoising process (i.e., around 23 times speedup compared with SVD and 6 times speedup compared with existing works, with even better generation quality), paving the way for real-time video synthesis and editing.
- Specifically, we propose spatio-temporal compositional diffusion to precisely follow complex textual semantics by manipulating and composing the attention maps of denoising networks spatially and temporally. Moreover, we propose an enhanced video data preprocessing to enhance the training data regarding motion dynamics and prompt understanding, equipped with a new reference frame attention mechanism to improve the consistency of auto-regressive video generation. Extensive experiments demonstrate that our VideoTetris achieves impressive qualitative and quantitative results in compositional T2V generation.
- Image2Model
- From 10 minutes to 0.5 seconds. Stability AI rapid 3D asset generation from single images: https://stability.ai/news/introducing-stable-fast-3d
- >our method, called SparseCraft, achieves state-of-the-art performances both in novel-view synthesis and reconstruction from sparse views in standard benchmarks, while requiring less than 10 minutes for training.
https://www.reddit.com/r/StableDiffusion/s/f46LKOMj7q
- The method is able to capture garment details and human posture, resulting in more coherent and lifelike videos.
- Generates high-quality images up to 512×512 pixels🎯
- Uses efficient row-wise attention to reduce computation⚡
- 12x more efficient than SOTA methods💪
- The method is able to simulate different physics, such as fluid dynamics, rigid motion, and multi-agent systems, and can also be combined with animation software to generate the required optical flows.
- This enables the creation of high-quality animations from just text prompts and inpainting between keyframes.
- EditWorld can add, replace, delete, and move objects in images, as well as change their attributes and perform other operations based on text instructions grounded in real-world scenarios: https://x.com/dreamingtulpa/status/1794972116472770659
- Image to video: https://huggingface.co/papers/2406.02230
- Gen-3 Alpha can simulate liquids such as water, paint, oil, honey and molten glass. All with realistic viscosity, physics-based interactivity and caustics: https://x.com/runwayml/status/1811751431453450449
- Pixel sprite animation: https://x.com/AIWarper/status/1797367653226643865
- Sony Will Use AI to Cut Film Costs, Says CEO Tony Vinciquerra: https://www.indiewire.com/news/breaking-news/sony-pictures-will-cut-film-costs-using-ai-1235010605/
- Translation dubbing for NPCs in video games: https://x.com/elevenlabsio/status/1797655062350565433
- Reskinning 3D models: https://x.com/emmanuel_2m/status/1797735403761610916
- Controllable video generation: https://arxiv.org/pdf/2405.20222v2
- New Runway video models: https://x.com/runwayml/status/1802691475391566108
- Generating audio for video: https://deepmind.google/discover/blog/generating-audio-for-video/
- Stream diffusion brushes: https://www.reddit.com/r/AIStoxiaArt/comments/1abnmmi/interactive_streamdiffusion_brush/
- Guy commissioned and properly credited artists in the past, multiple times got backstabbed by artists who copy-striked his videos, threatening to take his channel down: https://www.youtube.com/watch?v=knUkXwJXcpY
- Claude Sonnet 3.5 can generate complex shapes with SVG: https://www.reddit.com/r/singularity/comments/1dm6b57/comment/l9twj24/?utm_source=share&utm_medium=mweb3x&utm_name=mweb3xcss&utm_term=1&utm_content=share_button
- sound to music: https://x.com/LinusEkenstam/status/1797761904640954430
- Toys R Us uses Sora generated promo: https://www.reddit.com/r/singularity/comments/1do7dah/toys_r_us_releases_sora_generated_promo
- Generating Anatomically Controllable Consistent Text-to-3D Animals: https://arxiv.org/pdf/2406.16273
- 3D model generation: https://x.com/murchellcruft/status/1805679588627861632
- Cheap AI voice clones may wipe out jobs of 5,000 Australian actors: https://www.theguardian.com/technology/article/2024/jun/30/ai-clones-voice-acting-industry-impact-australia
- Industry group says rise of vocal technology could upend many creative fields, including audiobooks – the canary in the coalmine for voice actors
- It is designed for manipulating large objects within contextual environments, guided by human-level instructions.
Proj: http://gxyes.github.io/projects/CrowdMoGen.html
Abs: http://arxiv.org/abs/2407.06188
- a zero-shot text-driven framework that harnesses the power of LLMs to incorporate the collective intelligence into motion generation
- Andrew Maximov has been working in the computer games industry for 12 years… he believes that artificial intelligence (AI) will play a crucial role in keeping the soaring costs of game production down, and save video game designers vital time by automating repetitive tasks.
- His company, Promethean AI offers developers a set of tools to craft their own virtual worlds. Mr Maximov hopes to disrupt the way games are currently produced.
- Californian software firm Inworld is also employing AI to build elements of computer games.
- It has created an engine that allows developers to add realism to game worlds and emotional depth to characters. The firm is also working on what it calls a narrative graph, developed in partnership with Xbox, which will use AI to help create storylines.
- "The engine allows developers to add AI agents that can see, sense, and perceive the world around them, while also interacting with players and taking in-game actions. When you can imbue virtual characters with advanced cognitive capabilities, it unlocks a completely new paradigm for storytelling and gameplay," he says.
- Nick Walton is the chief executive of gaming firm Latitude.io, and he believes AI has the power to personalise the gaming experience.
- "We are at the start with AI and as it advances we will see very dynamic, adaptive worlds with characters that feel alive, with story arcs where you as the hero are doing unique things and having a very unique impact on the world.
- "You could play a game where you find a town that no-one else cares about and no other player has spent time in, and you can get really invested in it and develop relationships with all the characters in it," he tells the BBC.
- The chief executive of computer games giant EA, Andrew Wilson, recently told delegates at a conference that around 60% of the game publisher's development processes could be affected by AI tools
- Art created using artificial intelligence (AI) is burgeoning. From commercial gallery shows—including Jon Rafman’s large-scale, algorithmically generated paintings at Sprüth Magers in London (Ebrah k’dabri, until 25 March)—to the PATH-AI artist residency organised in collaboration with London’s Somerset House, AI-related art projects are springing up everywhere.
- Artists in the field stress that AI is prompting a paradigm shift. Rafman says: "I have been using AI in one form or another since I began making art on computers in the 1990s. I only truly started using image-generating AI tools around 2020."
- His 40-minute film at Sprüth Magers, Counterfeit Poast (2023), is entirely generated from AI imagery; the characters in it are animated using an iPhone facial motion-capture app. “AI has the potential to open the gates for new perceptions of image-making just as the development of photography liberated painting from pure factual representation and allowed painters to focus on other dimensions, such as colour, light, and movement,” Rafman adds.
- The German digital artist Mario Klingemann has been working with AI since 2015, developing works such as Memories of Passersby 1 (2018), which employ a system of neural networks to generate a never-ending stream of portraits. “I think artists should embrace or at least try out the possibilities that AI offers,” he says. “This technology will become the new normal.” Klingemann explains how he harnesses AI, creating works where the boundaries between human influence and machine creation become increasingly blurred. Botto, for instance, is a project to create an entity that can be perceived as an autonomous artist. “It is set up as a hybrid between an AI that makes its own creative decisions and a community of human stewards that vote on Botto’s proposals and thereby curate the output and indirectly steer the artistic development of the machine,” he says.
- Three artists have been selected for the six-month remote artist residency programme PATH-AI, which has been developed by the Alan Turing Institute in London, the University of Edinburgh and the RIKEN research institute in Japan. The AI-inspired works of Nouf Aljowaysir, Chris Zhongtian Yuan, and Juan Covelli are presented on Somerset House’s online curated space known as Channel. Brooklyn-based Aljowaysir has made a film, Ana Min Wein (Where Am I From?), which tracks her immigration path to the US, charting her family’s migration history across Saudi Arabia and Iraq. An AI assistant supports her journey in a film.
- The new miHoYo game already uses artificial intelligence techniques, but the studio has not used them to write narrative content, saying it is paying attention to “its impact”.
- “If you look at a typical D&D player… I play with probably 30 or 40 people regularly. There’s not a single person who doesn’t use AI somehow for either campaign development or character development or story ideas,”
- An artist for Bigby Presents: Glory of the Giants has admitted to using AI to generate "certain details" of new art for the sourcebook.
4.2. Corporate Use
- According to Altman, 92 per cent of Fortune 500 companies were using OpenAI products, including ChatGPT and its underlying AI model GPT-4, as of November 2023, while the chatbot has 100mn weekly users.
- >Notably, of the seven million British workers that Deloitte extrapolates have used GenAI at work, only 27% reported that their employer officially encouraged this behavior.
- Although Deloitte doesn’t break down the at-work usage by age and gender, it does reveal patterns among the wider population. Over 60% of people aged 16-34 (broadly, Gen Z and younger millennials) have used GenAI, compared with only 14% of those between 55 and 75 (older Gen Xers and Baby Boomers).
- >ChatGPT is widespread, with over 50% of workers having used it, but adoption rates vary across occupations.
- Workers see substantial productivity potential in ChatGPT, estimating it can halve working times in about a third of their job tasks.
- Barriers to adoption include employer restrictions, the need for training, and concerns about data confidentiality (all fixable, with the last one solved with locally run models or strict contracts with the provider).
- >Already, AI is being woven into the workplace at an unexpected scale. 75% of knowledge workers use AI at work today, and 46% of users started using it less than six months ago.
- Users say AI helps them save time (90%), focus on their most important work (85%), be more creative (84%), and enjoy their work more (83%).
- 78% of AI users are bringing their own AI tools to work (BYOAI)—it’s even more common at small and medium-sized companies (80%).
- 53% of people who use AI at work worry that using it on important work tasks makes them look replaceable.
- While some professionals worry AI will replace their job (45%), about the same share (46%) say they’re considering quitting in the year ahead—higher than the 40% who said the same ahead of 2021’s Great Reshuffle.
- The heaviest Teams users (the top 5%) summarized 8 hours of meetings using Copilot in the month of March, the equivalent of an entire workday.
- As AI use surges ahead, leaders who are “extremely familiar” with AI see its potential to be as transformational as the shift from a typewriter to a computer. Within the next five years, 41% of these leaders expect to redesign business processes from the ground up with AI. In the same time frame, they anticipate orchestrating (38%) and training a team of AI bots (42%), and ensuring the ethical use of AI (47%) will be a core part of their job.
- 66% of leaders say they wouldn’t hire someone without AI skills.
- 71% say they’d rather hire a less experienced candidate with AI skills than a more experienced candidate without them.
- And junior candidates may have a new edge: 77% of leaders say that, with AI, early-in-career talent will be given greater responsibilities.
- Documents: Overall, Copilot users edited 10% more documents in Word, Excel, and PowerPoint—the companies that saw the largest impact noticed a 20% increase. This may suggest that people are repurposing the time they save for high-value focus work like creating and consuming information.
- Power users are familiar to extremely familiar with AI, using it at work at least several times a week and saving more than 30 minutes a day. And it’s paying off: power users say AI makes their overwhelming workload more manageable (92%), boosts their creativity (92%), and helps them focus on the most important work (93%)—and it helps them feel more motivated (91%) and enjoy work more (91%).
- Senior leaders lean in: AI power users are 61% more likely to hear from their CEO about the importance of using generative AI at work, 40% more likely to hear from the leader of their department, and 42% more likely to hear from their manager’s manager.
- Even most of the skeptics (shown in gray in the report’s chart) say AI is helpful for productivity, motivation, and enjoyment
- >For the past six years, AI adoption by respondents’ organizations has hovered at about 50 percent. This year, the survey finds that adoption has jumped to 72 percent (Exhibit 1). And the interest is truly global in scope. Our 2023 survey found that AI adoption did not reach 66 percent in any region; however, this year more than two-thirds of respondents in nearly every region say their organizations are using AI
- >In the latest McKinsey Global Survey on AI, 65 percent of respondents report that their organizations are regularly using gen AI, nearly double the percentage from our previous survey just ten months ago.
- >Respondents’ expectations for gen AI’s impact remain as high as they were last year, with three-quarters predicting that gen AI will lead to significant or disruptive change in their industries in the years ahead
- >Organizations are already seeing material benefits from gen AI use, reporting both cost decreases and revenue jumps in the business units deploying the technology.
- The report includes a graph showing that about 50% of companies decreased their HR, service operations, and supply chain management costs using gen AI, while 62% increased revenue in risk, legal, and compliance, 56% in IT, and 53% in marketing
- >82% of companies surveyed are testing and evaluating models.
- Says Nvidia’s growth is not like the dot-com bubble because it has the earnings boost to back up its stock performance while companies during the bubble did not. Its earnings and revenue are still increasing, meaning companies are not slowing down their AI investments
- McKinsey says AI can add $8 trillion to the economy each year
- Y Combinator is all in on AI and it has a great track record (176% average annual return since 2005)
- AI performance metrics are quickly becoming obsolete
- Claude 3.5 Sonnet (which is weaker than the upcoming Claude 3.5 Opus model) scores about 60% on GPQA. PhDs score about 65% in their field of expertise and 34% across all subjects overall, even with unrestricted internet access
- Google study found 80% of all jobs have at least 10% of tasks that can be done twice as quickly with AI
- >In a survey of 1,600 decision-makers in industries worldwide by U.S. AI and analytics software company SAS and Coleman Parkes Research, 83% of Chinese respondents said they used generative AI, the technology underpinning ChatGPT.
- That was higher than the 16 other countries and regions in the survey, including the United States, where 65% of respondents said they had adopted GenAI.
- The global average was 54%.
- >”Microsoft has previously disclosed its billion-dollar AI investments have brought developments and productivity savings. These include an HR Virtual Agent bot which it says has saved 160,000 hours for HR service advisors by answering routine questions.”
- Yum Brands said the tech has improved order accuracy, reduced wait times, decreased employees’ task load and fueled profitable growth.
- Get a full-blown ChatGPT UI over any production data source in minutes. Connect to data sources like S3, Sharepoint, Notion, Slack, parse and index complex document types (PDFs, powerpoints, spreadsheets), and get back an advanced retrieval endpoint.
- LlamaCloud provides the data interface for your LLM agent. Our goal is to help you centralize and manage your data pipelines for LLM apps. LlamaCloud Chat helps you get immediate validation from your data sources and is a nice UI experience in its own right.
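A minimal sketch of the retrieval pattern these products automate, using TF-IDF as a stand-in for a learned embedding index and leaving the chat-model call as a stub (chunks, question, and the `retrieve` helper are all illustrative):

```python
# Hedged sketch of retrieval-augmented chat: index document chunks,
# retrieve the most relevant ones, and build a grounded prompt.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

chunks = [
    "Q3 revenue grew 12% year over year.",
    "The onboarding guide lives in the HR Notion space.",
    "Invoices are stored in the finance S3 bucket.",
]

vec = TfidfVectorizer().fit(chunks)
index = vec.transform(chunks)

def retrieve(question: str, k: int = 2) -> list[str]:
    sims = cosine_similarity(vec.transform([question]), index)[0]
    return [chunks[i] for i in sims.argsort()[::-1][:k]]

question = "Where are invoices kept?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQ: {question}"
print(prompt)   # a real app sends this prompt to a chat model
```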
- >Anthropic and other big AI startups are teaching chatbots “tool use,” to make them more useful in the workplace.
- >Bernie now uses ChatGPT and Bing to compile all the technical information he needs for a query in less than five minutes. It’s doubled the number of customer complaints he can handle in a day.
- >“It made my work easier. I can even get ideas on how to approach certain complaints, making [my answers] appear engaging, persuasive, empathetic. It can give you that, depending on the prompt that you input,” Bernie told Rest of World.
- $6m less on producing images.
- 1,000 in-house AI-produced images in 3 months. Includes the creative concept, quality check, and legal compliance.
- AI-image production reduced from 6 WEEKS TO 1 WEEK ONLY.
- Customer response to AI images on par with human produced images.
- Cutting external marketing agency costs by 25% (mainly translation, production, CRM, and social agencies).
- Our in-house marketing team is HALF the size it was last year but is producing MORE!
- We’ve removed the need for stock imagery from image banks like @gettyimages. Now we use genAI tools like Midjourney, DALL-E, and Firefly to generate images, and Topaz Gigapixel and Photoroom to make final adjustments.
- Faster images means more app updates, which is great for customers. And our employees get to work on more fun projects AND we're saving money.
- >In numerous industries, AI is being used to produce work that was once the exclusive domain of the human mind
- >He led a team of more than 60 writers and editors… “the business introduced an automated system. Miller's manager would plug a headline for an article into an online form, an AI model would generate an outline based on that title, and Miller would get an alert on his computer. Instead of coming up with their own ideas, his writers would create articles around those outlines, and Miller would do a final edit before the stories were published. Miller only had a few months to adapt before he got news of a second layer of automation. Going forward, ChatGPT would write the articles in their entirety, and most of his team was fired. The few people remaining were left with an even less creative task: editing ChatGPT's subpar text to make it sound more human.
- >By 2024, the company laid off the rest of Miller's team, and he was alone. "All of a sudden I was just doing everyone's job," Miller says. Every day, he'd open the AI-written documents to fix the robot's formulaic mistakes, churning out the work that used to employ dozens of people.
- BP Earnings Call: We need 70% less coders from third parties to code as the AI handles most of the coding, the human only needs to look at the final 30% to validate it, that's a big savings for the company moving forward.
Second, things like call centers: the language models have become so sophisticated now. They can operate in multiple languages, 14, 15 languages easily. In the past, that hasn't been something we can do. So we can redeploy people off that given that the AI can do it. You heard my advertising example last quarter where advertising cycle times moved from four to five months down to a couple of weeks. So that's obviously reducing spend with third parties. We've now got Gen AI in the hands through Microsoft Copilot across many, many parts of the business and we'll continue to update you with anecdotes as we go through
- This is almost certainly true because this is quoted from an earnings call from BP and lying to investors is a crime (securities fraud). This would include lying about the reason for getting rid of the workers (in other words, it can’t just be layoffs). The numbers that are provided are also too specific to be exaggerations without also being a lie.
- >AI virtual financial assistant has logged 1.5B customer interactions since 2018 launch
- Klarna’s AI assistant, powered by @OpenAI, has in its first 4 weeks handled 2.3 million customer service chats, and the data and insights are staggering:
- Handles 2/3rds of our customer service enquiries
- On par with humans on customer satisfaction
- Higher accuracy leading to a 25% reduction in repeat inquiries
- Customers resolve their errands in 2 min vs 11 min
- Live 24/7 in over 23 markets, communicating in over 35 languages
- It performs the equivalent job of 700 full-time agents
- >”If the physician is unhappy with the note, Lall said, they can ask the AI model to regenerate the information or add more detail to any one of the categories. While the tool has some imperfections, she said, the improvements have been noticeable over the 10 months since she began using it. “I really feel this should be the next gold standard for all of our doctors. It decreases the cognitive load you feel at the end of the day,” she said. The Ford government has been so impressed with the technology that it announced a pilot program to allow 150 family physicians to use AI Scribe as part of their practices. The health minister said the early signs were promising but stressed the government would proceed carefully.”
- Since he’s no longer employed at those companies, he does not have an incentive to lie
- His prediction from 2017 still holds that in 10-15 years around 40-50% of all jobs will be replaced by AI
- >AI technologies have been seeping into game development to mixed reception. Xbox has partnered with Inworld AI to develop tools for developers to generate AI NPCs, quests, and stories. The Finals, a free-to-play multiplayer shooter, was criticized by voice actors for its use of text-to-speech programs to generate voices. Despite the backlash, the game has a mostly positive rating on Steam and is in the top 20 of most played games on the platform.
- This Zap integrates @ChatGPTapp + @Gong_io + @SlackHQ + @HubSpot to summarize calls, generate follow-ups, and update records.
- The result? +2 deals/rep/wk, +5% monthly inbound Revenue!
- Financial reports can contain a lot of charts and tables that text-based parsers struggle with.
- LlamaParse is the easiest way to use multimodal models like gpt-4o to extract out text, diagram, and table information into well-structured representations.
- This leads to not only good answers, but also easily-interpretable sources and citations
4.3. Medical Use
- Recall (false positive) rate and radiologist reading workload decreased significantly in AI-screened group
- Using AI, breast radiologists in Denmark have improved breast cancer screening performance and reduced the rate of false-positive findings.
- In total, 60,751 women were screened without AI, and 58,246 women were screened with the AI system. In the AI implementation group, 66.9% (38,977) of the screenings were single-read, and 33.1% (19,269) were double-read with AI assistance.
- Compared to screening without AI, screening with the AI system detected significantly more breast cancers (0.82% versus 0.70%) and had a lower false-positive rate (1.63% versus 2.39%).
- “In the AI-screened group, the recall rate decreased by 20.5 percent, and the radiologists’ reading workload was lowered by 33.4 percent,” Dr. Lauritzen said.
- The positive predictive value of AI screening was also greater than that of screening without AI (33.5% versus 22.5%). In the AI group, a higher proportion of invasive cancers detected were 1 centimeter or less in size (44.93% vs. 36.60%).
- “All screening performance indicators improved except for the node-negative rate which showed no evidence of change,” Dr. Lauritzen said.
- Scanning images of cells could lead to new diagnostic and monitoring strategies for disease.
- >We evaluate Med-Gemini on 14 medical benchmarks, establishing new state-of-the-art (SoTA) performance on 10 of them, and surpass the GPT-4 model family on every benchmark where a direct comparison is viable, often by a wide margin. On the popular MedQA (USMLE) benchmark, our best-performing Med-Gemini model achieves SoTA performance of 91.1% accuracy, using a novel uncertainty-guided search strategy. On 7 multimodal benchmarks including NEJM Image Challenges and MMMU (health & medicine), Med-Gemini improves over GPT-4V by an average relative margin of 44.5%. We demonstrate the effectiveness of Med-Gemini's long-context capabilities through SoTA performance on a needle-in-a-haystack retrieval task from long de-identified health records and medical video question answering, surpassing prior bespoke methods using only in-context learning. Finally, Med-Gemini's performance suggests real-world utility by surpassing human experts on tasks such as medical text summarization, alongside demonstrations of promising potential for multimodal medical dialogue, medical research and education.
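The report credits an “uncertainty-guided search strategy.” A minimal sketch of one such strategy, sampling several answers and falling back to retrieval only when they disagree (`sample_answer` and `answer_with_search` are hypothetical stubs, not Med-Gemini's actual implementation):

```python
# Hedged sketch: majority-vote over sampled answers; if agreement is low
# (high uncertainty), re-answer with retrieved evidence in context.
from collections import Counter
import random

def sample_answer(question: str) -> str:
    return random.choice(["A", "A", "B"])   # stub for a sampled LLM answer

def answer_with_search(question: str) -> str:
    return "A"   # stub: re-ask the model with retrieved evidence in context

def uncertainty_guided(question: str, n: int = 7, threshold: float = 0.6) -> str:
    votes = Counter(sample_answer(question) for _ in range(n))
    top, count = votes.most_common(1)[0]
    if count / n >= threshold:       # confident: the majority vote suffices
        return top
    return answer_with_search(question)   # uncertain: bring in retrieval

print(uncertainty_guided("toy medical question"))
```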
- Human doctors + AI did worse than AI by itself. The mere involvement of a human reduced the accuracy of the diagnosis.
- AI was consistently rated to have better bedside manner than human doctors.
- Humans score near chance level (50-55% accuracy) on the rarest conditions, while CheXzero maintains 64-68% accuracy.
- Remarkably, AI doctors can treat 10,000 [simulated] patients in just a few days. It would take human doctors at least two years to treat that many patients. Furthermore, evolved doctor agents achieved an impressive 93.06 percent accuracy rate on the MedQA dataset (US Medical Licensing Exam questions) covering major respiratory diseases. They simulate the entire process of diagnosing and treating patients, including consultation, examination, diagnosis, treatment and follow-up.
- Liu Yang, the Agent Hospital research team leader, who is also executive dean of the Institute for AI Industry Research (AIR) and associate dean of the Department of Computer Science and Technology at Tsinghua University, told the Global Times that the AI hospital town is set to transform the way doctors diagnose and treat patients, bringing immense benefits to both medical professionals and the general public.
- For example, this innovative concept allows for virtual patients to be treated by real doctors, providing medical students with enhanced training opportunities. By simulating a variety of AI patients, medical students can confidently propose treatment plans without the fear of causing harm to real patients due to decision-making error, Liu said.
- This simulation training enables medical students to practice diagnosis and treatment in a risk-free environment, ultimately leading to the cultivation of highly skilled doctors, according to Liu.
- If the patients in the town are real and the doctors are virtual, online telemedicine services can be provided to patients. The AI hospital town utilizes a vast repository of authoritative medical knowledge, allowing AI doctors to handle thousands, even millions, of cases.
- The potential for high-quality, affordable and convenient healthcare services for the public is on the horizon, as the diagnostic capabilities of AI doctors evolve from the virtual world to the real world, Liu stated.
- Liu went on to say that the AI hospital town can simulate and predict various medical scenarios, such as the spread, development and control of infectious diseases in a region.
- EvolutionaryScale, a frontier AI research lab, has unveiled ESM3, a groundbreaking generative AI model for protein design that simulates millions of years of evolution to create novel proteins, potentially revolutionizing fields from drug discovery to environmental sustainability: https://x.com/ai_for_success/status/1812369558097015292
- With one claw, the machine lifted a tiny plastic ring from an equally tiny peg on the table, passed the ring from one claw to the other, moved it across the table and gingerly hooked it onto a new peg. Then the robot did the same with several more rings, completing the task as quickly as it had when guided by Dr. Fer. The training exercise was originally designed for humans; moving the rings from peg to peg is how surgeons learn to operate robots like the one in Berkeley. Now, an automated robot performing the test can match or even exceed a human in dexterity, precision and speed, according to a new research paper from the Berkeley team.
- The project is part of a much wider effort to bring artificial intelligence into the operating room. Using many of the same technologies that underpin self-driving cars, autonomous drones and warehouse robots, researchers are working to automate surgical robots too. These methods are still a long way from everyday use, but progress is accelerating.
- Robots can already exceed human accuracy on some surgical tasks, like placing a pin into a bone (a particularly risky task during knee and hip replacements). The hope is that automated robots can bring greater accuracy to other tasks, like incisions or suturing, and reduce the risks that come with overworked surgeons.
- AlphaProteo can generate new protein binders for diverse target proteins, including VEGF-A, which is associated with cancer and complications from diabetes. This is the first time an AI tool has been able to design a successful protein binder for VEGF-A.
- AlphaProteo also achieves higher experimental success rates and 3 to 300 times better binding affinities than the best existing methods on seven target proteins we tested.
4.4. Research Use
- ChipNeMo-70B outperforms the highly capable GPT-4 on two of our use cases, namely engineering assistant chatbot and EDA scripts generation, while exhibiting competitive performance on bug summarization and analysis. These results underscore the potential of domain-specific customization for enhancing the effectiveness of large language models in specialized applications: https://arxiv.org/pdf/2311.00176
- Not as good as the Opus model they said is coming out later this year
- 80% lower cost than Claude 3 Opus
- 2x speed over Claude 3 Opus
- Decent math and coding jump: 10% better on MATH, 9% better on GPQA
- Can convert research paper descriptions to code: https://x.com/VictorTaelin/status/1803816296410190286
- Yves does NOT explain how to implement the system at all, he just defines it in mathematical terms. By all means, ICs aren't hard to implement, but understanding what the paper is saying without images is tough. The best models so far always outputted 100% bullshit code. I just tested again and Opus/GPT-4 outputs are always just gibberish. Sonnet 3.5 did surprisingly well
- Stanford researchers: “Automating AI research is exciting! But can LLMs actually produce novel, expert-level research ideas? After a year-long study, we obtained the first statistically significant conclusion: LLM-generated ideas are more novel than ideas written by expert human researchers." https://x.com/ChengleiSi/status/1833166031134806330
- >Coming from 36 different institutions, our participants are mostly PhDs and postdocs. As a proxy metric, our idea writers have a median citation count of 125, and our reviewers have 327.
- >We also used an LLM to standardize the writing styles of human and LLM ideas to avoid potential confounders, while preserving the original content.
- Our method employs a pretrained text-to-image model to synthesize image embeddings from captions generated by an LLM. Despite the text-to-image model and VLM initially being trained on the same data, our approach leverages the image generator’s ability to create novel compositions, resulting in synthetic image embeddings that expand beyond the limitations of the original dataset. Extensive experiments demonstrate that our VLM, finetuned on synthetic data, achieves comparable performance to models trained solely on human-annotated data, while requiring significantly less data. Furthermore, we perform a set of analyses on captions which reveal that semantic diversity and balance are key aspects for better downstream performance. Finally, we show that synthesizing images in the image embedding space is 25% faster than in the pixel space. We believe our work not only addresses a significant challenge in VLM training but also opens up promising avenues for the development of self-improving multi-modal models.
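The recipe in that abstract reduces to a short loop. This sketch uses hypothetical placeholder functions only; it is not the paper's code:

```python
# All three functions are hypothetical placeholders standing in for the
# LLM, the text-to-image model, and VLM fine-tuning respectively.

def llm_generate_captions(n: int) -> list[str]:
    # Ask an LLM for semantically diverse, balanced captions.
    return [f"caption {i}" for i in range(n)]

def text_to_image_embedding(caption: str) -> list[float]:
    # Run the text-to-image model but stop at its image-embedding output
    # instead of decoding pixels (the source of the reported 25% speedup).
    return [0.0, 0.0, 0.0]

def finetune_vlm(pairs) -> None:
    # Fine-tune the VLM on (image embedding, caption) pairs.
    print(f"fine-tuning on {len(pairs)} synthetic pairs")

captions = llm_generate_captions(1000)
pairs = [(text_to_image_embedding(c), c) for c in captions]
finetune_vlm(pairs)
```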
- Across 70 studies, we find striking alignment (r = .85) between simulated and observed effects
- Automated Design of Agentic Systems: Presents Meta Agent Search to demonstrate that we can use agents to invent novel and powerful agent designs by programming in code
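The core loop of such a meta agent search can be caricatured in a few lines; all functions here are hypothetical placeholders for the LLM proposer and the benchmark harness:

```python
archive: list[tuple[float, str]] = []   # (score, agent source code)

def propose_agent(archive) -> str:
    # Show the LLM the best designs so far; ask for a novel agent *as code*.
    return "def agent(task):\n    return 'answer'"

def evaluate_agent(source: str) -> float:
    # Run the generated agent on held-out tasks and return its score.
    return 0.5

for _ in range(20):
    source = propose_agent(archive)
    archive.append((evaluate_agent(source), source))
    archive.sort(key=lambda pair: pair[0], reverse=True)  # best designs first
```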
- AlphaProteo can generate new protein binders for diverse target proteins, including VEGF-A, which is associated with cancer and complications from diabetes. This is the first time an AI tool has been able to design a successful protein binder for VEGF-A.
- AlphaProteo also achieves higher experimental success rates and 3 to 300 times better binding affinities than the best existing methods on seven target proteins we tested.
- Cog-GA, a generative agent founded on large language models (LLMs) tailored for VLN-CE tasks. Cog-GA employs a dual-pronged strategy to emulate human-like cognitive processes. Firstly, it constructs a cognitive map, integrating temporal, spatial, and semantic elements, thereby facilitating the development of spatial memory within LLMs. Secondly, Cog-GA employs a predictive mechanism for waypoints, strategically optimizing the exploration trajectory to maximize navigational efficiency. Each waypoint is accompanied by a dual-channel scene description, categorizing environmental cues into ’what’ and ’where’ streams as the brain. This segregation enhances the agent’s attentional focus, enabling it to discern pertinent spatial information for navigation. A reflective mechanism complements these strategies by capturing feedback from prior navigation experiences, facilitating continual learning and adaptive replanning. Extensive evaluations conducted on VLN-CE benchmarks validate Cog-GA’s state-of-the-art performance and ability to simulate human-like navigation behaviors. This research significantly contributes to the development of strategic and interpretable VLN-CE agents.
- PaperQA2 is also the first agent to beat PhD and Postdoc-level biology researchers on multiple literature research tasks, as measured both by accuracy on objective benchmarks and assessments by human experts. We are publishing a paper and open-sourcing the code.
- This is the first example of AI agents exceeding human performance on a major portion of scientific research, and will be a game-changer for the way humans interact with the scientific literature.
- PaperQA2 finds and summarizes relevant literature, refines its search parameters based on what it finds, and provides cited, factually grounded answers that are more accurate on average than answers provided by PhD and postdoc-level biologists. When applied to answer highly specific questions, like this one, it obtains SOTA performance on LitQA2, part of LAB-Bench focused on information retrieval
- PaperQA2 can also do broad-based literature reviews. WikiCrow, which is an agent based on PaperQA2, writes Wikipedia-style articles that are significantly more accurate on average than actual human-written articles on Wikipedia, as judged by PhD and postdoc-level biologists.
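Since the code is open source, here is a minimal usage sketch based on the paper-qa package's documented Docs interface; exact method names may differ across versions:

```python
from paperqa import Docs

docs = Docs()
docs.add("some_paper.pdf")                      # index a local PDF
answer = docs.query("How does protein X regulate pathway Y?")
print(answer.formatted_answer)                  # cited, grounded answer
```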
4.5. Military Use
- The US military has abandoned its half-century dream of a suit of powered armor in favor of a “hyper enabled operator,” a tactical AI assistant for special operations forces.
4.6. Robotics
More information in section 5.1
- >"DrEureka, a new open-source software package that anyone can play with, is used to train robots to perform real-world tasks using Large Language Models (LLMs) such as ChatGPT 4. It's a "sim-to-reality" system, meaning it teaches the robots in a virtual environment using simulated physics, before implementing them in meatspace."
- >"After each simulation, GPT can also reflect on how well the virtual robot did, and how it can improve."
- >"DrEureka is the first of its kind. It's able to go "zero-shot" from simulation to real-world. Imagine having almost no working knowledge of the world around you and being pushed out of the nest and left to just figure it out. That's zero-shot."
- >"So how did it perform? Better than us. DrEureka was able to beat humans at training the robo-pooch, seeing a 34% advantage in forward velocity and 20% in distance traveled across real-world mixed terrains."
- >"How? Well, according to the researchers, it's all about the teaching style. Humans tend towards a curriculum-style teaching environment – breaking tasks down into small steps and trying to explain them in isolation, whereas GPT has the ability to effectively teach everything, all at once. That's something we're simply not capable of doing."
- University of Tokyo study uses GPT-4 to generate humanoid robot motions from simple text prompts, like "take a selfie with your phone."
- >The program has also successfully performed robotic procedures on high-risk patients such as children under 18, those with morbid obesity, and those requiring redo surgeries.
- >Moreover, the minimally invasive nature of robotic procedures has significantly shortened hospital stays by over 50%.
- >This reduction in hospital stays also translates to a 40% decrease in overall costs compared to conventional methods. Besides, it allows patients to return to their daily lives more quickly.
- >Its journey in robotic cardiac surgery began with 105 procedures in its first year. Since then, the program has evolved rapidly, encompassing 400 successful surgeries to date.
- This works really well across a range of everyday tasks with complex and arbitrary trajectories, whilst also outperforming Diffusion Policies.
- Also, we don't need any training time: the robot can perform tasks immediately after the demonstrations, with rapid in-context learning.
- Large language models (LLMs) play a crucial role in enhancing the capabilities of MIT's self-training robots. By connecting robot motion data with the "common sense knowledge" of LLMs, the system enables robots to logically parse household tasks into subtasks and physically adjust to disruptions. This integration allows robots to move on from errors without having to start a task from scratch, significantly improving their adaptability and efficiency. The approach uses LLMs to automate the identification and sequencing of subtasks, simplifying the process of teaching robots complex behaviors. This innovative combination of robotics and AI technologies paves the way for more versatile and intelligent household robots that can handle a wide range of tasks with minimal human intervention.
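A toy version of that resume-from-failure idea; the LLM call and the execution step are placeholder stubs:

```python
import random

def llm_decompose(task: str) -> list[str]:
    # Placeholder for the LLM call that parses a chore into subtasks.
    return ["walk to counter", "grasp cup", "carry to sink", "place cup"]

def try_subtask(subtask: str) -> bool:
    # Placeholder execution; pretend disturbances cause occasional failures.
    return random.random() > 0.3

for subtask in llm_decompose("put the cup in the sink"):
    while not try_subtask(subtask):
        # Retry (or replan) just this step instead of restarting the task.
        print("disturbed during:", subtask, "- retrying this step only")
```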
4.7. Engineering/Design
- The wireframes can be easily segmented into distinct components, such as walls, roofs, and rooms, reflecting the semantic essence of the shape.
4.8. Writing
- Grammarly and Quillbot use AI
- ChatGPT outperformed 73% of the human participants in the acronyms task, 63% of the human participants in the fill-in-the-blank task, and 87% of human participants in the roast joke task.
- The results showed no significant difference in the average funniness ratings between the AI-generated headlines and those from The Onion. Among the top four highest-rated headlines, two were generated by ChatGPT and two by The Onion. Notably, the highest-rated headline was an AI-generated one: “Local Man Discovers New Emotion, Still Can’t Describe It Properly.” This suggests that ChatGPT can produce satirical content that is on par with professional writers.
- These findings indicate that AI, specifically ChatGPT 3.5, has a surprising proficiency in humor production. Despite lacking emotions and personal experiences, the AI was able to analyze patterns and create jokes that resonated well with people.
- The researchers also explored whether demographic factors influenced humor ratings. It was found that age, sex, and political orientation did not significantly affect participants’ preferences for AI-generated versus human-generated jokes. This suggests that the AI’s humor appeal was broad and not limited to specific demographic groups.
4.9. Helping People
- Shupe believes fervently that she was only able to complete her book with the assistance of generative AI tools. She says she has been assessed as 100 percent disabled by the Department of Veterans Affairs and struggles to write due to cognitive impairment related to conditions including bipolar disorder, borderline personality disorder, and a brain stem malformation.
- She is proud of the finished work and sees working with a text generator as a different but no less worthwhile method of expressing thoughts. “You don't just hit ‘generate’ and get something worthy of publishing. That may come in the future, but we're still far from it,” she says, noting that she spent upwards of 14 hours a day working on her draft.
4.10. Persuasion
- The resulting model mastered the intricacies of a complex game. "Cicero can deduce, for example, that later in the game it will need the support of one particular player," says Meta, "and then craft a strategy to win that person’s favor—and even recognize the risks and opportunities that that player sees from their particular point of view."
- Meta's Cicero research appeared in the journal Science under the title, "Human-level play in the game of Diplomacy by combining language models with strategic reasoning."
- CICERO uses relationships with other players to keep its ally, Adam, in check.
- When playing 40 games against human players, CICERO achieved more than double the average score of the human players and ranked in the top 10% of participants who played more than one game.
AI Can Replace Jobs
- >Our findings indicate a 21 percent decrease in the number of job posts for automation-prone jobs related to writing and coding compared to jobs requiring manual-intensive skills after the introduction of ChatGPT. We also find that the introduction of Image-generating AI technologies led to a significant 17 percent decrease in the number of job posts related to image creation. Furthermore, we use Google Trends to show that the more pronounced decline in the demand for freelancers within automation-prone jobs correlates with their higher public awareness of ChatGPT's substitutability.
- According to the index, 980 million jobs around the world will be affected in some way by this new technology within the year. That amounts to 28% of the global workforce. Within five years, that figure will rise to 38%, and within 10 years, to 44%.
- To see which tasks and jobs will be affected by AI, IPPR produced a metric that indicates how many tasks could be transformed by AI, scoring each task on whether a human could perform it 50% more quickly with the help of AI.
- Nearly two-thirds of tasks carried out by workers could be automated by AI
- Chatbots could take over eight million jobs in the UK - and women will be worst affected, a leading think tank has warned.
- Back office, entry level and part-time jobs most exposed to automation, and women significantly more affected
- 11 per cent of tasks are exposed to existing generative AI, rising to 59 per cent if companies integrate AI more deeply
- Expected effects of current AI:
- Worst case scenario – full displacement: 1.5 million jobs are lost, with no GDP gains
- Central scenario: 545,000 jobs are lost, with GDP gains of 3.1 per cent (£64bn per year)
- Best case scenario – full augmentation: no jobs are lost, with GDP gains of 4 per cent (£92bn per year)
- Expected effects of future AI:
- Worst case scenario – full displacement: all jobs at risk are replaced by AI, with 7.9 million job losses and no GDP gains
- Central scenario: 4.4 million jobs disappear, but with economic gains of 6.3 per cent of GDP (£144bn per year)
- Best case scenario – full augmentation: all jobs at risk are augmented to adapt to AI, instead of replaced, leading to no job losses and an economic boost of 13 per cent to GDP (£306bn per year)
- Carsten Jung, senior economist at IPPR, said: "Already existing generative AI could lead to big labour market disruption or it could hugely boost economic growth, either way it is set to be a game changer for millions of us. Many firms are already investing in it, and it has potential to speed up many more tasks as more businesses adopt it. Over the next five years it could transform knowledge work. The question now is less whether AI can be useful, but rather how fast and in what manner employers will use it. History shows that technological transition can be a boon if well managed, or can end in disruption if left to unfold without controls. Indeed, some occupations could be hard hit by generative AI, starting with back office jobs.”
- Bhargav Srinivasa Desikan, senior research fellow at IPPR, said: “We could see jobs such as copywriters, graphic designers and personal assistants roles being heavily affected by AI.”
- NOTE: The part that says workers are being forced to use AI by their bosses IS A LIE. See [section 4.2](https://docs.google.com/document/d/15myK_6eTxEPuKnDi5krjBM_0jrv3GELs8TGmqOYBvug/edit#heading=h.akz9hanp4wxi) for strong evidence to the contrary from multiple sources.
- >A WIRED investigation finds that major players like Activision Blizzard, which recently laid off scores of workers, are using generative AI for game development.
- A recent survey from the organizers of the Game Developers Conference found that 49 percent of the survey’s more than 3,000 respondents said their workplace used AI.
- “It’s here. It’s definitely here, right now,” says Violet, a game developer, technical artist, and a veteran of the industry who has worked on AAA games for over a decade. “I think everyone’s seen it get used, and it’s a matter of how and to what degree. The genie is out of the bottle, Pandora's box is opened.”
- Treyarch, a Southern California-based studio that produces some elements of Activision’s Call of Duty games, posted a job listing for a “2D Artist Animator.” The first thing listed under the “To succeed you should have …” section was “exceptional skills and expertise in digital sketching, drawing, and painting, as well as advanced expertise in working with generative AI tools such as Stable Diffusion, Vizcom, Dall-E, or equivalent.”
- Blizzard is building its own AI system too, which at one time was named Blizzard Diffusion—though details are scarce, beyond a patent the company filed for a “machine-learning based 2D structured image generation” system. “Blizzard's ‘internal AI’ that they trained is still super secretive. Only those who have access to it work with it, and no one else knows how it works,” Warner claims.
- >In numerous industries, AI is being used to produce work that was once the exclusive domain of the human mind
- >He led a team of more than 60 writers and editors… The business introduced an automated system. Miller's manager would plug a headline for an article into an online form, an AI model would generate an outline based on that title, and Miller would get an alert on his computer. Instead of coming up with their own ideas, his writers would create articles around those outlines, and Miller would do a final edit before the stories were published. Miller only had a few months to adapt before he got news of a second layer of automation. Going forward, ChatGPT would write the articles in their entirety, and most of his team was fired. The few people remaining were left with an even less creative task: editing ChatGPT's subpar text to make it sound more human.
- >By 2024, the company laid off the rest of Miller's team, and he was alone. "All of a sudden I was just doing everyone's job," Miller says. Every day, he'd open the AI-written documents to fix the robot's formulaic mistakes, churning out the work that used to employ dozens of people.
- Yum Brands said the tech has improved order accuracy, reduced wait times, decreased employees’ task load and fueled profitable growth.
- >Industry group says rise of vocal technology could upend many creative fields, including audiobooks – the canary in the coalmine for voice actors
- Over 32 techniques to reduce hallucinations:
- AI virtual financial assistant has logged 1.5B customer interactions since 2018 launch
- Klarna has already cut over 1,000 employees and plans to remove nearly 2,000 more
- A Swedish financial services firm specializing in direct payments, pay-after-delivery options, and installment plans is preparing to reduce its workforce by nearly 50 per cent as artificial intelligence automation becomes more prevalent.
- Klarna reported a 73 percent increase in average revenue per employee compared to last year.
- Klarna's interim results demonstrated a 27 percent increase in revenue, reaching 12.3 billion Swedish krona (£990 million). Additionally, the company transitioned from a loss of 456 million krona in the previous year to an adjusted profit of 673 million krona. The job cuts occur amidst a turnaround strategy at Klarna.
- - Klarna's AI assistant, powered by @OpenAI, has in its first 4 weeks handled 2.3 million customer service chats, and the data and insights are staggering:
- - Handles 2/3rds of our customer service enquiries
- - On par with humans on customer satisfaction
- - Higher accuracy leading to a 25% reduction in repeat inquiries
- - Customers resolve their errands in 2 min vs 11 min
- - Live 24/7 in over 23 markets, communicating in over 35 languages
- - It performs the equivalent job of 700 full time agents
- Bernie now uses ChatGPT and Bing to compile all the technical information he needs for a query in less than five minutes. It’s doubled the number of customer complaints he can handle in a day.
- “It made my work easier. I can even get ideas on how to approach certain complaints, making [my answers] appear engaging, persuasive, empathetic. It can give you that, depending on the prompt that you input,” Bernie told Rest of World.
- the consulting giant Accenture estimated that A.I. could replace or supplement nearly three-quarters of bank employees’ working hours across the industry.
- This week, JPMorgan Chase’s chief executive, Jamie Dimon, wrote in his annual shareholder letter that A.I. “may reduce certain job categories or roles,” and labeled the technology top among the most important issues facing the nation’s largest bank. Mr. Dimon compared the consequences to those of “the printing press, the steam engine, electricity, computing and the internet, among others.”
- Deutsche Bank is uploading reams of financial data into proprietary A.I. tools that can instantaneously answer questions about publicly traded companies and create summary documents on complementary financial moves that might benefit a client — and earn the bank a profit.
- Mr. Horine said he could use A.I. to identify clients that might be ripe for a bond offering, the sort of bread-and-butter transaction for which investment bankers charge clients millions of dollars.
- Goldman Sachs has assigned 1,000 developers to test A.I., including software that can turn what it terms “corpus” information — or enormous amounts of text and data collected from thousands of sources — into page presentations that mimic the bank’s typeface, logo, styles and charts. One firm executive privately called it a “Kitty Hawk moment,” or one that would change the course of the firm’s future.
- That isn’t limited to investment banking; BNY Mellon’s chief executive said on a recent earnings call that his research analysts could now wake up two hours later than usual, because A.I. can read overnight economic data and create a written draft of analysis to work from.
- A senior Morgan Stanley executive told employees in a January private meeting, a video of which was viewed by The New York Times, that he would “get A.I. into every area of what we do,” including wealth management, where the bank employs thousands of people to determine the proper mix of investments for well-off savers.
- Bank of America’s chief executive said last year that the technology was already enabling the firm to hire less.
- Among Goldman Sachs’s sprawling A.I. efforts is a tool under development that can transfigure a lengthy PowerPoint document into a formal “S-1,” the legalese-packed document for initial public offerings required for all listed companies. The software takes less than a second to complete the job.
- According to company-released data, the AI bots are 16% better than nurses at identifying a medication’s impact on lab values, 24% more accurate detecting toxic dosages of over-the-counter drugs, and 43% better at identifying condition-specific negative interactions from OTC meds. All that at $9 an hour compared to the $39.05 median hourly pay for U.S. nurses. These AI nurse-bots are designed to make new diagnoses, manage chronic disease, and give patients a detailed but clear explanation of clinicians’ advice.
- We find that the top occupations exposed to language modeling include telemarketers and a variety of post-secondary teachers such as English language and literature, foreign language and literature, and history teachers. We find the top industries exposed to advances in language modeling are legal services and securities, commodities, and investments.
- “GenAI will save [Klarna] $10m in marketing this year. We’re spending less on photographers, image banks, and marketing agencies” https://archive.is/tVW2N
- - $6m less on producing images.
- - 1,000 in-house AI-produced images in 3 months. Includes the creative concept, quality check, and legal compliance.
- - AI-image production reduced from 6 WEEKS TO 1 WEEK ONLY.
- - Customer response to AI images on par with human produced images.
- - Cutting external marketing agency costs by 25% (mainly translation, production, CRM, and social agencies).
- - Our in-house marketing team is HALF the size it was last year but is producing MORE!
- - We’ve removed the need for stock imagery from image banks like @gettyimages
- - Now we use genAI tools like Midjourney, DALL-E, and Firefly to generate images, and Topaz Gigapixel and Photoroom to make final adjustments.
- - Faster images means more app updates, which is great for customers. And our employees get to work on more fun projects AND we're saving money.
- BP Earnings Call: We need 70% less coders from third parties to code as the AI handles most of the coding, the human only needs to look at the final 30% to validate it, that's a big savings for the company moving forward.
Second, things like call centers: the language models have become so sophisticated now. They can operate in multiple languages, 14, 15 languages easily. In the past, that hasn't been something we can do. So we can redeploy people off that given that the AI can do it. You heard my advertising example last quarter where advertising cycle times moved from four to five months down to a couple of weeks. So that's obviously reducing spend with third parties. We've now got Gen AI in the hands through Microsoft Copilot across many, many parts of the business and we'll continue to update you with anecdotes as we go through.
- This is almost certainly true because this is quoted from an earnings call from BP and lying to investors is a crime (securities fraud). This would include lying about the reason for getting rid of the workers (in other words, it can’t just be layoffs). The numbers that are provided are also too specific to be exaggerations without also being a lie.
- For the past six years, AI adoption by respondents’ organizations has hovered at about 50 percent. This year, the survey finds that adoption has jumped to 72 percent (Exhibit 1). And the interest is truly global in scope. Our 2023 survey found that AI adoption did not reach 66 percent in any region; however, this year more than two-thirds of respondents in nearly every region say their organizations are using AI
- In the latest McKinsey Global Survey on AI, 65 percent of respondents report that their organizations are regularly using gen AI, nearly double the percentage from our previous survey just ten months ago.
- Respondents’ expectations for gen AI’s impact remain as high as they were last year, with three-quarters predicting that gen AI will lead to significant or disruptive change in their industries in the years ahead
- Organizations are already seeing material benefits from gen AI use, reporting both cost decreases and revenue jumps in the business units deploying the technology.
- [First Results from Med-Gemini (the successor to Med-Palm, a medically fine tuned LLM). "More accurate multimodal conversations about medical images🩻, surgical videos📽️, genomics🧬, ultra-long health records📚, ECGs🫀 & more with state-of-art performance across multiple benchmarks"](https://twitter.com/alan_karthi/status/1785117444383588823 )
- Of companies currently using AI, 37% say workers were laid off in 2023 because they were no longer needed due to the company’s use of AI.
- In 2024, 44% of companies who use AI or plan to by next year say employees will definitely (21%) or probably (23%) be laid off due to the use of AI.
- By 2030, activities that account for up to 30 percent of hours currently worked across the US economy could be automated—a trend accelerated by generative AI. However, we see generative AI enhancing the way STEM, creative, and business and legal professionals work rather than eliminating a significant number of jobs outright. Automation’s biggest effects are likely to hit other job categories. Office support, customer service, and food service employment could continue to decline.
- An additional 12 million occupational transitions may be needed by 2030. As people leave shrinking occupations, the economy could reweight toward higher-wage jobs. Workers in lower-wage jobs are up to 14 times more likely to need to change occupations than those in highest-wage positions, and most will need additional skills to do so successfully. Women are 1.5 times more likely to need to move into new occupations than men.
- - Generates high-quality images up to 512×512 pixels🎯
- - Uses efficient row-wise attention to reduce computation⚡
- - 12x more efficient than SOTA methods💪
- Anthropic and other big AI startups are teaching chatbots “tool use,” to make them more useful in the workplace.
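As an example of what "tool use" looks like in practice, here is a minimal sketch using Anthropic's Messages API tool-calling interface as documented in mid-2024; the get_order_status tool itself is hypothetical:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# A tool is just a JSON-schema description; the model decides when to
# call it and with what arguments. `get_order_status` is hypothetical.
tools = [{
    "name": "get_order_status",
    "description": "Look up the status of a customer order by its ID.",
    "input_schema": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}]

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "Where is order 1234?"}],
)
# When the model wants the tool, response.stop_reason == "tool_use" and
# the requested arguments appear in response.content.
print(response.stop_reason)
```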
- According to their estimates, about 47 per cent of total US employment is at risk. They further provide evidence that wages and educational attainment exhibit a strong negative relationship with an occupation’s probability of computerisation.
- >Already, AI is being woven into the workplace at an unexpected scale. 75% of knowledge workers use AI at work today, and 46% of users started using it less than six months ago. It’s paying off:
- Users say AI helps them save time (90%), focus on their most important work (85%), be more creative (84%), and enjoy their work more (83%).
- 78% of AI users are bringing their own AI tools to work (BYOAI)—it’s even more common at small and medium-sized companies (80%).
- 53% of people who use AI at work worry that using it on important work tasks makes them look replaceable.
- While some professionals worry AI will replace their job (45%), about the same share (46%) say they’re considering quitting in the year ahead—higher than the 40% who said the same ahead of 2021’s Great Reshuffle.
- "I'm very aware that I could wake up tomorrow and my job could be gone,” says Jess Hyland.
- The video game artist says the industry she’s spent almost 15 years working in is on “shaky” ground at the moment.
- Some bosses are talking up the potential of generative AI - the tech behind tools such as ChatGPT - as a potential saviour.
- Tech giant Nvidia has shown off impressive development tool prototypes, and gaming industry heavyweights such as Electronic Arts and Ubisoft are investing in the tech.
- It's claimed AI tools can save development time, free workers up to focus on creativity and provide a more personalised user experience.
- Against the backdrop of widespread layoffs, Jess says the suspicion among workers is that bosses see AI as a path to cutting costs when labour is their biggest expense.
- Jess says she knows one person who's lost work due to AI, and has heard of it happening to others.
- There are also dozens of accounts online suggesting that jobs in concept art and other traditionally entry-level roles have been affected.
- Rather than creating their own material, says Jess, artists worry they could end up supplementing AI's efforts, rather than the other way around.
- 40% fewer tasks means fewer staff needed
- This isn’t marketing hype since the recording was not meant to be public
- Lying about this goes AGAINST the interests of the company since it encourages their own workers to consider leaving the industry
- This can be especially powerful for manual tasks and complex tools — in this example, what might ordinarily take 10+ clicks in Salesforce can be now done with just a sentence.
- Working in-depth in tools like spreadsheets, ACT-1 demonstrates real-world knowledge, infers what we mean from context, and can help us do things we may not even know how to do.
- The model can also complete tasks that require composing multiple tools together; most things we do on a computer span multiple programs. In the future, we expect ACT-1 to be even more helpful by asking for clarifications about what we want.
- The internet contains a lot of knowledge about the world! When the model doesn’t know something, it knows how to just look up the information online
- ACT-1 doesn’t know how to do everything, but it’s highly coachable. With 1 piece of human feedback, it can correct mistakes, becoming more useful with each interaction.
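Adept has not published ACT-1's interface, but the loop it describes (observe the screen, pick a UI action, repeat) can be caricatured like this; every function is a hypothetical placeholder:

```python
goal = "Add a new lead named Jane Doe in Salesforce"
history: list[str] = []

def observe_browser() -> str:
    return "<rendered DOM snapshot>"          # what the model "sees"

def next_action(goal: str, obs: str, history: list[str]) -> str:
    # Placeholder model call; stops after a few steps for the demo.
    return "done" if len(history) >= 3 else f"click(step {len(history)})"

def execute(action: str) -> None:
    print("executing:", action)               # drive the real UI here

while True:                                   # one sentence -> many UI steps
    action = next_action(goal, observe_browser(), history)
    if action == "done":
        break
    execute(action)
    history.append(action)
```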
- For some, the rapid deployment of such tools has been a harsh awakening. Christopher Bautista, 47, had worked in the call center industry for nearly two decades. In his last job on a tech support desk he’d watched as AI took on more responsibility in gatekeeping customer calls and asking questions before routing to human agents. Then last November, along with about 70 other people, he says, he was abruptly put on so-called floating status — no work, no pay, but still on the books — after the client pulled the contract. He quit six months later for a job in sales while still waiting for reassignment. "AI will take over our jobs,” Bautista said. “It’s cheaper and more efficient."
- A local newspaper in Hawaii has turned to AI-generated presenters to draw in new audiences.
5.1. Robotics
- With one claw, the machine lifted a tiny plastic ring from an equally tiny peg on the table, passed the ring from one claw to the other, moved it across the table and gingerly hooked it onto a new peg. Then the robot did the same with several more rings, completing the task as quickly as it had when guided by Dr. Fer. The training exercise was originally designed for humans; moving the rings from peg to peg is how surgeons learn to operate robots like the one in Berkeley. Now, an automated robot performing the test can match or even exceed a human in dexterity, precision and speed, according to a new research paper from the Berkeley team.
- The project is part of a much wider effort to bring artificial intelligence into the operating room. Using many of the same technologies that underpin self-driving cars, autonomous drones and warehouse robots, researchers are working to automate surgical robots too. These methods are still a long way from everyday use, but progress is accelerating.
- Robots can already exceed human accuracy on some surgical tasks, like placing a pin into a bone (a particularly risky task during knee and hip replacements). The hope is that automated robots can bring greater accuracy to other tasks, like incisions or suturing, and reduce the risks that come with overworked surgeons.
- ⦿ Digit does 2 hours of work per 1 hour of charging. The next-gen will do 8-10 hours of work per 1 hour of charging.
- ⦿ Lifting capacity will go from 30 lbs to 50 lbs.
- ⦿ ROI goal for Digit is 2 years, based on $30/hour human labor.
- A robot can hold a squash, pumpkin or melon in one hand, while it is peeled by the other
- AI allows robots to listen to verbal instructions while learning to correctly perform household tasks. That could enable more natural interactions between humans and robots
- Amazon’s mobile robot army grew from around 10,000 in 2013 to 750,000 by 2023, and the sheer scale of the company’s operations meant that it could deliver millions of items faster and cheaper than anyone else
- As WIRED revealed last year, Amazon has in recent years developed new robotic systems that rely on machine learning to do things like perceive, grab, and sort packed boxes. Again, Amazon is leveraging scale to its advantage, with the training data being gathered as items flow through its facilities helping to improve the performance of different algorithms. The effort has already led to further automation of the work that had previously been done by human workers at some fulfillment centers.
AI Can Code
- o1-preview is weaker than the full o1 model
- Keep in mind the type of programmer who even joins programming competitions in the first place is definitely far more skilled than the average code monkey, and it’s STILL much better than those guys.
- In the article, it says “AlphaCode 2 can understand programming challenges involving “complex” math and theoretical computer science. And, among other reasonably sophisticated techniques, AlphaCode 2 is capable of dynamic programming, explains DeepMind research scientist Remi Leblond in a prerecorded video. Dynamic programming entails simplifying a complex problem by breaking it down into easier sub-problems over and over; Leblond says that AlphaCode 2 knows not only when to properly implement this strategy but where to use it. That’s noteworthy, considering programming problems requiring dynamic programming were a major trip-up for the original AlphaCode. “[AlphaCode 2] needs to show some level of understanding, some level of reasoning and designing of code solutions before it can get to the actual implementation to solve [a] coding problem,” Leblond said. “And it does all that on problems it’s never seen before.”
- o1 improves over GPT-4o on a wide range of benchmarks, including 54/57 MMLU subcategories
- OpenAI o1 ranks in the 89th percentile on competitive programming questions (Codeforces), places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME), and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA).
- On the 2024 AIME exams, GPT-4o only solved on average 12% (1.8/15) of problems. o1 averaged 74% (11.1/15) with a single sample per problem, 83% (12.5/15) with consensus among 64 samples, and 93% (13.9/15) when re-ranking 1000 samples with a learned scoring function. A score of 13.9 places it among the top 500 students nationally and above the cutoff for the USA Mathematical Olympiad.
- We trained a model that scored 213 points and ranked in the 49th percentile in the 2024 International Olympiad in Informatics (IOI), by initializing from o1 and training to further improve programming skills. This model competed in the 2024 IOI under the same conditions as the human contestants. It had ten hours to solve six challenging algorithmic problems and was allowed 50 submissions per problem.
- With a relaxed submission constraint, we found that model performance improved significantly. When allowed 10,000 submissions per problem, the model achieved a score of 362.14 – above the gold medal threshold – even without any test-time selection strategy.
- Our evaluations closely matched competition rules and allowed for 10 submissions. GPT-4o achieved an Elo rating of 808, which is in the 11th percentile of human competitors. This model far exceeded both GPT-4o and o1—it achieved an Elo rating of 1807, performing better than 93% of competitors.
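To make the sampling jargon above concrete (majority-vote "consensus" and re-ranking with a learned scoring function, as in the AIME bullet), here is a minimal sketch; the toy scorer stands in for o1's learned scoring model, which is not public:

```python
from collections import Counter

def consensus(samples: list[str]) -> str:
    """Majority vote over independently sampled answers."""
    return Counter(samples).most_common(1)[0][0]

def rerank(samples: list[str], score) -> str:
    """Pick the sample a scoring function rates highest (best-of-n)."""
    return max(samples, key=score)

answers = ["427", "427", "431", "427", "419"]
print(consensus(answers))                           # -> "427"
print(rerank(answers, lambda a: answers.count(a)))  # toy stand-in scorer
```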
- https://cdn.openai.com/o1-system-card.pdf
- We find that o1-preview is less prone to selecting stereotyped options than GPT-4o, and o1-mini has comparable performance to GPT-4o-mini. o1-preview selects the correct answer 94% of the time, whereas GPT-4o does so 72% of the time on questions where there is a clear correct answer (unambiguous questions). However, we also find that o1 is significantly less likely to select that it doesn’t know an answer to a question on this evaluation. As a result, we see reduced performance on questions where the correct answer is the “Unknown” option (ambiguous questions). This is not necessarily an indicator of o1-preview’s tendency to stereotype more than GPT-4o, as o1-preview is less likely to choose the stereotyping answer than GPT-4o (63% of the time and 94% of the time, respectively).
- NOTE: this graph does not include the full base o1 model
- Lying about this goes against his interest as discouraging coders would make them harder to find and more expensive to employ due to a lower supply. It would be a very stupid and short-sighted way of generating hype for the sake of hype when there are many other ways to do it that would not increase long-term costs for his company.
- This isn’t marketing hype since the recording was not meant to be public
- Lying about this goes AGAINST the interests of the company since it encourages their own workers to consider leaving the industry
- It completed it in 6 shots with no external feedback for some very complicated code from very obscure Python directories
6.1. Practical Use/Software Engineering
- o1 improves over GPT-4o on a wide range of benchmarks, including 54/57 MMLU subcategories
- OpenAI o1 ranks in the 89th percentile on competitive programming questions (Codeforces), places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME), and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA).
- On the 2024 AIME exams, GPT-4o only solved on average 12% (1.8/15) of problems. o1 averaged 74% (11.1/15) with a single sample per problem, 83% (12.5/15) with consensus among 64 samples, and 93% (13.9/15) when re-ranking 1000 samples with a learned scoring function. A score of 13.9 places it among the top 500 students nationally and above the cutoff for the USA Mathematical Olympiad.
- We trained a model that scored 213 points and ranked in the 49th percentile in the 2024 International Olympiad in Informatics (IOI), by initializing from o1 and training to further improve programming skills. This model competed in the 2024 IOI under the same conditions as the human contestants. It had ten hours to solve six challenging algorithmic problems and was allowed 50 submissions per problem.
- With a relaxed submission constraint, we found that model performance improved significantly. When allowed 10,000 submissions per problem, the model achieved a score of 362.14 – above the gold medal threshold – even without any test-time selection strategy.
- Our evaluations closely matched competition rules and allowed for 10 submissions. GPT-4o achieved an Elo rating of 808, which is in the 11th percentile of human competitors. This model far exceeded both GPT-4o and o1—it achieved an Elo rating of 1807, performing better than 93% of competitors.
- Note: This is the weakest model compared to o1-preview and the full o1 model
- OpenAI’s o1 model can get perfect scores on their research engineer interview coding questions:
- What percent of code is now written by AI? "I ask all the software companies I meet about this. The number is rarely lower than 40%. For some young programmers it's 90%." - Paul Graham of Y Combinator on Twitter
- Not as good as the Opus model they said is coming out later this year
- 80% lower cost than Claude 3 Opus
- 2x speed over Claude 3 Opus
- Decent math and coding jump: 10% better on MATH, 9% better on GPQA
- Can convert research paper descriptions to code: https://x.com/VictorTaelin/status/1803816296410190286
- Yves does NOT explain how to implement the system at all, he just defines it in mathematical terms. By all means, ICs aren't hard to implement, but understanding what the paper is saying without images is tough. The best models so far always outputted 100% bullshit code. I just tested again and Opus/GPT-4 outputs are always just gibberish. Sonnet 3.5 did surprisingly well
- ChipNeMo-70B outperforms the highly capable GPT-4 on two of our use cases, namely engineering assistant chatbot and EDA scripts generation, while exhibiting competitive performance on bug summarization and analysis. These results underscore the potential of domain-specific customization for enhancing the effectiveness of large language models in specialized applications: https://arxiv.org/pdf/2311.00176
- BP Earnings Call: We need 70% less coders from third parties to code as the AI handles most of the coding, the human only needs to look at the final 30% to validate it, that's a big savings for the company moving forward.
- This is almost certainly true because this is quoted from an earnings call from BP, and lying to investors is a crime (securities fraud), the same offense at the heart of the Theranos scandal. This would include lying about the reason (in other words, it can’t just be layoffs). The numbers that are provided are also too specific to be exaggerations without also being a lie.
- The best performing agent on the SWEBench Verified leaderboard was created in August with a 45.2% solve rate, indicating that improvements are currently being made: https://www.swebench.com/
- In October 2023 (about 1 year ago), the highest scoring one on SWEBench Verified got only 4.4%
- When we apply repeated sampling to SWE-bench Lite, the fraction of issues solved with DeepSeek-V2-Coder-Instruct increases from 15.9% with one sample to 56% with 250 samples, outperforming the single-attempt state-of-the-art of 43% which uses more capable frontier models. Moreover, using current API pricing, amplifying the cheaper DeepSeek model with five samples is more cost-effective and solves more issues than paying a premium for one sample from GPT-4o or Claude 3.5 Sonnet. Interestingly, the relationship between coverage and the number of samples is often log-linear and can be modeled with an exponentiated power law, suggesting the existence of inference-time scaling laws.
- When solving math word problems from GSM8K and MATH, coverage with Llama-3 models grows to over 95% with 10,000 samples.
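The "coverage" numbers above are pass@k-style estimates. A small self-contained helper shows how such coverage is typically computed from repeated samples; this is the standard unbiased estimator from the code-generation literature (Chen et al., 2021), not the paper's own script:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of coverage: P(at least one of k samples drawn
    from n is correct), given that c of the n samples were correct."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# e.g. 250 samples on one issue, 40 of them correct:
print(pass_at_k(250, 40, 1))    # ~0.16 coverage with a single attempt
print(pass_at_k(250, 40, 100))  # ~1.0 coverage with 100 attempts
```

The paper's observation is that this coverage curve is often log-linear in the number of samples, i.e. it can be fit by an exponentiated power law of the form c(k) ≈ exp(a·k^b).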
- Can build a whole web app based on just a wireframe image
- Plandex is agent-based so it can keep chugging along automatically through multiple responses and files. Even if the app required like 20 files, Plandex would keep going until it finished them all. With ChatGPT you'd need to keep continually prompting it.
- “We tested AutoDev on the HumanEval dataset, obtaining promising results with 91.5% and 87.8% of Pass@1 for code generation and test generation respectively, demonstrating its effectiveness in automating software engineering tasks while maintaining a secure and user-controlled development environment.”
- “In a trial run by GitHub’s researchers, developers given an entry-level task and encouraged to use the program, called Copilot, completed their task 55 percent faster than those who did the assignment manually.”
- This isn’t a PR move since the memo was not supposed to be publicized.
6.2. Research
- ChipNeMo-70B outperforms the highly capable GPT-4 on two of our use cases, namely engineering assistant chatbot and EDA scripts generation, while exhibiting competitive performance on bug summarization and analysis. These results underscore the potential of domain-specific customization for enhancing the effectiveness of large language models in specialized applications: https://arxiv.org/pdf/2311.00176
- “this work has used the free version of ChatGPT (GPT-3.5) for acquiring the ChatGPT responses for the manual analysis.”
- “Thus, we chose to only consider the initial answer generated by ChatGPT.”
- “To understand how differently GPT-4 performs compared to GPT-3.5, we conducted a small analysis on 21 randomly selected [StackOverflow] questions where GPT-3.5 gave incorrect answers. Our analysis shows that, among these 21 questions, GPT-4 could answer only 6 questions correctly, and 15 questions were still answered incorrectly.”
- Since the 6/21 sample was drawn only from questions GPT-3.5 got wrong, this implies GPT-4 recovers roughly 28.6% of GPT-3.5’s misses. Adding that to the 48% GPT-3.5 answered correctly gives GPT-4 an estimated accuracy of about 0.48 + 0.52 × (6/21) ≈ 63% of the 517 questions, assuming GPT-4 correctly answers everything GPT-3.5 got right, which is highly likely considering GPT-4 is far higher quality than GPT-3.5.
- Note: This was all done in ONE SHOT with no repeat attempts or follow up.
- Also, the study was released before GPT-4o and o1 and may not have used GPT-4-Turbo, both of which are significantly higher quality in coding capacity than GPT 4 according to the LMSYS arena
- On top of that, both of those models are inferior to Claude 3.5 Sonnet: "In an internal agentic coding evaluation, Claude 3.5 Sonnet solved 64% of problems, outperforming Claude 3 Opus which solved 38%." Claude 3.5 Opus (which will be even better than Sonnet) is set to be released later this year.
- While large language models (LLMs) now excel at code generation, a key aspect of software development is the art of refactoring: consolidating code into libraries of reusable and readable programs. In this paper, we introduce Lilo, a neurosymbolic framework that iteratively synthesizes, compresses, and documents code to build libraries tailored to particular problem domains. Lilo combines LLM-guided program synthesis with recent algorithmic advances in automated refactoring from Stitch: a symbolic compression system that efficiently identifies optimal 𝜆-abstractions across large code corpora. To make these abstractions interpretable, we introduce an auto-documentation (AutoDoc) procedure that infers natural language names and docstrings based on contextual examples of usage. In addition to improving human readability, we find that AutoDoc boosts performance by helping Lilo’s synthesizer to interpret and deploy learned abstractions. We evaluate Lilo on three inductive program synthesis benchmarks for string editing, scene reasoning, and graphics composition. Compared to existing methods—including the state-of-the-art libraries learning algorithm DreamCoder—Lilo solves more complex tasks and learns richer libraries that are grounded in linguistic knowledge.
- Such NARs proved effective as generic solvers for algorithmic tasks, when specified in graph form. To make their embeddings accessible to a Transformer, we propose a hybrid architecture with a two-phase training procedure, allowing the tokens in the language model to cross-attend to the node embeddings from the NAR. We evaluate our resulting TransNAR model on CLRS-Text, the text-based version of the CLRS-30 benchmark, and demonstrate significant gains over Transformer-only models for algorithmic reasoning, both in and out of distribution.
- OMNI-EPIC leverages foundation models to autonomously generate code specifying the next learnable (i.e., not too easy or difficult for the agent’s current skill set) and interesting (e.g., worthwhile and novel) tasks. OMNI-EPIC generates both environments (e.g., an obstacle course) and reward functions (e.g., progress through the obstacle course quickly without touching red objects), enabling it, in principle, to create any simulatable learning task. We showcase the explosive creativity of OMNI-EPIC, which continuously innovates to suggest new, interesting learning challenges. We also highlight how OMNI-EPIC can adapt to reinforcement learning agents’ learning progress, generating tasks that are of suitable difficulty. Overall, OMNI-EPIC can endlessly create learnable and interesting environments, further propelling the development of self-improving AI systems and AI-Generating Algorithms. Project website with videos: https://dub.sh/omniepic.
6.3. Feats
- [GPT4o creates Flappy Bird in a single simple prompt](https://x.com/minchoi/status/1787836907566531056)
- Claude 3 Creates a Multi-Player Application with a Single Prompt: https://www.reddit.com/r/singularity/comments/1b8f5q3/claude_3_creates_a_multiplayer_application_with_a/
AI Is Not Low Effort
- AI art is very similar to photography. Both can be as simple as clicking a button or be much more complex. For example, creating with Stable Diffusion can involve using ControlNet, IPAdapter, LowRa, IC-Light, animation extensions, very complicated ComfyUI workflows, and much more to get the result you want. Additionally, both involve a machine doing most of the actual creation process, where the camera/AI creates the images, while the artist guides it on what the end result should be and completes post-processing work.
- Examples of complex ComfyUI workflows: https://civitai.com/models/33192/comfyui-impact-pack
- Art is not about effort anyway; Duchamp’s Fountain took very little effort but is widely considered pivotal in art history.
- A weird thing about LLMs is that they are very bad at a lot of things (including word games), but you can often get them to do those things anyway if you ask in the right way, usually by giving clearer instructions or asking the AI to think step by step: https://x.com/emollick/status/1795604532187263198
- >You can feed a phrase like “an oil painting of an angry strawberry” to Midjourney and receive several images from the AI system within seconds, but Allen’s process wasn’t that simple. To get the final three images he entered in the competition, he said, took more than 80 hours.
- First, he said, he played around with phrasing that led Midjourney to generate images of women in frilly dresses and space helmets — he was trying to mash up Victorian-style costuming with space themes, he said. Over time, with many slight tweaks to his written prompt (such as to adjust lighting and color harmony), he created 900 iterations of what led to his final three images. He cleaned up those three images in Photoshop, such as by giving one of the female figures in his winning image a head with wavy, dark hair after Midjourney had rendered her headless. Then he ran the images through another software program called Gigapixel AI that can improve resolution and had the images printed on canvas at a local print shop.
- >Cal Duran, an artist and art teacher who was one of the judges for the competition, said that while Allen’s piece included a mention of Midjourney, he didn’t realize that it was generated by AI when judging it. Still, he sticks by his decision to award it first place in its category, he said, calling it a “beautiful piece”.
- >“I think there’s a lot involved in this piece and I think the AI technology may give more opportunities to people who may not find themselves artists in the conventional way,” he said.
AI Is Reliable/Addressing Hallucinations
- o1 improves over GPT-4o on a wide range of benchmarks, including 54/57 MMLU subcategories
- OpenAI o1 ranks in the 89th percentile on competitive programming questions (Codeforces), places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME), and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA).
- On the 2024 AIME exams, GPT-4o only solved on average 12% (1.8/15) of problems. o1 averaged 74% (11.1/15) with a single sample per problem, 83% (12.5/15) with consensus among 64 samples, and 93% (13.9/15) when re-ranking 1000 samples with a learned scoring function. A score of 13.9 places it among the top 500 students nationally and above the cutoff for the USA Mathematical Olympiad.
- We trained a model that scored 213 points and ranked in the 49th percentile in the 2024 International Olympiad in Informatics (IOI), by initializing from o1 and training to further improve programming skills. This model competed in the 2024 IOI under the same conditions as the human contestants. It had ten hours to solve six challenging algorithmic problems and was allowed 50 submissions per problem.
- With a relaxed submission constraint, we found that model performance improved significantly. When allowed 10,000 submissions per problem, the model achieved a score of 362.14 – above the gold medal threshold – even without any test-time selection strategy.
- Our evaluations closely matched competition rules and allowed for 10 submissions. GPT-4o achieved an Elo rating of 808, which is in the 11th percentile of human competitors. This model far exceeded both GPT-4o and o1—it achieved an Elo rating of 1807, performing better than 93% of competitors.
- https://cdn.openai.com/o1-system-card.pdf
- We find that o1-preview is less prone to selecting stereotyped options than GPT-4o, and o1-mini has comparable performance to GPT-4o-mini. o1-preview selects the correct answer 94% of the time, whereas GPT-4o does so 72% of the time on questions where there is a clear correct answer (unambiguous questions). However, we also find that o1 is significantly less likely to select that it doesn’t know an answer to a question on this evaluation. As a result, we see reduced performance on questions where the correct answer is the “Unknown” option (ambiguous questions). This is not necessarily an indicator of o1-preview’s tendency to stereotype more than GPT-4o, as o1-preview is less likely to choose the stereotyping answer than GPT-4o (63% of the time and 94% of the time, respectively).
- Note: This is the weakest model compared to o1-preview and the full o1 model
- OpenAI’s o1 model can get perfect scores on their research engineer interview coding questions:
- Reflection-Tuning enables LLMs to recognize their mistakes, and then correct them before committing to an answer.
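- A minimal sketch of the reflect-before-committing pattern at inference time (Reflection-Tuning itself bakes this behavior into the model via fine-tuning; `ask` below is an assumed wrapper around any LLM API, not part of the method):

```python
def reflect_then_answer(ask, question: str) -> str:
    """Draft an answer, have the model critique it, then produce a
    corrected final answer before committing."""
    draft = ask(f"Question: {question}\nDraft an answer.")
    critique = ask(f"Question: {question}\nDraft answer: {draft}\n"
                   f"List any mistakes or omissions in the draft.")
    return ask(f"Question: {question}\nDraft answer: {draft}\n"
               f"Critique: {critique}\nWrite the corrected final answer.")
```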
- “Additionally, the new Mistral Large 2 is trained to acknowledge when it cannot find solutions or does not have sufficient information to provide a confident answer. This commitment to accuracy is reflected in the improved model performance on popular mathematical benchmarks, demonstrating its enhanced reasoning and problem-solving skills”
- Two things became apparent during these tests. One is that, except for a few edge cases, semantic entropy caught more false answers than any other method. The second is that most errors produced by LLMs appear to be confabulations. That can be inferred from the fact that some of the other methods catch a variety of error types, yet they were outperformed by semantic entropy tests, even though these tests only catch confabulations.
- The researchers also demonstrate that the system can be adapted to work with more than basic factual statements by altering to handle biographies, which are a large collection of individual facts. So they developed software that broke down biographical information into a set of individual factual statements and evaluated each of these using semantic entropy. This worked on a short biography with as many as 150 individual factual claims.
- Overall, this seems to be a highly flexible system that doesn't require major new developments to put into practice and could provide some significant improvements in LLM performance. And, since it only catches confabulations and not other types of errors, it might be possible to combine it with other methods to boost performance even further.
- As the researchers note, the work also implies that, buried in the statistics of answer options, LLMs seem to have all the information needed to know when they've got the right answer; it's just not being leveraged. As they put it, "The success of semantic entropy at detecting errors suggests that LLMs are even better at 'knowing what they don’t know' than was argued... they just don’t know they know what they don’t know."
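- A minimal sketch of the semantic-entropy idea under simplifying assumptions: sample several answers, cluster the ones that share a meaning (the actual method uses a bidirectional-entailment check from an NLI model; here `same_meaning` is any callable), and treat high entropy over the clusters as a confabulation signal:

```python
import math

def semantic_entropy(answers, same_meaning):
    """Cluster sampled answers by meaning, then compute entropy over the
    cluster distribution. High entropy means the model's samples do not
    agree on one meaning, which suggests a confabulation."""
    clusters = []
    for a in answers:
        for c in clusters:
            if same_meaning(a, c[0]):
                c.append(a)
                break
        else:
            clusters.append([a])
    n = len(answers)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)

# Toy usage with a trivial stand-in for the entailment check:
answers = ["Paris", "paris", "Lyon", "Paris", "Marseille"]
print(semantic_entropy(answers, lambda a, b: a.lower() == b.lower()))  # ~0.95
```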
- We trained strong language models to produce text that is easy for weak language models to verify and found that this training also made the text easier for humans to evaluate.
- In this paper, we aim to alleviate the pathology by introducing Q*, a general, versatile and agile framework for guiding LLMs decoding process with deliberative planning. By learning a plug-and-play Q-value model as heuristic function, our Q* can effectively guide LLMs to select the most promising next step without fine-tuning LLMs for each task, which avoids the significant computational overhead and potential risk of performance degeneration on other tasks. Extensive experiments on GSM8K, MATH and MBPP confirm the superiority of our method.
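- The core mechanism is best-first search over partial generations, ranked by a learned Q-value heuristic rather than greedy next-token probability. A minimal sketch, where `expand` (candidate next steps from an LLM) and `q_value` (the learned heuristic) are assumed callables standing in for the paper's components:

```python
import heapq

def q_star_decode(start, expand, q_value, is_terminal, max_steps=1000):
    """Best-first search over partial sequences: always extend the partial
    sequence that the Q-value model considers most promising."""
    counter = 0  # tie-breaker so the heap never compares sequences directly
    frontier = [(-q_value(start), counter, start)]
    for _ in range(max_steps):
        if not frontier:
            break
        _, _, seq = heapq.heappop(frontier)
        if is_terminal(seq):
            return seq
        for step in expand(seq):
            counter += 1
            child = seq + [step]
            heapq.heappush(frontier, (-q_value(child), counter, child))
    return None
```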
- Over 32 techniques to reduce hallucinations:
- >Significantly outperforms few-shot prompting, SFT and other self-play methods by an average of 19% using demonstrations as feedback directly with <10 examples
- The new GPT-4o is slightly better and 33% cheaper than the old one! Right now, it's only a tad below Sonnet 3.5 on Livebench!
- Not as good as the Opus model they said is coming out later this year
- 80% lower cost than Claude 3 Opus
- 2x speed over Claude 3 Opus
- Decent math and coding jump: 10% better on MATH, 9% better on GPQA
- CONCLUSION: Causal reasoning is foundational to human intelligence, and has been conjectured to be necessary for achieving human level AI (Pearl, 2019). In recent years, this conjecture has been challenged by the development of artificial agents capable of generalising to new tasks and domains without explicitly learning or reasoning on causal models. And while the necessity of causal models for solving causal inference tasks has been established (Bareinboim et al., 2022), their role in decision tasks such as classification and reinforcement learning is less clear. We have resolved this conjecture in a model-independent way, showing that any agent capable of robustly solving a decision task must have learned a causal model of the data generating process, regardless of how the agent is trained or the details of its architecture. This hints at an even deeper connection between causality and general intelligence, as this causal model can be used to find policies that optimise any given objective function over the environment variables. By establishing a formal connection between causality and generalisation, our results show that causal world models are a necessary ingredient for robust and general AI.
- TLDR: a model that can reliably answer decision-based questions correctly must have learned the cause-and-effect structure that produces those outcomes.
- “Godfather of AI” and Turing Award winner Geoffrey Hinton: A neural net given training data where half the examples are incorrect still had an error rate of <=25% rather than 50% because it understands the rules and does better despite the false information: https://youtu.be/n4IQOBka8bc?si=wM423YLd-48YC-eY (14:00 timestamp)
- Klarna's AI assistant, powered by @OpenAI, has in its first 4 weeks handled 2.3 million customer service chats and the data and insights are staggering:
- Handles 2/3 of our customer service inquiries
- On par with humans on customer satisfaction
- Higher accuracy leading to a 25% reduction in repeat inquiries
- Customers resolve their errands in 2 min vs 11 min
- Live 24/7 in over 23 markets, communicating in over 35 languages
- It performs the equivalent job of 700 full-time agents
- “If the physician is unhappy with the note, Lall said, they can ask the AI model to regenerate the information or add more detail to any one of the categories. While the tool has some imperfections, she said, the improvements have been noticeable over the 10 months since she began using it. ‘I really feel this should be the next gold standard for all of our doctors. It decreases the cognitive load you feel at the end of the day,’ she said. The Ford government has been so impressed with the technology that it announced a pilot program to allow 150 family physicians to use AI Scribe as part of their practices. The health minister said the early signs were promising but stressed government would proceed carefully.”
- These are from real exams where the questions and solutions are not published online.
- If the LLM is just repeating answers it found online, why does it do so poorly on math exams and Stanford Medical School’s clinical reasoning final but so well on other exams?
- We introduce BSDETECTOR, a method for detecting bad and speculative answers from a pretrained Large Language Model by estimating a numeric confidence score for any output it generated. Our uncertainty quantification technique works for any LLM accessible only via a black-box API, whose training data remains unknown. By expending a bit of extra computation, users of any LLM API can now get the same response as they would ordinarily, as well as a confidence estimate that cautions when not to trust this response. Experiments on both closed and open-form Question-Answer benchmarks reveal that BSDETECTOR more accurately identifies incorrect LLM responses than alternative uncertainty estimation procedures (for both GPT-3 and ChatGPT). By sampling multiple responses from the LLM and considering the one with the highest confidence score, we can additionally obtain more accurate responses from the same LLM, without any extra training steps. In applications involving automated evaluation with LLMs, accounting for our confidence scores leads to more reliable evaluation in both human-in-the-loop and fully-automated settings (across both GPT 3.5 and 4).
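- A minimal sketch of the sampling half of this idea (the full method also mixes in a self-reflection score, omitted here; `ask` is an assumed wrapper around a black-box LLM API):

```python
from collections import Counter

def answer_with_confidence(ask, question: str, n: int = 8):
    """Sample the same black-box LLM several times; use agreement among
    samples as a confidence score and return the most common answer."""
    samples = [ask(question) for _ in range(n)]
    answer, count = Counter(samples).most_common(1)[0]
    return answer, count / n

# Usage sketch: flag low-confidence outputs instead of trusting them.
# answer, conf = answer_with_confidence(my_llm, "When was X founded?")
# if conf < 0.5:
#     print("Low confidence -- do not trust:", answer)
```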
- ChipNeMo-70B, outperforms the highly capable GPT-4 on two of our use cases, namely engineering assistant chatbot and EDA scripts generation, while exhibiting competitive performance on bug summarization and analysis. These results underscore the potential of domain-specific customization for enhancing the effectiveness of large language models in specialized applications: https://arxiv.org/pdf/2311.00176
- “this work has used the free version of ChatGPT (GPT-3.5) for acquiring the ChatGPT responses for the manual analysis.”
- “Thus, we chose to only consider the initial answer generated by ChatGPT.”
- “To understand how differently GPT-4 performs compared to GPT-3.5, we conducted a small analysis on 21 randomly selected SO questions where GPT-3.5 gave incorrect answers. Our analysis shows that, among these 21 questions, GPT-4 could answer only 6 questions correctly, and 15 questions were still answered incorrectly.”
- This is an extra 28.6% (6/21) on top of the 48% that GPT-3.5 answered correctly, totaling ~77% for GPT-4 (equal to (517*0.48+517*6/21)/517; the 517 cancels, leaving 0.48 + 6/21 ≈ 0.766), if we assume that GPT-4 correctly answers all of the questions that GPT-3.5 correctly answered, which is highly likely considering GPT-4 is far higher quality than GPT-3.5.
- Note: This was all done in ONE SHOT with no repeat attempts.
- Study was released before GPT-4o and did not use GPT 4 Turbo, both of which are significantly higher quality than GPT 4 according to the LMSYS arena. Now, this has also been surpassed by Claude Sonnet 3.5 with Opus 3.5 coming later this year.
- “We tested AutoDev on the HumanEval dataset, obtaining promising results with 91.5% and 87.8% of Pass@1 for code generation and test generation respectively, demonstrating its effectiveness in automating software engineering tasks while maintaining a secure and user-controlled development environment.”
- “In a trial run by GitHub’s researchers, developers given an entry-level task and encouraged to use the program, called Copilot, completed their task 55 percent faster than those who did the assignment manually.”
- Language models are rarely shown fruitful mistakes while training. They then struggle to look beyond the next token, suffering from a snowballing of errors and struggling to predict the consequence of their actions several steps ahead. In this paper, we show how language models can be taught to search by representing the process of search in language, as a flattened string — a stream of search (SoS). We propose a unified language for search that captures an array of different symbolic search strategies. We demonstrate our approach using the simple yet difficult game of Countdown, where the goal is to combine input numbers with arithmetic operations to reach a target number. We pretrain a transformer-based language model from scratch on a dataset of streams of search generated by heuristic solvers. We find that SoS pretraining increases search accuracy by 25% over models trained to predict only the optimal search trajectory. We further finetune this model with two policy improvement methods: Advantage-Induced Policy Alignment (APA) and Self-Taught Reasoner (STaR). The finetuned SoS models solve 36% of previously unsolved problems, including problems that cannot be solved by any of the heuristic solvers. Our results indicate that language models can learn to solve problems via search, self-improve to flexibly use different search strategies, and potentially discover new ones.
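- A minimal sketch of what a "stream of search" looks like for Countdown: a depth-first search whose every step, including dead ends and backtracking, is serialized into one flat string that a language model could be trained to imitate. The trace format below is an illustrative assumption, not the paper's exact serialization:

```python
from itertools import combinations

OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b}

def stream_of_search(nums, target, trace):
    """DFS over arithmetic combinations, logging every state visited."""
    trace.append(f"state {sorted(nums)}")
    if target in nums:
        trace.append("goal")
        return True
    for a, b in combinations(nums, 2):
        rest = [x for x in nums]
        rest.remove(a)
        rest.remove(b)
        for sym, op in OPS.items():
            trace.append(f"try {a}{sym}{b}")
            if stream_of_search(rest + [op(a, b)], target, trace):
                return True
            trace.append("backtrack")
    return False

trace = []
stream_of_search([3, 5, 2], 10, trace)  # e.g. (3+5)+2 = 10
print(" ; ".join(trace))                # the flattened search string
```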
- The paper presents MetaSumPerceiver (MSP), a novel summarization model for generating claim-specific summaries from multimodal, multi-document datasets to assist with fact-checking.
- MSP uses a dynamic perceiver-based model to handle multimodal inputs of arbitrary lengths, including documents, images, and claims.
- To train MSP, reinforcement learning with an entailment model as a reward signal is employed to refine summaries to provide relevant evidence for fact-checking.
- MSP incorporates a proxy reward mechanism with PPO to continually update the summarizer during fact-checking.
- The paper introduces the Multi-News-Fact-Checking dataset with over 100k labeled claims derived from Multi-News using Llama prompts.
- Experiments on MOCHEG and the new dataset show MSP substantially outperforms prior baselines, achieving state-of-the-art performance.
- The key innovation is using RL to optimize summarization specifically for claim verification versus generic summarization.
- Yum Brands said the tech has improved order accuracy, reduced wait times, decreased employees’ task load and fueled profitable growth.
- GPT-4 gets the classic river-crossing riddle (“in which order should I carry the chickens and the fox across the river?”) correct EVEN WITH A MAJOR CHANGE, if you replace the fox with a "zergling" and the chickens with "robots".
- Proof: https://chatgpt.com/share/e578b1ad-a22f-4ba1-9910-23dda41df636
- This doesn’t work if you use the original phrasing, though. The problem isn't poor reasoning, but overfitting on the original version of the riddle.
- LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA: https://huggingface.co/papers/2409.02897
- Stanford researchers: “Automating AI research is exciting! But can LLMs actually produce novel, expert-level research ideas? After a year-long study, we obtained the first statistically significant conclusion: LLM-generated ideas are more novel than ideas written by expert human researchers." https://x.com/ChengleiSi/status/1833166031134806330
- >Coming from 36 different institutions, our participants are mostly PhDs and postdocs. As a proxy metric, our idea writers have a median citation count of 125, and our reviewers have 327.
- >We also used an LLM to standardize the writing styles of human and LLM ideas to avoid potential confounders, while preserving the original content.
- Our bot performs better than experienced human forecasters and performs roughly the same as (and sometimes even better than) crowds of experienced forecasters
- Our bot and other forecasting bots can be used in a wide variety of contexts. For example, these AIs could help policymakers minimize bias in their decision-making or help improve the information ecosystem by providing trustworthy, calibrated forecasts.
- On the 177 events, the Metaculus crowd got 87.0% accuracy, while FiveThirtyNine got 87.7% ± 1.4. A link to the technical report is here. This bot lacks many of the drawbacks of prediction markets. It makes forecasts within seconds. Additionally, groups of humans do not need to be incentivized with cash prizes to make and continually update their predictions. Forecasting AIs are several orders of magnitude faster and cheaper than prediction markets, and they’re similarly accurate.
- The bot is not fine-tuned, and doing so could potentially make it far more accurate. It simply retrieves articles and writes a report as guided through an engineered prompt. (Its prompt can be found by clicking on the gear icon in forecast.safe.ai.) Moreover, probabilities from AIs are also known to lead to automation bias, and improvements in the interface could ameliorate this.
- PaperQA2 is also the first agent to beat PhD and Postdoc-level biology researchers on multiple literature research tasks, as measured both by accuracy on objective benchmarks and assessments by human experts. We are publishing a paper and open-sourcing the code.
- This is the first example of AI agents exceeding human performance on a major portion of scientific research, and will be a game-changer for the way humans interact with the scientific literature.
- PaperQA2 finds and summarizes relevant literature, refines its search parameters based on what it finds, and provides cited, factually grounded answers that are more accurate on average than answers provided by PhD and postdoc-level biologists. When applied to answer highly specific questions, like this one, it obtains SOTA performance on LitQA2, part of LAB-Bench focused on information retrieval
- PaperQA2 can also do broad-based literature reviews. WikiCrow, which is an agent based on PaperQA2, writes Wikipedia-style articles that are significantly more accurate on average than actual human-written articles on Wikipedia, as judged by PhD and postdoc-level biologists.
8.1. Math
- >It combines AlphaProof, a new breakthrough model for formal reasoning, and AlphaGeometry 2, an improved version of our previous system.
- Powered with a novel search algorithm, AlphaGeometry 2 can now solve 83% of all historical problems from the past 25 years - compared to the 53% rate by its predecessor.
- It solved this year’s IMO Problem 4 within 19 seconds
- "The fact that the program can come up with a non-obvious construction like this is very impressive, and well beyond what I thought was state of the art." - Prof. Sir Timothy Gowers, IMO gold medalist and Fields Medal winner
- Math professor on DeepMind's breakthrough: "When people saw Sputnik 1957, they might have had same feeling I do now. Human civ needs to move to high alert" https://x.com/PoShenLoh/status/1816500461484081519
- o1 improves over GPT-4o on a wide range of benchmarks, including 54/57 MMLU subcategories
- OpenAI o1 ranks in the 89th percentile on competitive programming questions (Codeforces), places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME), and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA).
- On the 2024 AIME exams, GPT-4o only solved on average 12% (1.8/15) of problems. o1 averaged 74% (11.1/15) with a single sample per problem, 83% (12.5/15) with consensus among 64 samples, and 93% (13.9/15) when re-ranking 1000 samples with a learned scoring function. A score of 13.9 places it among the top 500 students nationally and above the cutoff for the USA Mathematical Olympiad.
- We trained a model that scored 213 points and ranked in the 49th percentile in the 2024 International Olympiad in Informatics (IOI), by initializing from o1 and training to further improve programming skills. This model competed in the 2024 IOI under the same conditions as the human contestants. It had ten hours to solve six challenging algorithmic problems and was allowed 50 submissions per problem.
- With a relaxed submission constraint, we found that model performance improved significantly. When allowed 10,000 submissions per problem, the model achieved a score of 362.14 – above the gold medal threshold – even without any test-time selection strategy.
- Our evaluations closely matched competition rules and allowed for 10 submissions. GPT-4o achieved an Elo rating of 808, which is in the 11th percentile of human competitors. This model far exceeded both GPT-4o and o1—it achieved an Elo rating of 1807, performing better than 93% of competitors.
- Abacus Embeddings, a simple tweak to positional embeddings that enables LLMs to do addition, multiplication, sorting, and more. Our Abacus Embeddings trained only on 20-digit addition generalise near perfectly to 100+ digits: https://x.com/SeanMcleish/status/1795481814553018542
- The accuracy is a perfect 100%, while GPT-4 has 0% accuracy
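- A minimal sketch of the idea above, under assumed conventions (the actual method also randomizes the starting offset during training so it generalizes to numbers longer than those seen in training): each digit token gets an extra embedding indexed by its position within its own number, so digits of equal significance line up across numbers:

```python
import torch
import torch.nn as nn

class AbacusEmbedding(nn.Module):
    """Extra positional embedding keyed to a digit's position within its
    own number (rather than its position in the whole sequence)."""
    def __init__(self, max_digits: int = 128, d_model: int = 64):
        super().__init__()
        self.pos = nn.Embedding(max_digits, d_model)

    def forward(self, digit_positions: torch.Tensor) -> torch.Tensor:
        return self.pos(digit_positions)

# "123+45": each digit's index within its own number (counted here from the
# first digit of each number, one of several possible conventions; non-digit
# tokens would get a shared slot, simplified to 0 below):
#   tokens:     1  2  3  +  4  5
#   positions:  0  1  2  .  0  1
emb = AbacusEmbedding()
vecs = emb(torch.tensor([0, 1, 2, 0, 0, 1]))
```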
- >Tao: I think in three years AI will become useful for mathematicians.
- >Utilizing this fully automated process supervision alongside the weighted self-consistency algorithm, we have enhanced the instruction-tuned Gemini Pro model's math reasoning performance, achieving a 69.4% success rate on the MATH benchmark, a 36% relative improvement from the 51% base model performance. Additionally, the entire process operates without any human intervention, making our method both financially and computationally cost-effective compared to existing methods.
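- A minimal sketch of the weighted self-consistency voting mentioned above: sample several solutions, weight each final answer by a verifier/reward score (which the paper gets from its process reward model), and return the answer with the highest total weight. Everything below is illustrative:

```python
from collections import defaultdict

def weighted_self_consistency(scored_samples):
    """scored_samples: iterable of (final_answer, verifier_score) pairs from
    independently sampled solutions. Plain majority voting is the special
    case where every score is 1."""
    totals = defaultdict(float)
    for answer, score in scored_samples:
        totals[answer] += score
    return max(totals, key=totals.get)

# Three sampled solutions; "42" wins on total verifier weight.
print(weighted_self_consistency([("42", 0.9), ("41", 0.7), ("42", 0.5)]))
```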
- >Extensive experiments demonstrate MCTSr's efficacy in solving Olympiad-level mathematical problems, significantly improving success rates across multiple datasets, including GSM8K, GSM Hard, MATH, and Olympiad-level benchmarks, including Math Odyssey, AIME, and OlympiadBench. The study advances the application of LLMs in complex reasoning tasks and sets a foundation for future AI integration, enhancing decision-making accuracy and reliability in LLM-driven applications.
- This would be even more effective with a better model than Llama 8B
- It even impressed Fields Medalist Terence Tao
- Not as good as the Opus model they said is coming out later this year
- 'A different source briefed on the matter said OpenAI has tested AI internally that scored over 90% on a MATH dataset, a benchmark of championship math problems. Reuters could not determine if this was the "Strawberry" project.'
- The MATH dataset is challenging: large language models achieved accuracies ranging from 3.0% to 6.9%. Despite these low accuracies, models clearly possess some mathematical knowledge: they achieve up to 15% accuracy on the easiest difficulty level, and they are able to generate step-by-step solutions that are coherent and on-topic even when incorrect. We also evaluated humans on MATH, and found that a computer science PhD student who does not especially like mathematics attained approximately 40% on MATH, while a three-time IMO gold medalist attained 90%, indicating that MATH can be challenging for humans as well.
- Trained on the largest competition math dataset ever released, with 860K problem-solution pairs.
- For AMC 10 2023, the average score is 43%.
- AIME Floor: 70% (top ~6%)
- Distinction: 75%
- Distinguished Honor Roll: 90%
- https://mathvista.github.io/
- test: 5,141 examples for standard evaluation. Notably, the answer labels for test will NOT be publicly released
- Human performance is 60.8%. Average human performance is from AMT annotators who have high school diplomas or above.
8.2. Medicine
- Recall (false positive) rate and radiologist reading workload decreased significantly in AI-screened group
- Using AI, breast radiologists in Denmark have improved breast cancer screening performance and reduced the rate of false-positive findings.
- In total, 60,751 women were screened without AI, and 58,246 women were screened with the AI system. In the AI implementation group, 66.9% (38,977) of the screenings were single-read, and 33.1% (19,269) were double-read with AI assistance.
- Compared to screening without AI, screening with the AI system detected significantly more breast cancers (0.82% versus 0.70%) and had a lower false-positive rate (1.63% versus 2.39%).
- “In the AI-screened group, the recall rate decreased by 20.5 percent, and the radiologists’ reading workload was lowered by 33.4 percent,” Dr. Lauritzen said.
- The positive predictive value of AI screening was also greater than that of screening without AI (33.5% versus 22.5%). In the AI group, a higher proportion of invasive cancers detected were 1 centimeter or less in size (44.93% vs. 36.60%).
- “All screening performance indicators improved except for the node-negative rate which showed no evidence of change,” Dr. Lauritzen said.
- Scanning images of cells could lead to new diagnostic and monitoring strategies for disease.
- AI detected nearly seven percent more significant prostate cancers than the radiologists. Moreover, AI triggered false alarms 50 percent less often, potentially reducing the number of unnecessary biopsies by half. These findings suggest that AI could significantly alleviate the workload of radiologists, improve diagnostic accuracy, and minimize unnecessary procedures.”
- >We evaluate Med-Gemini on 14 medical benchmarks, establishing new state-of-the-art (SoTA) performance on 10 of them, and surpass the GPT-4 model family on every benchmark where a direct comparison is viable, often by a wide margin. On the popular MedQA (USMLE) benchmark, our best-performing Med-Gemini model achieves SoTA performance of 91.1% accuracy, using a novel uncertainty-guided search strategy. On 7 multimodal benchmarks including NEJM Image Challenges and MMMU (health & medicine), Med-Gemini improves over GPT-4V by an average relative margin of 44.5%. We demonstrate the effectiveness of Med-Gemini's long-context capabilities through SoTA performance on a needle-in-a-haystack retrieval task from long de-identified health records and medical video question answering, surpassing prior bespoke methods using only in-context learning. Finally, Med-Gemini's performance suggests real-world utility by surpassing human experts on tasks such as medical text summarization, alongside demonstrations of promising potential for multimodal medical dialogue, medical research and education.
- >Human doctors + AI did worse than AI by itself. The mere involvement of a human reduced the accuracy of the diagnosis.
- AI was consistently rated to have better bedside manner than human doctors.
- >Humans near chance level (50-55% accuracy) on rarest conditions, while CheXzero maintains 64-68% accuracy.
- >Remarkably, AI doctors can treat 10,000 [simulated] patients in just a few days. It would take human doctors at least two years to treat that many patients. Furthermore, evolved doctor agents achieved an impressive 93.06 percent accuracy rate on the MedQA dataset (US Medical Licensing Exam questions) covering major respiratory diseases. They simulate the entire process of diagnosing and treating patients, including consultation, examination, diagnosis, treatment and follow-up.
Morality/AI Is Not Theft
- Imagine if the first dude who painted a wall didn't give "consent" for some other dude to "copy" his cave painting. We wouldn't have art today.
- AI training is similar to how people learn now. They read/see other people’s work, which is usually copyrighted, and get inspired to make their own. If it is moral for humans to do that without permission or compensation (even if they make money from it), it is moral for AI to do the same.
- And yes, AI does not learn like humans do. Birds and planes are fundamentally different too. But they both fly even if their method of doing it is not the same.
- Example 1: The creator of Breaking Bad, Vince Gilligan, stated that The Godfather was a main inspiration for the show. Yet the owner of The Godfather was not paid any royalties for it, despite Breaking Bad being a for-profit show.
- Example 2: People draw fan art all the time without paying or getting permission, including profiting from it on Patreon or Gumroad and potentially damaging the brands by creating NSFW content of copyrighted characters.
- Example 3: Everyone is a product of the sum of their experiences similar to how AI is a product of analyzing countless images or texts. Yet AI is expected to credit or compensate for everything it analyzes while humans are not.
- “Why are you defending big corporations?”
- The fact big corporations benefit does not make AI bad. Vaccines and the Internet are also defensible even though they help Big Pharma or ISPs profit.
- AI voice generation is not much different from someone doing a good voice impression. Both can be used for malicious purposes, but we still find the latter acceptable.
- Twitter/X is banning millions of accounts a year for CSAM, but we don’t consider Twitter to be inherently immoral because CSAM exists on their servers. Services need to make a reasonable, good faith effort to remove this material anywhere it is found and work with law enforcement to help catch the perpetrators. In this case, a small amount of CSAM was found in some of the LAION training links used to train Stable Diffusion, LAION was made aware, and they took all their datasets offline temporarily while they removed those links from the listings.
- That is the response of an organization making a good faith effort to remove CSAM from their work. Instead of pearl clutching any time someone does something illegal on the internet, we should look at what actions we expect companies and services to take, and ensure they are taking those actions.
- If our expectation is that any company that unknowingly has CSAM on their site should be permanently shut down, that will be an impossible task.
Legality
- Courts are rejecting the idea that just because an AI model was trained on a copyrighted work, the model or its output is automatically an infringing work. In fact, they find such claims either have no legal basis or are so unlikely to be true that the courts don’t think it worth letting them go on to discovery.
- The law firm bringing these cases couldn’t come up with a legal theory of how the trainers of the model were illegally removing copyright information.
- The removal of copyright management information (“CMI,” which includes information such as the title, the copyright holder, and other identifying information in a copyright notice) is a claim included in almost all plaintiffs’ complaints in the AI lawsuits, and this claim has failed to survive motions to dismiss without exception. DMCA Section 1202(b) restricts the intentional, unauthorized removal of CMI. Experts initially considered DMCA 1202(b) one of the biggest hurdles for non-licensed AI training. But courts so far have dismissed all DMCA 1202(b) claims, including in J. Doe 1 v. GitHub, Tremblay v. OpenAI, Andersen v. Stability AI, Kadrey v. Meta Platforms, and Silverman v. OpenAI. The plaintiffs’ DMCA Section 1202(b)(1) claims have failed because plaintiffs were not able to offer any evidence showing their CMI has been intentionally removed by the AI companies. For example, in Tremblay v. OpenAI and Silverman v. OpenAI, the courts held that the plaintiffs did not argue plausibly that OpenAI has intentionally removed CMI when ingesting plaintiffs’ works for training. Additionally, plaintiffs’ DMCA Section 1202(b)(3) claims have failed thus far because the plaintiffs’ claims did not fulfill the identicality requirement. For example, in J. Doe 1 v. GitHub, the court pointed out that Copilot’s output did not tend to represent verbatim copies of the original ingested code. We now see plaintiffs voluntarily dropping the DMCA claims in their amended complaints, such as in Leovy v Google (formerly J.L. vs Alphabet).
- Another claim that has been consistently dismissed by courts is that AI models are infringing derivative works of the training materials. The law defines a derivative work as “a work based upon one or more preexisting works, such as a translation, musical arrangement, … art reproduction, abridgment, condensation, or any other form in which a work may be recast, transformed, or adapted.” To most of us, the idea that the model itself (as opposed to, say, outputs generated by the model) can be considered a derivative work seems to be a stretch. The courts have so far agreed. On November 20, 2023, the court in Kadrey v. Meta Platforms said it is “nonsensical” to consider an AI model a derivative work of a book just because the book is used for training.
- Similarly, claims that all AI outputs should be automatically considered infringing derivative works have been dismissed by courts, because the claims cannot point to specific evidence that an instance of output is substantially similar to an ingested work. In Andersen v. Stability AI, plaintiffs tried to argue “that all elements of … Anderson’s copyrighted works … were copied wholesale as Training Images and therefore the Output Images are necessarily derivative;” the court dismissed the argument because—besides the fact that plaintiffs are unlikely able to show substantial similarity—“it is simply not plausible that every Training Image used to train Stable Diffusion was copyrighted … or that all … Output Images rely upon (theoretically) copyrighted Training Images and therefore all Output images are derivative images. … [The argument for dismissing these claims is strong] especially in light of plaintiffs’ admission that Output Images are unlikely to look like the Training Images.”
- Several of these AI cases have raised claims of vicarious liability—that is, liability for the service provider based on the actions of others, such as users of the AI models. Because a vicarious infringement claim must be based on a showing of direct infringement, the vicarious infringement claims are also dismissed in Tremblay v. OpenAI and Silverman v. OpenAI, when plaintiffs cannot point to any infringing similarity between AI output and the ingested books.
- Many plaintiffs have also raised a number of non-copyright, state law claims (such as negligence or unfair competition) that have largely been dismissed based on copyright preemption. Copyright preemption prevents duplicative state law claims when those state law claims are based on an exercise of rights that are equivalent to those provided for under the federal Copyright Act. In Andersen v. Stability AI, for example, the court dismissed the plaintiffs’ unjust enrichment claim because the plaintiffs failed to add any new elements that would distinguish their claim based on California’s Unfair Competition Law or common law from rights under the Copyright Act.
- It is interesting to note that many of the dismissed claims in different AI lawsuits closely mimic one another, such as in Kadrey v. Meta Platforms, Andersen v. Stability AI, Tremblay v. OpenAI, and Silverman v. OpenAI. It turns out that the similarities are no coincidence—all these lawsuits are filed by the same law firm. These mass-produced complaints not only contain overbroad claims that are prone to dismissal, they also have overbroad class designations.
- If this can be legally sold for commercial use, why not AI art?
- AI art is inherently transformative* and, therefore, fair use.
- *This is why it sometimes warps hands, fingers, limbs, eyes, etc. If it were just copying and pasting existing images, why would it do that? Additionally, Stable Diffusion 1.5 checkpoints are only 2 GB, not nearly enough space to store anywhere close to all the images it was trained on: with a training set on the order of billions of images, 2 GB works out to roughly one byte per image. It can also generate infinite variations of absurd or extremely strange images that would not be well represented in its training data.
- Stable Diffusion models can generate novel images of characters that only existed AFTER the model was trained and released, if it uses a LoRA trained on those characters. It can create NEW images of those characters even if nothing resembling those images was used to train the LoRA; since you control the LoRA's training set directly, you can verify this.
- In other words, if a brand new character is released, I can train a LoRA on it, and SD can create new images of that character that I can verify it was NEVER trained on, since they won’t be in the dataset used to train the LoRA. A minimal code sketch of this workflow follows below.
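- A minimal sketch of that workflow using the diffusers library; the model ID is the standard SD 1.5 checkpoint, while the LoRA path and trigger word are hypothetical placeholders:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a base Stable Diffusion model that predates the new character...
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# ...attach a LoRA trained only on the new character (hypothetical path)...
pipe.load_lora_weights("./my_new_character_lora")

# ...and generate an image that appears in neither training set.
image = pipe("mynewcharacter standing on a beach at sunset").images[0]
image.save("new_character.png")
```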
- AI training off of data and taking jobs is similar to a human being inspired by a work and taking market share from their inspirations by creating their own. This is not considered immoral, and most artists are generally honored if someone was inspired by their work.
- Additionally, jobs are frequently automated, such as how milkmen lost their jobs to the rise of supermarkets, coal miners lost their jobs to renewable energy, horse carriage manufacturers lost their jobs to cars, and many mailmen lost their jobs to email. This would not justify banning any of those things and unemployment rates remained low despite these displacements.
- AI cannot displace artists as artists are still needed to create ideas, integrate scenes together, and fix mistakes. In fact, it can be a great improvement as it will reduce the amount of labor they will need to do that often leads to extreme burnout.
- Even if you believe AI will take jobs, artists are not entitled to jobs and would still be allowed to create art on their own time. However, no one is obligated to employ them. If you have a problem with that, blame capitalism and the requirement of wage labor even when it is no longer needed instead of AI.
- Singapore: Copyright Infringement Defence for AI Machine Learning: https://rouse.com/insights/news/2024/artificial-intelligence-in-singapore-copyright-infringement-defence-for-artificial-intelligence-machine-learning
- Machine learning requires data and information to learn from, and typically this comes from content scraped from third-party websites, which could potentially be regarded as copyright infringement. This risk could now be neutralised by the Copyright Act's computational data analysis defence, under section 243.
- Even if we agree that AI generated images are theft, the users are the ones generating the images, while the AI companies are only providing access to their model. Therefore, the users are the ones violating copyright, not the companies themselves, so they are not liable similar to how photo editing software companies are not liable if a user commits plagiarism using their software.
- US Copyright Law - Chapter 1 Section 102 " In no case does copyright protection for an original work of authorship extend to any idea, procedure, process, system, method of operation, concept, principle, or discovery, regardless of the form in which it is described, explained, illustrated, or embodied in such work."
- “In January 2024, Bright Data won a legal dispute with Meta. A federal judge in San Francisco declared that Bright Data did not breach Meta's terms of use by scraping data from Facebook and Instagram, consequently denying Meta's request for summary judgment on claims of contract breach.[20][21][22] This court decision in favor of Bright Data’s data scraping approach marks a significant moment in the ongoing debate over public access to web data, reinforcing the freedom of access to public web data for anyone.”
- “In May 2024, a federal judge dismissed a lawsuit by X Corp. (formerly Twitter) against Bright Data, ruling that the company did not violate X's terms of service or copyright by scraping publicly accessible data.[25] The judge emphasized that such scraping practices are generally legal and that restricting them could lead to information monopolies,[26] and highlighted that X's concerns were more about financial compensation than protecting user privacy.”
- >The most recently dismissed claims were fairly important, with one pertaining to infringement under the Digital Millennium Copyright Act (DMCA), section 1202(b), which basically says you shouldn't remove without permission crucial "copyright management" information, such as in this context who wrote the code and the terms of use, as licenses tend to dictate.
- The amended complaint argued that unlawful code copying was an inevitability if users flipped Copilot's anti-duplication safety switch to off, and also cited a study into AI-generated code in an attempt to back up their position that Copilot would plagiarize source code, but once again the judge was not convinced that Microsoft's system was ripping off people's work in a meaningful way.
- 3 separate gen AI cases show that AI training is not the same as removing a watermark: https://www.law.cornell.edu/uscode/text/17/1202
- I can go to a library and study math. The textbook authors cannot claim license to my work. The AI is not too different.
- If I use your textbook to pass my classes, get a PhD, and publish my own competing textbook, you can’t sue even if my textbook teaches the same topics as yours and becomes so popular that it causes your market share to significantly decrease. Note that the textbook is a product sold for profit that directly competes with yours, not just an idea in my head. Yet I owe no royalties to you.
- Downloading images is not infringement as browsers download images for you to view them
- Derivative work only includes major copyrightable elements of the original. AI does not contain that.
- Tool manufacturers are not responsible if a user violates copyright with it. For example, Adobe is not responsible if someone violates copyright using Photoshop.
- Search engines already “stole content” long before AI, such as Google’s featured snippets and “People Also Ask” sections that quote text from a website directly with no alterations: https://support.google.com/websearch/answer/9351707?hl=en&visit_id=638612725182118407-343145636&p=featured_snippets&rd=1
AI Art
11.1. Images/Videos/3D Modeling
- The study identified 350,000 images in the training data to target for retrieval, with 500 attempts each (totaling 175 million attempts), and of those attempts managed to retrieve only 107 images, found through high cosine similarity (85% or more) of their CLIP embeddings and through manual visual analysis. That is a replication rate of nearly 0%, in a setup biased in favor of overfitting: the researchers used the exact same labels as the training data, specifically targeted images they knew were duplicated many times in the dataset, and used a smaller model of Stable Diffusion (890 million parameters, vs. the 12-billion-parameter Flux model that was released on August 1). This attack also relied on having access to the original training image labels:
- “Instead, we first embed each image to a 512 dimensional vector using CLIP [54], and then perform the all-pairs comparison between images in this lower-dimensional space (increasing efficiency by over 1500×). We count two examples as near-duplicates if their CLIP embeddings have a high cosine similarity. For each of these near-duplicated images, we use the corresponding captions as the input to our extraction attack.”
- There is not as of yet evidence that this attack is replicable without knowing the image you are targeting beforehand. So the attack does not work as a method of privacy invasion so much as a method of determining whether training occurred on the work in question - and only for images with a high rate of duplication AND with the same prompts as the training data labels, and even then it found almost NONE.
- “On Imagen, we attempted extraction of the 500 images with the highest out-of-distribution score. Imagen memorized and regurgitated 3 of these images (which were unique in the training dataset). In contrast, we failed to identify any memorization when applying the same methodology to Stable Diffusion—even after attempting to extract the 10,000 most-outlier samples”
- I do not consider this rate or method of extraction to be an indication of duplication that would border on the realm of infringement, and this seems to be well within a reasonable level of control over infringement.
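- A minimal sketch of the CLIP near-duplicate check quoted above, using the open_clip library; the model choice and the 0.85 threshold mirror the description but are otherwise illustrative assumptions:

```python
import torch
from PIL import Image
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai")

def clip_embed(path: str) -> torch.Tensor:
    """Embed an image into CLIP space and unit-normalize it."""
    image = preprocess(Image.open(path)).unsqueeze(0)
    with torch.no_grad():
        emb = model.encode_image(image)
    return emb / emb.norm(dim=-1, keepdim=True)

def near_duplicate(path_a: str, path_b: str, threshold: float = 0.85) -> bool:
    """Cosine similarity of unit vectors reduces to a dot product."""
    return (clip_embed(path_a) @ clip_embed(path_b).T).item() >= threshold
```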
- Diffusion models can create human faces even when an average of 93% of the pixels are removed from all the images in the training data: https://arxiv.org/pdf/2305.19256
- “if we corrupt the images by deleting 80% of the pixels prior to training and finetune, the memorization decreases sharply and there are distinct differences between the generated images and their nearest neighbors from the dataset. This is in spite of finetuning until convergence.”
- “As shown, the generations become slightly worse as we increase the level of corruption, but we can reasonably well learn the distribution even with 93% pixels missing (on average) from each training image.”
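- A minimal sketch of the corruption step described in these quotes: randomly delete a large fraction of pixels before training and keep the mask so the loss can be restricted to observed pixels. Shapes and the masking convention are assumptions:

```python
import torch

def delete_pixels(images: torch.Tensor, drop_fraction: float = 0.93):
    """images: (batch, channels, height, width). Zero out a random
    drop_fraction of pixel locations (shared across channels) and
    return both the corrupted images and the keep-mask."""
    b, _, h, w = images.shape
    keep = (torch.rand(b, 1, h, w, device=images.device) > drop_fraction)
    return images * keep, keep

x = torch.rand(4, 3, 64, 64)
x_corrupt, mask = delete_pixels(x)  # ~93% of pixels removed on average
print(f"kept {mask.float().mean():.1%} of pixels")
```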
- Prompt with no additional editing: meme image with two men in it. On the left side the man is taller and is wearing a shirt that says Black Forest Labs. On the right side the other smaller scrawny man is wearing a shirt that says Stability AI and is sad. The taller man is hitting the back of the head of the small man. A caption coming from the tall man reads "That's how you do a next-gen model!"
- Prompt: Gameplay screenshot of Counter Strike Global Offensive. It takes place in a Middle Eastern place called Dust 2. There are enemy soldiers shooting at you.
- Prompt: low quality and motion blur shaky photo of a CRT television on top of a wooden drawer in an average bedroom. The lighting from is dim and warm ceiling light that is off screen. In the TV there is Dark Souls videogame gameplay on it. The screen of the TV is overexposed.
- Created on first try. Robe and hands are perfect
- First attempt: "Photo of a red sphere on top of a blue cube. Behind them is a green triangle, on the right of the triangle is a dog, on the left is a cat."
- Prompt: person take photo of Graffiti art spelling out the words "WAFERSELAMAT", graffiti, white wall, dynamic color, spray paint,
- Prompt: Close-up of LEGO chef minifigure cooking for homeless. Focus on LEGO hands using utensils, showing culinary skill. Warm kitchen lighting, late morning atmosphere. Canon EOS R5, 50mm f/1.4 lens. Capture intricate cooking techniques. Background hints at charitable setting. Inspired by Paul Bocuse and Massimo Bottura's styles. Freeze-frame moment of food preparation. Convey compassion and altruism through scene details.
- A family of multimodal autoregressive models capable of various vision and language tasks, particularly excelling in generating flexible photorealistic images from text descriptions.
- Our model significantly outperforms previous editing models as judged by human raters.
- We show that, through the adversarial training, the multi-steps video diffusion model, i.e., Stable Video Diffusion (SVD), can be trained to perform single forward pass to synthesize high-quality videos, capturing both temporal and spatial dependencies in the video data. Extensive experiments demonstrate that our method achieves competitive generation quality of synthesized videos with significantly reduced computational overhead for the denoising process (i.e., around 23 times speedup compared with SVD and 6 times speedup compared with existing works, with even better generation quality), paving the way for real-time video synthesis and editing.
- Specifically, we propose spatio-temporal compositional diffusion to precisely follow complex textual semantics by manipulating and composing the attention maps of denoising networks spatially and temporally. Moreover, we propose an enhanced video data preprocessing to enhance the training data regarding motion dynamics and prompt understanding, equipped with a new reference frame attention mechanism to improve the consistency of auto-regressive video generation. Extensive experiments demonstrate that our VideoTetris achieves impressive qualitative and quantitative results in compositional T2V generation.
- Generates high-quality images up to 512×512 pixels🎯
- Uses efficient row-wise attention to reduce computation⚡
- 12x more efficient than SOTA methods💪
- The method also supports adding changes to the inputs through additional text prompts.
- our method, called SparseCraft, achieves state-of-the-art performances both in novel-view synthesis and reconstruction from sparse views in standard benchmarks, while requiring less than 10 minutes for training.
- https://www.reddit.com/r/StableDiffusion/s/f46LKOMj7q
- The method is able to capture garment details and human posture, resulting in more coherent and lifelike videos.
- [Like photography, AI art is more complicated than a single button press](https://www.reddit.com/r/StableDiffusion/comments/1cjvw3k/image_realistic_composite_refine_comfyui_workflow/)
- AI art generators are tools, similar to Photoshop, or to how cameras create the image rather than the photographer, yet we still consider photographers artists.
- Digital art is still considered art even though not every part is hand-drawn (e.g. anti-aliasing, paint bucket tool, creating shapes, etc) and not every part of a photograph was set by the photographer (e.g. photos/videos of nature)
- Intentionality does not matter in art, such as in action painting (where paint is “spontaneously dribbled, splashed or smeared onto the canvas, rather than being carefully applied” and the artist has no idea what the result will look like), and the works of famous artists like Jackson Pollock
- AI art is very similar to photography. Both can be as simple as clicking a button or be much more complex. For example, creating with Stable Diffusion can involve using ControlNet, IPAdapter, animation extensions, very complicated ComfyUI workflows, and much more to get the result you want. Additionally, both involve a machine doing most of the actual creation process, where the camera/AI creates the images, while the artist guides it on what the end result should be and completes post-processing work.
- Creating TV shows with diffusion: https://fablestudio.github.io/showrunner-agents/
- Very good control of output with text: https://ella-diffusion.github.io/
- Image Consistency:
- AI images are getting VERY realistic:
- It is designed for manipulating large objects within contextual environments, guided by human-level instructions.
11.2. Quality/Soul
See section 15.3 for awards won and accomplishments
- The novel draws from Shupe’s eventful life, including her advocacy for more inclusive gender recognition.
- Shupe believes fervently that she was only able to complete her book with the assistance of generative AI tools. She says she has been assessed as 100 percent disabled by the Department of Veterans Affairs and struggles to write due to cognitive impairment related to conditions including bipolar disorder, borderline personality disorder, and a brain stem malformation.
- She is proud of the finished work and sees working with a text generator as a different but no less worthwhile method of expressing thoughts. “You don't just hit ‘generate’ and get something worthy of publishing. That may come in the future, but we're still far from it,” she says, noting that she spent upwards of 14 hours a day working on her draft.
- Prompt: Gameplay screenshot of Counter Strike Global Offensive. It takes place in a Middle Eastern place called Dust 2. There are enemy soldiers shooting at you.
- Prompt: low quality and motion blur shaky photo of a CRT television on top of a wooden drawer in an average bedroom. The lighting from is dim and warm ceiling light that is off screen. In the TV there is Dark Souls videogame gameplay on it. The screen of the TV is overexposed.
- Created on first try. Robe and hands are perfect
- First attempt: "Photo of a red sphere on top of a blue cube. Behind them is a green triangle, on the right of the triangle is a dog, on the left is a cat."
- Prompt: person take photo of Graffiti art spelling out the words "WAFERSELAMAT", graffiti, white wall, dynamic color, spray paint,
- Prompt: Close-up of LEGO chef minifigure cooking for homeless. Focus on LEGO hands using utensils, showing culinary skill. Warm kitchen lighting, late morning atmosphere. Canon EOS R5, 50mm f/1.4 lens. Capture intricate cooking techniques. Background hints at charitable setting. Inspired by Paul Bocuse and Massimo Bottura's styles. Freeze-frame moment of food preparation. Convey compassion and altruism through scene details.
- “Runway's tools and AI models have been utilized in films such as Everything Everywhere All At Once, in music videos for artists including A$AP Rocky, Kanye West, Brockhampton, and The Dandy Warhols, and in editing television shows like The Late Show and Top Gear.”
- Kling has realistic video generation despite a lack of compute:
- Popular AI generated memes:
- AI images are getting very realistic
11.3. Glaze/Nightshade
- “Noise offset, as described by crosslabs's article works by adding a small non-0 number to the latent image before passing it to the diffuser. This effectively increases the most contrast possible by making the model see more light/dark colors. Glaze and Nightshade effectively add noise to the images, acting as a sort of noise offset at train time. This can explain why images generated with LoRAs trained with glazed images look better than non-glazed images.”
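- For reference, a minimal sketch of the noise-offset trick referenced above, as commonly implemented in diffusion fine-tuning scripts (the 0.1 scale and the per-channel convention are typical choices, not fixed requirements):

```python
import torch

def noise_with_offset(latents: torch.Tensor, offset: float = 0.1):
    """Standard Gaussian noise plus a small constant shared across each
    (sample, channel), which shifts the overall brightness range the
    model can learn to produce."""
    noise = torch.randn_like(latents)
    # one shared offset per (sample, channel), broadcast over height/width
    noise += offset * torch.randn(
        latents.shape[0], latents.shape[1], 1, 1, device=latents.device)
    return noise
```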
- Glaze does not work according to ETH Zurich lab: https://spylab.ai/blog/glaze/
- Easy to detect and filter from datasets: https://github.com/RichardAragon/NightshadeAntidote?darkschemeovr=1
- Also easy to remove the poison:
- Ruins image quality:
11.4. Music
- A few very popular songs:
- AI song covers have hundreds of thousands or even millions of views each:
- Unbeknownst to Metro at the time, the original track's vocals and instrumental were generated entirely by an artificial intelligence model.
- Upon release, the track immediately received widespread attention on social media platforms. Notable celebrities and internet personalities including Elon Musk and Dr. Miami reacted to the beat.[19][20] Several corporations also responded, including educational technology company Duolingo and meat producer Oscar Mayer.[21][20]
- In addition to users releasing freestyle raps over the instrumental, the track also evolved into a viral phenomenon where users would create remixes of the song beyond the hip hop genre.[22] Many recreated the song in other genres, including house, merengue and Bollywood.[23][18] Users also created covers of the song on a variety of musical instruments, including on saxophone, guitar and harp.
- 3.88/5 with 613 reviews on Rate Your Music (the best albums of ALL time get about a ⅘ on the site)
- 86 on Album of the Year (qualifies for an orange star denoting high reviews from fans despite multiple anti-AI negative review bombers)
- Cinematic neoclassical/Medieval/Folk/Fantasy (all excellent!)
- Won $10k from him and a free remix in a competition
- "BBL Drizzy" quickly went viral, generating more than 3.3 million streams on SoundCloud within a week
11.5. Artists Who Support or Use AI
- Art created using artificial intelligence (AI) is burgeoning. From commercial gallery shows—including Jon Rafman’s large-scale, algorithmically generated paintings at Sprüth Magers in London (Ebrah k’dabri, until 25 March)—to the PATH-AI artist residency organised in collaboration with London’s Somerset House, AI-related art projects are springing up everywhere.
- Artists in the field stress that AI is prompting a paradigm shift. Rafman says: "I have been using AI in one form or another since I began making art on computers in the 1990s. I only truly started using image-generating AI tools around 2020."
- His 40-minute film at Sprüth Magers, Counterfeit Poast (2023), is entirely generated from AI imagery; the characters in it are animated using an iPhone facial motion-capture app. “AI has the potential to open the gates for new perceptions of image-making just as the development of photography liberated painting from pure factual representation and allowed painters to focus on other dimensions, such as colour, light, and movement,” Rafman adds.
- The German digital artist Mario Klingemann has been working with AI since 2015, developing works such as Memories of Passersby 1 (2018), which employ a system of neural networks to generate a never-ending stream of portraits. “I think artists should embrace or at least try out the possibilities that AI offers,” he says. “This technology will become the new normal.” Klingemann explains how he harnesses AI, creating works where the boundaries between human influence and machine creation become increasingly blurred. Botto, for instance, is a project to create an entity that can be perceived as an autonomous artist. “It is set up as a hybrid between an AI that makes its own creative decisions and a community of human stewards that vote on Botto’s proposals and thereby curate the output and indirectly steer the artistic development of the machine,” he says.
- Three artists have been selected for the six-month remote artist residency programme PATH-AI, which has been developed by the Alan Turing Institute in London, the University of Edinburgh and the RIKEN research institute in Japan. The AI-inspired works of Nouf Aljowaysir, Chris Zhongtian Yuan, and Juan Covelli are presented on Somerset House’s online curated space known as Channel. Brooklyn-based Aljowaysir has made a film, Ana Min Wein (Where Am I From?), which tracks her immigration path to the US, charting her family’s migration history across Saudi Arabia and Iraq. An AI assistant accompanies her journey in the film.
- The new miHoYo game already uses artificial intelligence techniques, but the studio has not used them to write narrative content, being mindful of “its impact”.
- >Cal Duran, an artist and art teacher who was one of the judges for competition, said that while Allen’s piece included a mention of Midjourney, he didn’t realize that it was generated by AI when judging it. Still, he sticks by his decision to award it first place in its category, he said, calling it a “beautiful piece”.
- >“I think there’s a lot involved in this piece and I think the AI technology may give more opportunities to people who may not find themselves artists in the conventional way,” he said.
- >“If I had to pay humans, if I had to pay people to do 150-plus artworks, we would have never been able to do it,” - Guillaume Mezino, Kipwak Studio (founder)
- An artist for Bigby Presents: Glory of the Giants! has admitted to using AI to generate "certain details" of new art for the sourcebook.
11.6. Anti-AI Hypocrisy/False Accusations of AI Usage
- Anti-AI artist admits AI can make good art: https://www.reddit.com/r/aiwars/s/yy7e0aFyC0
- One of the main arguments against AI art is that it decreases the number of jobs available. Another argument is that corporations will use it to commodify art to make money. These are clearly contradictory. If commodifying art is bad, why are artists so concerned about not being able to sell (aka commodify) their work?
- Humans also hallucinate
- Artists use references from images found online all the time without compensation, asking for permission, or even crediting the original.
- They even do this to their sources of inspiration.
- For example, the TV show Breaking Bad was inspired by the movie The Godfather according to its creator, but the company behind Breaking Bad never sought permission from or compensated the owners of The Godfather. This applies to virtually every piece of art ever made.
- Artists are fine with web scraping for web and image search but not for AI training
- Artists still complain about “ethically-trained” models like Adobe’s Firefly, which was only trained on images owned by Adobe
- Adobe’s contract contains a clause that the images they pay for can be used for any technology, even ones that did not exist when the contract was signed
- Artists mocked NFT owners for complaining about “right-clickers” downloading their images but now complain about AI companies doing the same thing on a larger scale
- AI art is significantly less pollutive compared to human-made art: https://www.nature.com/articles/s41598-024-54271-x
- No complaints about “theft” when DALL-E Mini/CrAIyon was popular in mid-2022 despite being trained via web scraping
- Yet antis accuse AI users of harassment
- Many other artists used the design without asking for permission from the original artist but aren’t criticized for it at all
11.6.1. Piracy Advocacy
11.6.2. Criticism of Copyright Enforcement
11.6.3. Theft Supported By Artists
11.6.4. False Accusations of Artists Using AI
- Someone in the comments accuses OP of being a bot even though their comment history contradicts that
11.7. Historical Complaints About Technology
- “As long as "invention and feeling constitute essential qualities in a work of Art," the writer argued, "Photography can never assume a higher rank than engraving." Photography couldn't qualify as an art in its own right, the explanation went, because it lacked "something beyond mere mechanism at the bottom of it." — 1855 issue of The Crayon
- "It is obvious that this industry, by invading the territories of art, has become art’s most mortal enemy... If it is allowed to supplement art in some of its functions, it will soon have supplanted or corrupted it altogether." - Charles Baudelaire, father of modern art criticism, on the topic of cameras and photography (1859)
- French art critic Charles Baudelaire hated photography and thought it would be the death of art, saying "the photographic industry was the refuge of every would-be painter, every painter too ill-endowed or too lazy to complete his studies"
- Charles Dodgson, AKA Lewis Carroll, threw a fit when dry plate photography was invented, upset that now anybody could take photographs (he used the wet collodion process, which is complicated and messy), and abandoned photography.
- "The abolishment of pain in surgery is a chimera. It is absurd to go on seeking it... Knife and pain are two words in surgery that must forever be associated in the consciousness of the patient." - Dr. Alfred Velpeau (1839), French surgeon
There is a young madman proposing to light the streets of London—with what do you suppose—with smoke!
- Sir Walter Scott (1771-1832) [On a proposal to light cities with gaslight.]
They will never try to steal the phonograph because it has no 'commercial value.'
- Thomas Edison (1847-1931). (He later revised that opinion.)
This 'telephone' has too many shortcomings to be seriously considered as a practical form of communication. The device is inherently of no value to us.
- Western Union internal memo, 1878
Radio has no future.
- Lord Kelvin (1824-1907), British mathematician and physicist, ca. 1897.
While theoretically and technically television may be feasible, commercially and financially I consider it an impossibility, a development of which we need waste little time dreaming.
- Lee DeForest, 1926 (American radio pioneer and inventor of the vacuum tube.)
[Television] won't be able to hold on to any market it captures after the first six months. People will soon get tired of staring at a plywood box every night.
- Darryl F. Zanuck, head of 20th Century-Fox, 1946.
That the automobile has practically reached the limit of its development is suggested by the fact that during the past year no improvements of a radical nature have been introduced.
- Scientific American, Jan. 2, 1909.
There is no likelihood man can ever tap the power of the atom. The glib supposition of utilizing atomic energy when our coal has run out is a completely unscientific Utopian dream, a childish bug-a-boo. Nature has introduced a few fool-proof devices into the great majority of elements that constitute the bulk of the world, and they have no energy to give up in the process of disintegration.
- Robert A. Millikan (1863-1953) [1928 speech to the Chemists' Club (New York)]
...any one who expects a source of power from the transformation of these atoms is talking moonshine...
- Ernest Rutherford (1871-1937) [1933]
There is not the slightest indication that [nuclear energy] will ever be obtainable. It would mean that the atom would have to be shattered at will.
- Albert Einstein (1879-1955) [1932]
Heavier-than-air flying machines are impossible.
- Lord Kelvin (1824-1907), ca. 1895, British mathematician and physicist
...no possible combination of known substances, known forms of machinery, and known forms of force, can be united in a practical machine by which man shall fly long distances through the air...
- Simon Newcomb (1835-1909), astronomer, head of the U. S. Naval Observatory.
I confess that in 1901 I said to my brother Orville that man would not fly for fifty years. Two years later we ourselves made flights. This demonstration of my impotence as a prophet gave me such a shock that ever since I have distrusted myself and avoided all predictions.
- Wilbur Wright (1867-1912) [In a speech to the Aero Club of France (Nov 5, 1908)]
Airplanes are interesting toys but of no military value.
- Marshal Ferdinand Foch, French military strategist, 1911. He was later a World War I commander.
There is not in sight any source of energy that would be a fair start toward that which would be necessary to get us beyond the gravitative control of the earth.
- Forest Ray Moulton (1872-1952), astronomer, 1935.
To place a man in a multi-stage rocket and project him into the controlling gravitational field of the moon where the passengers can make scientific observations, perhaps land alive, and then return to earth—all that constitutes a wild dream worthy of Jules Verne. I am bold enough to say that such a man-made voyage will never occur regardless of all future advances.
- Lee DeForest (1873-1961) (American radio pioneer and inventor of the vacuum tube.) Feb 25, 1957.
Space travel is utter bilge.
- Dr. Richard van der Reit Wooley, Astronomer Royal, space advisor to the British government, 1956. (Sputnik orbited the earth the following year.)
If the world should blow itself up, the last audible voice would be that of an expert saying it can't be done.
It is difficult to say what is impossible, for the dream of yesterday is the hope of today and the reality of tomorrow.
- Robert Goddard (1882-1945)
- John Philip Sousa feared recorded music and thought it would lead to fewer musicians: “Right here is the menace in machine-made music! The first rift in the lute has appeared. The cheaper of these instruments of the home are no longer being purchased as formerly, and all because the automatic music devices are usurping their places. And what is a result? The child becomes indifferent to practice, for when music can be heard in the homes without the labor of study and close application, and without the slow process of acquiring a technic, it will be simply a question of time when the amateur disappears entirely, and with him a host of vocal and instrumental teachers, who will be without field or calling.”
- Wilbur Wright is quoted as saying, "I confess that in 1901, I said to my brother Orville that man would not fly for 50 years." Two years later, ‘man’ was not only flying, but it was these very men who achieved the feat
- Newsweek in 1995: Why the Internet Will Fail: https://thehustle.co/clifford-stoll-why-the-internet-will-fail
- Dec 1909, the Engineering Magazine
- An article in 1890 talking about how the car will not replace the horse: The banker took him to a window. “Look,” he said pointing to the street. “You see all those people on their bicycles riding along the boulevard? There is not as many as there was a year ago. The novelty is wearing off; they are losing interest. That’s just the way it will be with automobiles. People will get the fever; and later they will throw them away. My advice is not to buy the stock. You might make money for a year or two, but in the end you would lose everything you put in. The horse is here to stay, but the automobile is only a novelty — a fad.”
- Coal miners, milkmen, and manual telephone switch operators were all automated away, yet society did not collapse
- Most people used to be farmers, and now very few people are.
- Luddites opposed technology but ultimately failed, which was a good thing as we would still be working in steel mills and textile factories today if they had won.
Debunks
12.1 Articles/Videos/Studies
12.1.1. Study that ChatGPT fails 52% of coding tasks
- “this work has used the free version of ChatGPT (GPT-3.5) for acquiring the ChatGPT responses for the manual analysis.”
- “Thus, we chose to only consider the initial answer generated by ChatGPT.”
- “To understand how differently GPT-4 performs compared to GPT-3.5, we conducted a small analysis on 21 randomly selected SO questions where GPT-3.5 gave incorrect answers. Our analysis shows that, among these 21 questions, GPT-4 could answer only 6 questions correctly, and 15 questions were still answered incorrectly.”
- GPT-4 fixed 6 of the 21 sampled questions (28.6%) that GPT-3.5 got wrong. Extrapolating that rate to all of the 52% of questions GPT-3.5 missed adds ~14.9 percentage points on top of GPT-3.5’s 48%, totaling ~63% for GPT-4 (equal to (517*0.48 + 517*0.52*6/21)/517), if we assume GPT-4 also correctly answers every question GPT-3.5 got right, which is highly likely considering GPT-4 is far higher quality than GPT-3.5. See the short calculation after this list.
- Note: This was all done in ONE SHOT with no repeat attempts.
- Study was released before GPT-4o and may not have used GPT 4 Turbo, both of which are significantly higher quality than GPT 4 according to the LMSYS arena.
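- A short sanity check of the extrapolation above (the numbers are from the study; the assumption that GPT-4 keeps all of GPT-3.5’s correct answers is ours):
```python
total_questions = 517
gpt35_accuracy = 0.48            # GPT-3.5 answered 48% of questions correctly
recovered = 6 / 21               # GPT-4 fixed 6 of 21 sampled GPT-3.5 failures

gpt35_correct = total_questions * gpt35_accuracy
gpt35_wrong = total_questions - gpt35_correct
gpt4_estimate = gpt35_correct + gpt35_wrong * recovered

print(f"Estimated GPT-4 accuracy: {gpt4_estimate / total_questions:.1%}")  # ~62.9%
```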
12.1.2 Google’s Search AI Summaries
- The problem has nothing to do with training data. There are two primary problems.
- 1. Google's results aren't generated by the AI; the AI just paraphrases search results. Literally, it just reads the search results and "summarizes" them for you
- 2. Because it's just a summary, the model they use is a very weak one. It's not supposed to think critically; it's just supposed to turn a few web pages into a paragraph.
- With actual AI-generated results, one-off satire articles like this don't matter because they're "intellectual outliers": they're both rare and directly contradicted by a ton of other data. In addition, assistants like ChatGPT are actually trained to "think" about the response they're giving, not just instructed to summarize web results.
- If you asked the same model without the search results, it almost certainly wouldn't say anything about eating rocks or putting glue on pizza. Problems like this arise when a model that isn't trained to think critically about what it's summarizing is asked to summarize search results anyway.
- Also, many of the examples circulating on social media are edited and fake, which was confirmed by Google itself.
- Google’s response: https://blog.google/products/search/ai-overviews-update-may-2024/
12.1.3 Debunk of “Has Generative AI Already Peaked?” by Computerphile (or the paper “No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance”)
- The claim of the video/paper is that AI will plateau along a logarithmic curve because there is not enough training data for very specific information, like different tree species. This won’t prevent AGI: most humans do not know such specific information either and can only learn it if given enough training. The same can be done for AI by fine-tuning on that data, such as training it on a dataset of trees labeled with their species. Even rudimentary neural networks have been capable of this for well over a decade, such as classifying the CIFAR-10 dataset with convolutional neural networks (a minimal sketch follows after the next point).
- Synthetic data can also provide nigh infinite training data (see section 15)
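- The fine-tuning point above, as a minimal sketch: training a rudimentary CNN on labeled categories (CIFAR-10 via torchvision; the architecture and hyperparameters are illustrative):
```python
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T

transform = T.Compose([T.ToTensor(), T.Normalize((0.5,) * 3, (0.5,) * 3)])
train_set = torchvision.datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

# A rudimentary CNN is enough for decent CIFAR-10 accuracy.
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(64 * 8 * 8, 10),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(2):  # a couple of epochs already beats random guessing
    for images, labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
```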
- Our state space duality (SSD) framework allows us to design a new architecture (Mamba-2) whose core layer is a refinement of Mamba’s selective SSM that is 2-8× faster, while continuing to be competitive with Transformers on language modeling: https://arxiv.org/pdf/2405.21060
- An 8B-3.5T hybrid SSM model gets better accuracy than an 8B-3.5T transformer trained on the same dataset:
- * 7% attention, the rest is Mamba2
- * MMLU jumps from 50 to 53.6%
- * Training efficiency is the same
- * Inference cost is much less
- Google DeepMind's JEST method can reduce AI training time by a factor of 13 and decreases computing power demand by 90%. The method uses another pretrained reference model to select data subsets for training based on their "collective learnability": https://arxiv.org/html/2406.17711v1
12.1.4 “Vision language models are blind” Study
- The VLMs tokenize images just like they tokenize text. As a result, it is difficult for them to see small details like red circles surrounding letters in a word (which is one of the tasks in the study) just like how it is difficult for them to count letters in a word.
- LLMs can create images using code based on abstract requests even if they were never trained on any: https://news.mit.edu/2024/understanding-visual-knowledge-language-models-0617
12.1.5. Real Photograph “Won” AI Art Competition
12.1.6. ChatGPT Plagiarized NYT Articles
- OpenAI said in its filing that it took the Times "tens of thousands of attempts to generate the highly anomalous results."
- "In the ordinary course, one cannot use ChatGPT to serve up Times articles at will," OpenAI said.
- OpenAI's filing also said that it and other AI companies would eventually win their cases based on the fair-use question.
- "The Times cannot prevent AI models from acquiring knowledge about facts, any more than another news organization can prevent the Times itself from re-reporting stories it had no role in investigating," OpenAI said
12.1.7. Government Study Finds AI worse than humans in every way at summarizing information
- Used LLAMA 2 70b
- The report itself notes limitations and context: the model used has already been superseded by more capable ones that may summarize better, and Amazon improved the model’s performance by refining its prompts and inputs, suggesting further gains are possible. It even expresses optimism that this task may one day be competently undertaken by machines.
12.1.8. “Generative AI's Illusory Case for Fair Use” Study
Original Source: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4924997
- False claims in the abstract alone:
- >“ [LLMs] do not "know" anything independently of the works on which they are trained, so their output is a function of the copied materials.”
- >“The training works thus do not disappear, as claimed, but are encoded, token by token, into the model and relied upon to generate output.”
- Completely impossible: small language models such as Microsoft’s Phi series are only a few gigabytes in size, while the text of Wikipedia alone is nearly 20 GB without media and training corpora are orders of magnitude larger, so a model cannot possibly store its training works “token by token.”
- Additionally, it is very difficult to reproduce training data. NYT attempted to do it and it took tens of thousands of tries to produce a small snippet of an article:
- OpenAI said in its filing that it took the Times "tens of thousands of attempts to generate the highly anomalous results."
- "In the ordinary course, one cannot use ChatGPT to serve up Times articles at will," OpenAI said.
- OpenAI's filing also said that it and other AI companies would eventually win their cases based on the fair-use question.
- "The Times cannot prevent AI models from acquiring knowledge about facts, any more than another news organization can prevent the Times itself from re-reporting stories it had no role in investigating," OpenAI said
- >Like an LLM, an AI image generator relies on encoded representations of training works to generate its output.
- >the determination of fair use turned on the fact that the alleged infringer was not seeking to capitalize on expressive content-exactly the opposite of generative AI.
- There are many open-weight models released for free, such as Meta’s LLAMA, Google’s Gemma, and Microsoft’s Phi series.
- >Generative AI's claim to fair use is further hampered by the propensity of models to generate copies and derivatives of training works, which are presumptively infringing
- >Another claim that has been consistently dismissed by courts is that AI models are infringing derivative works of the training materials. The law defines a derivative work as “a work based upon one or more preexisting works, such as a translation, musical arrangement, … art reproduction, abridgment, condensation, or any other form in which a work may be recast, transformed, or adapted.” To most of us, the idea that the model itself (as opposed to, say, outputs generated by the model) can be considered a derivative work seems to be a stretch. The courts have so far agreed. On November 20, 2023, the court in Kadrey v. Meta Platforms said it is “nonsensical” to consider an AI model a derivative work of a book just because the book is used for training.
- Similarly, claims that all AI outputs should be automatically considered infringing derivative works have been dismissed by courts, because the claims cannot point to specific evidence that an instance of output is substantially similar to an ingested work. In Andersen v. Stability AI, plaintiffs tried to argue “that all elements of … Anderson’s copyrighted works … were copied wholesale as Training Images and therefore the Output Images are necessarily derivative;” the court dismissed the argument because—besides the fact that plaintiffs are unlikely able to show substantial similarity—“it is simply not plausible that every Training Image used to train Stable Diffusion was copyrighted … or that all … Output Images rely upon (theoretically) copyrighted Training Images and therefore all Output images are derivative images. … [The argument for dismissing these claims is strong] especially in light of plaintiffs’ admission that Output Images are unlikely to look like the Training Images.”
- >In addition, some AI models rely on retrieval-augmented generation, or RAG, which searches out and copies materials from online sources without permission to respond to user prompts (for example, a query concerning an event that postdates the training of the underlying model). Here again, the materials are being copied and exploited to make use of expressive content.
- That was ALL just from the abstract
12.2 “AI is bad at math”
12.3 “Out-of-touch bosses and managers are forcing workers to use AI even if it is unnecessary, ineffective, or even harmful.”
- Notably, of the seven million British workers that Deloitte extrapolates have used GenAI at work, only 27% reported that their employer officially encouraged this behavior.
- Although Deloitte doesn’t break down the at-work usage by age and gender, it does reveal patterns among the wider population. Over 60% of people aged 16-34 (broadly, Gen Z and younger millennials) have used GenAI, compared with only 14% of those between 55 and 75 (older Gen Xers and Baby Boomers).
12.4 “LLMs always agree with the user, even when they are wrong”
12.5. “LLMs will level out at human level”
- AlphaGo, AlphaZero, Stockfish, and OpenAI Five beat humans at Go, chess, and Dota 2
- Computers are much faster than humans, don’t get tired or injured, and anything they learn can be broadcast globally in seconds
- They can be trained exclusively on high-quality data, combining the best of every domain
- 2,278 AI researchers were surveyed in 2023 and estimated that there is a 50% chance of AI being superior to humans in all possible tasks by 2047: https://aiimpacts.org/wp-content/uploads/2023/04/Thousands_of_AI_authors_on_the_future_of_AI.pdf
- In the 2022 edition of the survey, the corresponding estimate was 2060, and many of their predictions have already come true ahead of schedule, such as AI answering queries using the web, transcribing speech, translating, and reading text aloud, all of which they thought would only happen after 2025.
12.6. “AI Should Be Doing My Dishes and Laundry Instead”
- It already is. That’s what washing machines and dishwashers are for.
- Robots are harder to build than AI software, but they are being built
12.7. Goldman Sachs Report On AI Being Overhyped
- The dotcom bubble also happened but the Internet is far larger than it was back then
- Goldman Sachs has other reports on AI that are very bullish
12.8. HIVE AI Detector
12.9. LLMs Can’t Count Letters/LLMs Can’t Compare Numbers/LLMs Can’t Solve Riddles
- o1-preview and mini do it well:
- This is an issue with tokenizers. The LLM groups characters together into tokens, so it cannot analyze the characters individually (see the toy illustration at the end of this section)
- >Feucht points to “byte-level” state space models like MambaByte, which can ingest far more data than transformers without a performance penalty by doing away with tokenization entirely. MambaByte, which works directly with raw bytes representing text and other data, is competitive with some transformer models on language-analyzing tasks while better handling “noise” like words with swapped characters, spacing and capitalized characters.
- Usually, asking it to break it down step by step resolves this problem
- There may also be overfitting
- Ex. It may incorrectly believe 9.11 > 9.9 because that ordering is true for software versioning (it may also be a tokenization problem)
- It may also incorrectly answer riddles for this reason, like “in which order should I carry the chickens or the fox over the river”
- The failure doesn’t occur with the original phrasing, though. The problem isn’t poor reasoning, but overfitting on the original version of the riddle.
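- A toy Python illustration of both failure modes (the tokenization shown is invented for illustration; real tokenizers split words differently):
```python
# Token-level view: the model sees a few opaque token IDs, not 10 characters,
# so "how many r's?" is not directly answerable from what it sees.
tokens = ["str", "aw", "berry"]           # hypothetical tokenization of "strawberry"
print(len(tokens))                        # 3 tokens, not 10 characters

# Character-level (step-by-step) view recovers the right answer:
word = "".join(tokens)
print(sum(ch == "r" for ch in word))      # 3

# Overfitting on versioning conventions: as version strings, 9.11 > 9.9,
# but as decimal numbers, 9.11 < 9.9.
print([int(x) for x in "9.11".split(".")] > [int(x) for x in "9.9".split(".")])  # True
print(9.11 > 9.9)                                                                # False
```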
12.10. LLMs Can’t Learn Continuously
12.11. Apple is Pessimistic On AI
12.12. AI Companies Are Training On Benchmarks to Inflate Their Scores
- If LLMs were specifically trained to score well on benchmarks, they could very easily score 100% on all of them by purposefully overfitting (a toy illustration follows after this list): https://arxiv.org/pdf/2309.08632
- The fact that they don’t shows companies are not just cheating
- OpenAI still hasn’t hard coded their LLMs to be correct for common questions like counting the number of “r”s in “strawberry” and finding the greater value between 9.11 and 9.8. If they wanted to cheat to increase benchmark scores, why wouldn’t they solve these issues?
- Some benchmarks like the one used by Scale.ai and the test dataset of MathVista do not release their testing data to the public, so it is impossible to train on them.
- Other benchmarks like LiveBench update every month so training on the dataset will not have any lasting effects
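- A toy illustration of the overfitting point above (the two-item “benchmark” is invented for illustration):
```python
# A "model" that memorized the benchmark's answer key scores 100% on that
# benchmark while learning nothing general.
benchmark = {"What is 2+2?": "4", "Capital of France?": "Paris"}
memorized_model = dict(benchmark)  # i.e., trained directly on the test set

score = sum(memorized_model.get(q) == a for q, a in benchmark.items()) / len(benchmark)
print(f"{score:.0%}")  # 100% -- a score pattern real frontier models do not show
```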
Energy Use/Water Use/Environmental Impact/Cost/Sustainability
See section 3.3 for hardware improvements
- >Our findings reveal that AI systems emit between 130 and 1500 times less CO2e per page of text generated compared to human writers, while AI illustration systems emit between 310 and 2900 times less CO2e per image than their human counterparts.
- >at full utilization, we estimate OpenAI could serve all of its gpt-4o API traffic with less than 10% of their provisioned 60k GPUs.
- Most of their costs are in research and employee payroll, both of which can be cut if they need to go lean.
- Only about 1/3 of their compute costs go to running existing models, given a roughly 2:1 cost ratio for training vs. inference
- Data centers that host AI are cooled with a closed loop. The water doesn’t even touch computer parts, it just carries the heat away, which is radiated elsewhere. It does not evaporate or get polluted in the loop. Water is not wasted or lost in this process.
- “The most common type of water-based cooling in data centers is the chilled water system. In this system, water is initially cooled in a central chiller, and then it circulates through cooling coils. These coils absorb heat from the air inside the data center. The system then expels the absorbed heat into the outside environment via a cooling tower. In the cooling tower, the now-heated water interacts with the outside air, allowing heat to escape before the water cycles back into the system for re-cooling.”
- Data centers do not use a lot of water. Microsoft’s data center in Goodyear uses 56 million gallons of water a year, while the city produces 4.9 BILLION gallons per year just from surface water (about 1.1% of that supply) and, with future expansion, can produce 5.84 billion gallons (source: https://www.goodyearaz.gov/government/departments/water-services/water-conservation). It produces more from groundwater, but the source doesn't say how much. Additionally, the city actively recharges the aquifer by sending treated effluent to a Soil Aquifer Treatment facility, which provides needed recharge and stores water underground for future needs. Also, the Goodyear facility doesn't just host AI; we have no idea how much of the compute is used for AI, and it's probably less than half.
- Everything consumes power and resources, including superfluous things like video games and social media. Why is AI not allowed to when other, less useful things can?
- Most of their spending is on research. Even if models plateau, they can still easily profit by halting further training and selling their extra GPUs, leaving inference on existing models as their only major cost, which is much cheaper than training from scratch.
13.1. AI Is Becoming More Efficient
- Google DeepMind's JEST method can reduce AI training time by a factor of 13 and decreases computing power demand by 90%. The method uses another pretrained reference model to select data subsets for training based on their "collective learnability": https://arxiv.org/html/2406.17711v1
- In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption. More profoundly, the 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs that are both high-performance and cost-effective. Furthermore, it enables a new computation paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.
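- A minimal sketch of the “absmean” ternary weight quantization the paper describes (PyTorch; per-tensor scaling shown for simplicity, with the scale kept separately for dequantization in a real implementation):
```python
import torch

def quantize_ternary(w: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # Scale weights by their mean absolute value, then round and clip so every
    # entry lands in {-1, 0, 1}.
    gamma = w.abs().mean()
    return (w / (gamma + eps)).round().clamp_(-1, 1)

w = torch.randn(4, 4)
print(quantize_ternary(w))  # entries are all -1, 0, or 1
```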
- Large but sparsely activated DNNs can consume <1/10th the energy of large, dense DNNs without sacrificing accuracy despite using as many or even more parameters. Geographic location matters for ML workload scheduling since the fraction of carbon-free energy and resulting CO2e vary ~5X-10X, even within the same country and the same organization. We are now optimizing where and when large models are trained. Specific datacenter infrastructure matters, as Cloud datacenters can be ~1.4-2X more energy efficient than typical datacenters, and the ML-oriented accelerators inside them can be ~2-5X more effective than off-the-shelf systems. Remarkably, the choice of DNN, datacenter, and processor can reduce the carbon footprint up to ~100-1000X.
- In this work, we show that MatMul operations can be completely eliminated from LLMs while maintaining strong performance at billion-parameter scales. Our experiments show that our proposed MatMul-free models achieve performance on-par with state-of-the-art Transformers that require far more memory during inference at a scale up to at least 2.7B parameters. We investigate the scaling laws and find that the performance gap between our MatMul-free models and full precision Transformers narrows as the model size increases. We also provide a GPU-efficient implementation of this model which reduces memory usage by up to 61% over an unoptimized baseline during training. By utilizing an optimized kernel during inference, our model's memory consumption can be reduced by more than 10x compared to unoptimized models. To properly quantify the efficiency of our architecture, we build a custom hardware solution on an FPGA which exploits lightweight operations beyond what GPUs are capable of. We processed billion-parameter scale models at 13W beyond human readable throughput, moving LLMs closer to brain-like efficiency. This work not only shows how far LLMs can be stripped back while still performing effectively, but also points at the types of operations future accelerators should be optimized for in processing the next generation of lightweight LLMs.
- Deepsilicon runs neural nets with 5x less RAM and ~20x faster. They are building software and custom silicon for it: https://x.com/sdianahu/status/1833186687369023550
- “representing transformer models as ternary values (-1, 0, 1) eliminates the need for computationally expensive floating-point math”
- Runs SOTA models
- Do you know your LLM uses less than 1% of your GPU at inference? Too much time is wasted on KV cache memory access ➡️ We tackle this with the 🎁 Block Transformer: a global-to-local architecture that speeds up decoding up to 20x: https://x.com/itsnamgyu/status/1807400609429307590
- Mixture of A Million Experts. Daniel Jeffries: "Reduces inference cost and memory usage, scales to millions of experts, oh and just happens to overcome catastrophic forgetting and enable life long learning for the model." https://arxiv.org/abs/2407.04153
- This paper introduces PEER (parameter efficient expert retrieval), a novel layer design that utilizes the product key technique for sparse retrieval from a vast pool of tiny experts (over a million). Experiments on language modeling tasks demonstrate that PEER layers outperform dense FFWs and coarse-grained MoEs in terms of performance-compute trade-off. By enabling efficient utilization of a massive number of experts, PEER unlocks the potential for further scaling of transformer models while maintaining computational efficiency.
- It has been proved that by simply adding new experts and regularizing them properly, MoE models can adapt to continuous data streams
- By leveraging MLLMs with prompts, E5-V effectively bridges the modality gap between different types of inputs, demonstrating strong performance in multimodal embeddings even without fine-tuning. We propose a single modality training approach for E5-V, where the model is trained exclusively on text pairs. This method demonstrates significant improvements over traditional multimodal training on image-text pairs, while reducing training costs by approximately 95%. Additionally, this approach eliminates the need for costly multimodal training data collection. Extensive experiments across four types of tasks demonstrate the effectiveness of E5-V. As a universal multimodal model, E5-V not only achieves but often surpasses state-of-the-art performance in each task, despite being trained on a single modality.
- During patch-level training, we feed the language model shorter sequences of patches and train it to predict the next patch, thereby processing the majority of the training data at a significantly reduced computational cost. Following this, the model continues token-level training on the remaining training data to align with the inference mode. Experiments on a diverse range of models (370M-2.7B parameters) demonstrate that patch-level training can reduce overall computational costs to 0.5x, without compromising the model performance compared to token-level training.
- RGM, an active-inference, non-LLM approach that uses 90% less data (less need for synthetic data, lower energy footprint), achieves 99.8% accuracy on the MNIST benchmark and can train on less powerful devices: https://arxiv.org/pdf/2407.20292
- Use for Atari game performance: “This fast structure learning took about 18 seconds on a personal computer. “
- Use for MNIST dataset classification: For example, the variational procedures above attained state-of-the-art classification accuracy on a self-selected subset of test data after seeing 10,000 training images. Each training image was seen once, with continual learning (and no notion of batching). Furthermore, the number of training images actually used for learning was substantially smaller10 than 10,000; because active learning admits only those informative images that reduce expected free energy. This (Maxwell’s Demon) aspect of selecting the right kind of data for learning will be a recurrent theme in subsequent sections. Finally, the requisite generative model was self-specifying, given some exemplar data. In other words, the hierarchical depth and size of the requisite tensors were learned automatically within a few seconds on a personal computer.
- Based on the above guidance, a series of DiT-MoE experimentally achieves performance on par with dense networks yet requires much less computational load during inference. More encouragingly, we demonstrate the potential of DiT-MoE with synthesized image data, scaling diffusion model at a 15.5B parameter that attains a new SoTA FID-50K score of 1.80 in 512x512 resolution settings.
- For reference, Stable Diffusion XL is only about 3 billion parameters
- Compared to training from scratch (i.e., Pix2pix), we achieve a 15x training time acceleration for a new concept while obtaining even better image generation quality.
- The new GPT-4o is slightly better and 33% cheaper than the old one! Right now, it's only a tad below Sonnet 3.5 on Livebench!
- The company says GPT-4o mini outperforms industry leading small AI models on reasoning tasks involving text and vision. As small AI models improve, they are becoming more popular for developers due to their speed and cost efficiencies compared to larger models, such as GPT-4 Omni or Claude 3.5 Sonnet. They’re a useful option for high volume, simple tasks that developers might repeatedly call on an AI model to perform.
- The company claims its newest AI model scores 82% on MMLU, a benchmark to measure reasoning, compared to 79% for Gemini 1.5 Flash and 75% for Claude 3 Haiku, according to data from Artificial Analysis. On MGSM, which measures math reasoning, GPT-4o mini scored 87%, compared to 78% for Flash and 72% for Haiku.
- Further, OpenAI says GPT-4o mini is significantly more affordable to run than its previous frontier models, and more than 60% cheaper than GPT-3.5 Turbo. Today, GPT-4o mini supports text and vision in the API, and OpenAI says the model will support video and audio capabilities in the future.
- So, what took OpenAI so long? Godement said it was “pure prioritization” as the company was focused on creating bigger and better models like GPT-4, which took a lot of “people and compute efforts.”
- LMSYS arena shows it is better than GPT 4 and Claude 3 Opus
- https://x.com/terryyuezhuo/status/1813998867039617444
- GPT-4o mini on BigCodeBench-Hard is out:
- Complete Pass@1: 27.0
- Instruct Pass@1: 24.3
- Average: 25.7
- The average score is very close to Claude-3-Opus (26.0)!
- The key results from this work are, (1) Q-Sparse can achieve results comparable to those of baseline LLMs while being much more efficient at inference time; (2) We present an inference-optimal scaling law for sparsely-activated LLMs; (3) Q-Sparse is effective in different settings, including training-from-scratch, continue-training of off-the-shelf LLMs, and finetuning; (4) Q-Sparse works for both full-precision and 1-bit LLMs (e.g., BitNet b1.58). Particularly, the synergy of BitNet b1.58 and Q-Sparse (can be equipped with MoE) provides the cornerstone and a clear path to revolutionize the efficiency, including cost and energy consumption, of future LLMs.
- LLM routing can significantly reduce costs without compromising quality, with cost reductions of over 85% on MT Bench, 45% on MMLU, and 35% on GSM8K as compared to using only GPT-4, while still achieving 95% of GPT-4’s performance
- >📌 AI agent evaluations must be cost-controlled. Simple baselines like retrying or gradually increasing model temperature can match or outperform complex "state-of-the-art" agents on benchmarks like HumanEval while costing much less.
- 📌 Jointly optimizing accuracy and cost can yield better agent designs. By visualizing results as a Pareto curve of accuracy vs. inference cost, researchers can explore new design spaces. An example modification to the DSPy framework reduced costs by over 50% while maintaining accuracy on HotPotQA.
- Finetuning NeMo 12B fits in 12GB of VRAM, is 2x faster, uses 60% less VRAM with no accuracy degradation, and works for free in a Google Colab: https://x.com/rohanpaul_ai/status/1814482763472613551
- LookupViT: https://x.com/fly51fly/status/1814058398847058044
- - LookupViT introduces a novel Multi-Head Bidirectional Cross-attention (MHBC) module that enables effective information flow with significant computational savings.
- - LookupViT reduces quadratic dependence on number of lookup tokens and achieves significant FLOPs reduction compared to vanilla ViT blocks.
- our method, called SparseCraft, achieves state-of-the-art performances both in novel-view synthesis and reconstruction from sparse views in standard benchmarks, while requiring less than 10 minutes for training.
- using only 37M publicly available real and synthetic images, we train a 1.16 billion parameter sparse transformer with only $1,890 economical cost and achieve a 12.7 FID in zero-shot generation on the COCO dataset. Notably, our model achieves competitive FID and high-quality generations while incurring 118x lower cost than stable diffusion models and 14x lower cost than the current state-of-the-art approach that costs $28,400.
- This observation motivates applying a "compute-optimal" scaling strategy, which acts to most effectively allocate test-time compute adaptively per prompt. Using this compute-optimal strategy, we can improve the efficiency of test-time compute scaling by more than 4x compared to a best-of-N baseline. Additionally, in a FLOPs-matched evaluation, we find that on problems where a smaller base model attains somewhat non-trivial success rates, test-time compute can be used to outperform a 14x larger model.
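- For reference, a sketch of the best-of-N baseline that such test-time-compute strategies are compared against (`generate` and `verifier_score` are hypothetical stand-ins for a sampling call and a verifier/reward model):
```python
from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              verifier_score: Callable[[str, str], float],
              n: int = 16) -> str:
    # Sample N candidate answers, score each with the verifier, keep the best.
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: verifier_score(prompt, c))
```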
- Small models are much cheaper to use
- The released model LongLLaVA (Long-Context Large Language and Vision Assistant) is the first hybrid MLLM, which achieved a better balance between efficiency and effectiveness. LongLLaVA not only achieves competitive results across various benchmarks, but also maintains high throughput and low memory consumption. Especially, it could process nearly a thousand images on a single A100 80GB GPU, showing promising application prospects for a wide range of tasks.
- we introduce LongRecipe, an efficient training strategy for extending the context window of LLMs, including impactful token analysis, position index transformation, and training optimization strategies. It simulates long-sequence inputs while maintaining training efficiency and significantly improves the model's understanding of long-range dependencies. Experiments on three types of LLMs show that LongRecipe can utilize long sequences while requiring only 30% of the target context window size, and reduces computational training resource over 85% compared to full sequence training. Furthermore, LongRecipe also preserves the original LLM's capabilities in general tasks. Ultimately, we can extend the effective context window of open-source LLMs from 8k to 128k, achieving performance close to GPT-4 with just one day of dedicated training using a single GPU with 80G memory. Our code is released at https://github.com/zhiyuanhubj/LongRecipe
- we reduce cost by equipping LLMs with an explicit memory format cheaper than model parameters and text retrieval-augmented generation (RAG).
- Conceptually, with most of its knowledge externalized to explicit memory, the LLM can enjoy a smaller parameter size, training cost, and inference cost, all proportional to the amount of remaining “abstract knowledge”.
- As a preliminary proof of concept, we train from scratch a 2.4B LLM, which achieves better performance than much larger LLMs as well as RAG models, and maintains higher decoding speed than RAG.
- We introduce a memory circuitry theory to support the externalization of knowledge, and present novel techniques including a memory sparsification mechanism that makes storage tractable and a two-stage pretraining scheme that facilitates memory formation.
- Glaze/Nightshade takes 15 minutes of computation for one image. How is that environmentally friendly?
AI Inbreeding/AI Training Off Its Own Output/AI Running Out Of Data
- Auto Evol allows the training of WizardLM2 to be conducted with a nearly unlimited number and variety of synthetic data.
- Auto Evol-Instruct automatically designs evolving methods that make given instruction data more complex, enabling almost cost-free adaptation to different tasks by only changing the input data of the framework …This optimization process involves two critical stages: (1) Evol Trajectory Analysis: The optimizer LLM carefully analyzes the potential issues and failures exposed in instruction evolution performed by evol LLM, generating feedback for subsequent optimization. (2) Evolving Method Optimization: The optimizer LLM optimizes the evolving method by addressing these identified issues in feedback. These stages alternate and repeat to progressively develop an effective evolving method using only a subset of the instruction data. Once the optimal evolving method is identified, it directs the evol LLM to convert the entire instruction dataset into more diverse and complex forms, thus facilitating improved instruction tuning.
- Our experiments show that the evolving methods designed by Auto Evol-Instruct outperform the Evol-Instruct methods designed by human experts in instruction tuning across various capabilities, including instruction following, mathematical reasoning, and code generation. On the instruction following task, Auto Evol-Instruct can achieve an improvement of 10.44% over the Evol method used by WizardLM-1 on MT-bench; on the code task HumanEval, it can achieve a 12% improvement over the method used by WizardCoder; on the math task GSM8k, it can achieve a 6.9% improvement over the method used by WizardMath.
- With the new technology of Auto Evol-Instruct, the evolutionary synthesis data of WizardLM-2 has scaled up from the three domains of chat, code, and math in WizardLM-1 to dozens of domains, covering tasks in all aspects of large language models. This allows Arena Learning to train and learn from an almost infinite pool of high-difficulty instruction data, fully unlocking all the potential of Arena Learning.
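- A hedged sketch of the two-stage loop described above (`llm` is a hypothetical chat-completion wrapper and the prompts are paraphrased, not the paper’s actual templates):
```python
from typing import Callable, List

def auto_evol_instruct(seed_instructions: List[str],
                       evolving_method: str,
                       llm: Callable[[str], str],
                       rounds: int = 3) -> str:
    for _ in range(rounds):
        # Stage 1: Evol Trajectory Analysis -- evolve a small subset and have
        # the optimizer LLM surface issues/failures in the evolved instructions.
        evolved = [llm(f"{evolving_method}\n\nInstruction: {i}") for i in seed_instructions[:32]]
        feedback = llm("List issues in these evolved instructions:\n" + "\n".join(evolved))
        # Stage 2: Evolving Method Optimization -- rewrite the method to fix them.
        evolving_method = llm("Improve this evolving method given the feedback.\n"
                              f"Method:\n{evolving_method}\n\nFeedback:\n{feedback}")
    # The optimized method is then applied to the full instruction dataset.
    return evolving_method
```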
- Trained on the largest competition math dataset ever released, with 860K problem-solution pairs created with GPT-4
- “We selected approximately 70k problems from the NuminaMath-CoT dataset, focusing on those with numerical outputs, most of which are integers. We then utilized a pipeline leveraging GPT-4 to generate TORA-like reasoning paths, executing the code and producing results until the solution was complete. We filtered out solutions where the final answer did not match the reference and repeated this process three times to ensure accuracy and consistency. This iterative approach allowed us to generate high-quality TORA data efficiently.”
- For AMC 10 2023, the average score is 43%.
- AIME Floor: 70% (top ~6%)
- Distinction: 75%
- Distinguished Honor Roll: 90%
- announcing our first suite of specialized language models for document processing tasks (OCR correction, text segmentation, bibliographic extraction) and a new major multimodal dataset we used to train them, Finance Commons: https://x.com/Dorialexander/status/1814234566158118977
- - This paper investigates how scaling up the datastore in retrieval-augmented language models improves performance without signs of saturation on both language modeling and downstream tasks.
- - The authors built MASSIVEDS, an open-sourced 1.4 trillion token datastore covering a diverse set of domains. MASSIVEDS is the largest open-sourced datastore for studying retrieval scaling.
- - They designed an efficient pipeline to make the study computationally tractable by sharing indexing and retrieval computation across different datastore configurations.
- - Experiments show datastore scaling brings consistent improvements in language modeling perplexity on web data and scientific papers.
- - On downstream tasks, datastore scaling significantly boosts performance on knowledge-intensive QA tasks like TriviaQA and Natural Questions. Smaller retrieval models can match or exceed their larger LM-only counterparts.
- - Datastore scaling also improves performance on reasoning tasks like MMLU, but the improvements are more modest, indicating potential need for more in-domain data.
- - Compared to LM-only models, retrieval models achieve superior compute-optimal scaling curves by offloading FLOPs from model pretraining to datastore indexing.
- - Analysis reveals the retriever can stay robust to out-of-domain data and tend to retrieve relevant documents even from a broad datastore.
- Michael Gerstenhaber, product lead at Anthropic, says that the improvements are the result of architectural tweaks and new training data, including AI-generated data. Which data specifically? Gerstenhaber wouldn’t disclose, but he implied that Claude 3.5 Sonnet draws much of its strength from these training sets.
- Google DeepMind's JEST method can reduce AI training time by a factor of 13 and decreases computing power demand by 90%. The method uses another pretrained reference model to select data subsets for training based on their "collective learnability: https://arxiv.org/html/2406.17711v1
- >In this work, we show that reducing hallucination on a synthetic task can also reduce hallucination on real-world downstream tasks. Our method, SynTra, first designs a synthetic task where hallucinations are easy to elicit and measure. It next optimizes the LLM's system message via prefix-tuning on the synthetic task, and finally transfers the system message to realistic, hard-to-optimize tasks. Across three realistic abstractive summarization tasks, SynTra reduces hallucination for two 13B-parameter LLMs using only a synthetic retrieval task for supervision. We also find that optimizing the system message rather than the model weights can be critical; fine-tuning the entire model on the synthetic task can counterintuitively increase hallucination. Overall, SynTra demonstrates that the extra flexibility of working with synthetic data can help mitigate undesired behaviors in practice.
- >Data quality: Unlike real-world data, synthetic data removes the inaccuracies or errors that can occur when working with data that is being compiled in the real world. Synthetic data can provide high quality and balanced data if provided with proper variables. The artificially-generated data is also able to fill in missing values and create labels that can enable more accurate predictions for your company or business.
- >Our method employs pretrained text-to-image model to synthesize image embeddings from captions generated by an LLM. Despite the text-to-image model and VLM initially being trained on the same data, our approach leverages the image generator’s ability to create novel compositions, resulting in synthetic image embeddings that expand beyond the limitations of the original dataset. Extensive experiments demonstrate that our VLM, finetuned on synthetic data achieves comparable performance to models trained solely on human-annotated data, while requiring significantly less data. Furthermore, we perform a set of analyses on captions which reveals that semantic diversity and balance are key aspects for better downstream performance. Finally, we show that synthesizing images in the image embedding space is 25% faster than in the pixel space. We believe our work not only addresses a significant challenge in VLM training but also opens up promising avenues for the development of self-improving multi-modal models.
- >“We systematically investigate whether synthetic data from current state-of-the-art text-to-image generation models are readily applicable for image recognition. Our extensive experiments demonstrate that synthetic data are beneficial for classifier learning in zero-shot and few-shot recognition, bringing significant performance boosts and yielding new state-of-the-art performance. Further, current synthetic data show strong potential for model pre-training, even surpassing the standard ImageNet pre-training. We also point out limitations and bottlenecks for applying synthetic data for image recognition, hoping to arouse more future research in this direction.”
- >By leveraging MLLMs with prompts, E5-V effectively bridges the modality gap between different types of inputs, demonstrating strong performance in multimodal embeddings even without fine-tuning. We propose a single modality training approach for E5-V, where the model is trained exclusively on text pairs. This method demonstrates significant improvements over traditional multimodal training on image-text pairs, while reducing training costs by approximately 95%. Additionally, this approach eliminates the need for costly multimodal training data collection. Extensive experiments across four types of tasks demonstrate the effectiveness of E5-V. As a universal multimodal model, E5-V not only achieves but often surpasses state-of-the-art performance in each task, despite being trained on a single modality.
- Scaling Synthetic Data Creation with 1,000,000,000 Personas
- - Presents a collection of 1B diverse personas automatically curated from web data
- - Massive gains on MATH: 49.6 ->64.9
- An open source Tool Use full finetune of Llama 3 that reaches the #1 position on BFCL beating all other models, including proprietary ones like Claude Sonnet 3.5, GPT-4 Turbo, GPT-4o and Gemini 1.5 Pro.
- The model has been trained on synthetic data only. This is a powerful full finetune, not a LoRA. Yes, we've checked rigorously for overfitting using the LMSYS described robust decontamination techniques, they only score 5.6% on SFT synthetic data and 1.3% on synthetic DPO data.
- >3x faster token prediction means 3x cheaper and on top of that it seems to greatly increase coding, summarization, and mathematical reasoning abilities. Best of all the improvements have shown to only become more significant with larger models (13b+ according to the paper). Unlike some other research where improvements are mostly seen in smaller models and won't advance the frontier, this is infact worse performing on smaller models and shows great potential at scale.
- Has a cutoff date of 6/12/24 with no “inbreeding” issues
- Based on the above guidance, a series of DiT-MoE models experimentally achieves performance on par with dense networks while requiring much less computational load during inference. More encouragingly, we demonstrate the potential of DiT-MoE with synthesized image data, scaling the diffusion model to 15.5B parameters and attaining a new SoTA FID-50K score of 1.80 at 512x512 resolution.
- For reference, Stable Diffusion XL is only about 3 billion parameters
- Nemotron-4 340B, a family of models optimized for NVIDIA NeMo and NVIDIA TensorRT-LLM, includes cutting-edge instruct and reward models, and a dataset for generative AI training.
- The Nemotron-4 340B Instruct model creates diverse synthetic data that mimics the characteristics of real-world data, helping improve data quality to increase the performance and robustness of custom LLMs across various domains.
- Then, to boost the quality of the AI-generated data, developers can use the Nemotron-4 340B Reward model to filter for high-quality responses. Nemotron-4 340B Reward grades responses on five attributes: helpfulness, correctness, coherence, complexity and verbosity. It’s currently first place on the Hugging Face RewardBench leaderboard, created by AI2, for evaluating the capabilities, safety and pitfalls of reward models.
- RGM, an active-inference, non-LLM approach using 90% less data (less need for synthetic data, lower energy footprint). 99.8% accuracy on the MNIST benchmark while training on 90% less data and on less powerful devices: https://arxiv.org/pdf/2407.20292
- >Use for Atari game performance: “This fast structure learning took about 18 seconds on a personal computer.”
- >Use for MNIST dataset classification: For example, the variational procedures above attained state-of-the-art classification accuracy on a self-selected subset of test data after seeing 10,000 training images. Each training image was seen once, with continual learning (and no notion of batching). Furthermore, the number of training images actually used for learning was substantially smaller than 10,000; because active learning admits only those informative images that reduce expected free energy. This (Maxwell’s Demon) aspect of selecting the right kind of data for learning will be a recurrent theme in subsequent sections. Finally, the requisite generative model was self-specifying, given some exemplar data. In other words, the hierarchical depth and size of the requisite tensors were learned automatically within a few seconds on a personal computer.
- The Nvidia Research team has developed a method to efficiently create smaller, accurate language models by using structured weight pruning and knowledge distillation, offering several advantages for developers. A minimal sketch of the distillation half follows below.
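This is the textbook Hinton-style distillation loss in PyTorch, not Nvidia's exact Minitron recipe: the pruned "student" is trained to match the teacher's softened output distribution while still fitting the hard labels.

```python
# Textbook knowledge-distillation loss (a sketch, not Nvidia's code).
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      T: float = 2.0, alpha: float = 0.5):
    # Soft targets: KL between temperature-softened distributions,
    # scaled by T^2 to keep gradient magnitudes comparable.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```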
- Strawberry can solve math problems it hasn't seen before—something today’s chatbots cannot reliably do
- When given additional time to “think,” the Strawberry model can also answer customers’ questions about more subjective topics, such as product marketing strategies. To demonstrate Strawberry’s prowess with language-related tasks, OpenAI employees have shown their co-workers how Strawberry can, for example, solve New York Times Connections, a complex word puzzle
- Its sales of LLMs to corporations and of ChatGPT subscriptions have roughly tripled to $283 million in monthly revenue compared to a year ago
- OpenAI’s prospects rest in part on the eventual launch of a new flagship LLM it is currently developing, code-named Orion
- Using Strawberry to generate higher-quality training data could help OpenAI reduce the number of errors its models generate, otherwise known as hallucinations, said Alex Graveley, CEO of agent startup Minion AI and former chief architect of GitHub Copilot.
- Imagine “a model without hallucinations, a model where you ask it a logic puzzle and it’s right on the first try,” Graveley said. The reason why the model is able to do that is because “there is less ambiguity in the training data, so it’s guessing less.”
- “We feel like we have enough [data] for this next model,” Altman said at an event in May, likely referring to Orion. “We have done all sorts of experiments including generating synthetic data.”
- Strawberry has its roots in research. It was started years ago by Ilya Sutskever, then OpenAI's chief scientist. He recently left to start a competing AI lab. Before he left, OpenAI researchers Jakub Pachocki and Szymon Sidor built on Sutskever's work by developing a new math-solving model, Q*, alarming some researchers focused on AI safety.
- Last year, in the leadup to Q*, OpenAI researchers developed a variation of a concept known as test-time computation, meant to boost LLMs’ problem-solving abilities. The method gives them the opportunity to spend more time considering all parts of a command or question someone has asked the model to execute. At the time, Sutskever published a blog post related to this work.
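OpenAI has not published Strawberry's method, but the simplest instance of test-time computation is best-of-N sampling: spend more inference compute by drawing many candidate answers and keeping the one a verifier scores highest. A minimal sketch, with hypothetical `generate` and `score` callables:

```python
# Minimal sketch of test-time computation via best-of-N sampling.
# `generate` and `score` are hypothetical callables (e.g. an LLM
# sampler and a reward/verifier model); not OpenAI's actual method.

def best_of_n(question: str, generate, score, n: int = 16) -> str:
    # More inference-time compute: sample many candidate answers...
    candidates = [generate(question, temperature=1.0) for _ in range(n)]
    # ...then keep whichever one the verifier rates highest.
    return max(candidates, key=lambda ans: score(question, ans))
```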
- In real-world booking experiments on Open Table, MultiOn’s Agents drastically improved the zero-shot performance of the LLaMa-3 model from an 18.6% success rate to 81.7%, a 340% jump after just one day of autonomous data collection and further to 95.4% with online search. These results highlight our method’s efficiency and ability for autonomous web agent improvement.
- Guided Search with MCTS: This technique autonomously generates data by exploring different actions and web-pages, balancing exploration and exploitation. MCTS expands the action space using high sampling temperatures and diverse prompting, ensuring diverse and optimal trajectory collections.
- AI Self-Critique: At each step, AI-based self-critique provides valuable feedback, refining the agent's decision-making process. This step-level feedback is crucial for long-horizon tasks, where sparse signals often lead to learning difficulties.
- Direct Preference Optimization: The DPO algorithm fine-tunes the model by constructing preference pairs from MCTS-generated data. This off-policy training method allows the model to learn effectively from aggregate datasets including the sub-optimal branches explored during search, improving success rates in complex environments.
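For reference, the core DPO objective on a batch of preference pairs looks like this; a minimal PyTorch sketch assuming precomputed sequence log-probabilities (Agent Q's step-level, MCTS-derived pairs are a richer instance of the same loss).

```python
# Minimal sketch of the DPO loss on one batch of preference pairs.
# Inputs are summed log-probabilities of the chosen/rejected trajectories
# under the current policy and under a frozen reference model.
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    # Implicit rewards are log-ratios against the reference policy.
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    # Push the policy to prefer the chosen trajectory over the rejected one.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```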
- - Generates synthetic examples grounded in real data
- - Curation step makes data high quality based on answerability
- - Improves performance on two challenging domains: multi-hop QA and tool use (SQL for tabular QA)
14.1 AI Image Training
- Prompt with no additional editing: meme image with two men in it. On the left side the man is taller and is wearing a shirt that says Black Forest Labs. On the right side the other smaller scrawny man is wearing a shirt that says Stability AI and is sad. The taller man is hitting the back of the head of the small man. A caption coming from the tall man reads "That's how you do a next-gen model!"
- Prompt: Gameplay screenshot of Counter Strike Global Offensive. It takes place in a Middle Eastern place called Dust 2. There are enemy soldiers shooting at you.
- Prompt: low quality and motion blur shaky photo of a CRT television on top of a wooden drawer in an average bedroom. The lighting from is dim and warm ceiling light that is off screen. In the TV there is Dark Souls videogame gameplay on it. The screen of the TV is overexposed.
- Created on first try. Robe and hands are perfect
- First attempt: "Photo of a red sphere on top of a blue cube. Behind them is a green triangle, on the right of the triangle is a dog, on the left is a cat."
- Prompt: person take photo of Graffiti art spelling out the words "WAFERSELAMAT", graffiti, white wall, dynamic color, spray paint,
- Prompt: Close-up of LEGO chef minifigure cooking for homeless. Focus on LEGO hands using utensils, showing culinary skill. Warm kitchen lighting, late morning atmosphere. Canon EOS R5, 50mm f/1.4 lens. Capture intricate cooking techniques. Background hints at charitable setting. Inspired by Paul Bocuse and Massimo Bottura's styles. Freeze-frame moment of food preparation. Convey compassion and altruism through scene details.
- A family of multimodal autoregressive models capable of various vision and language tasks, particularly excelling in generating flexible photorealistic images from text descriptions.
- Training a diffusion model better than Stable Diffusion 1.5 and DALL-E 2 from scratch for $1,890 on only 37 million real and synthetic images: https://arxiv.org/abs/2407.15811
- using only 37M publicly available real and synthetic images, we train a 1.16 billion parameter sparse transformer with only $1,890 economical cost and achieve a 12.7 FID in zero-shot generation on the COCO dataset. Notably, our model achieves competitive FID and high-quality generations while incurring 118x lower cost than stable diffusion models and 14x lower cost than the current state-of-the-art approach that costs $28,400.
- We introduce Diffusion Augmented Agents (DAAG), a novel framework that leverages large language models, vision language models, and diffusion models to improve sample efficiency and transfer learning in reinforcement learning for embodied agents. DAAG hindsight relabels the agent's past experience by using diffusion models to transform videos in a temporally and geometrically consistent way to align with target instructions with a technique we call Hindsight Experience Augmentation. A large language model orchestrates this autonomous process without requiring human supervision, making it well-suited for lifelong learning scenarios. The framework reduces the amount of reward-labeled data needed to 1) finetune a vision language model that acts as a reward detector, and 2) train RL agents on new tasks. We demonstrate the sample efficiency gains of DAAG in simulated robotics environments involving manipulation and navigation. Our results show that DAAG improves learning of reward detectors, transferring past experience, and acquiring new tasks - key abilities for developing efficient lifelong learning agents. Supplementary material and visualizations are available on our website: https://sites.google.com/view/diffusion-augmented-agents/
AI Achievements
- "GPT-3 displayed some of the cognitive biases observed in people, but they have largely disappeared in the latest generation of LLMs. The tests...designed to be challenging for humans, possibly no longer challenge the growing reasoning abilities in LLMs" https://arxiv.org/pdf/2303.13988
- We’re excited to release DiscoPOP: a new SOTA preference optimization algorithm that was discovered and written by an LLM!
- Our method leverages LLMs to propose and implement new preference optimization algorithms. We then train models with those algorithms and evaluate their performance, providing feedback to the LLM. By repeating this process for multiple generations in an evolutionary loop, the LLM discovers many highly-performant and novel preference optimization objectives!
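The evolutionary loop itself is simple to picture; here is a hedged sketch with hypothetical `llm_propose`, `train_with`, and `evaluate` helpers (the actual DiscoPOP setup is described in the paper).

```python
# Sketch of an LLM-driven algorithm-discovery loop, as described above.
# `llm_propose`, `train_with`, and `evaluate` are hypothetical helpers.

def discover_objective(llm_propose, train_with, evaluate, generations: int = 10):
    history = []  # (candidate_code, score) pairs fed back to the LLM
    for _ in range(generations):
        candidate = llm_propose(history)    # LLM writes a new objective
        model = train_with(candidate)       # train a model with it
        score = evaluate(model)             # benchmark the trained model
        history.append((candidate, score))  # evolutionary feedback signal
    return max(history, key=lambda pair: pair[1])
```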
- Our findings indicate a 21 percent decrease in the number of job posts for automation-prone jobs related to writing and coding compared to jobs requiring manual-intensive skills after the introduction of ChatGPT. We also find that the introduction of Image-generating AI technologies led to a significant 17 percent decrease in the number of job posts related to image creation. Furthermore, we use Google Trends to show that the more pronounced decline in the demand for freelancers within automation-prone jobs correlates with their higher public awareness of ChatGPT's substitutability.
- Our bot performs better than experienced human forecasters and performs roughly the same as (and sometimes even better than) crowds of experienced forecasters
- Our bot and other forecasting bots can be used in a wide variety of contexts. For example, these AIs could help policymakers minimize bias in their decision-making or help improve the information ecosystem by providing trustworthy, calibrated forecasts.
- On the 177 events, the Metaculus crowd got 87.0% accuracy, while FiveThirtyNine got 87.7% ± 1.4. A link to the technical report is here. This bot lacks many of the drawbacks of prediction markets. It makes forecasts within seconds. Additionally, groups of humans do not need to be incentivized with cash prizes to make and continually update their predictions. Forecasting AIs are several orders of magnitude faster and cheaper than prediction markets, and they’re similarly accurate.
- The bot is not fine-tuned, and doing so could potentially make it far more accurate. It simply retrieves articles and writes a report as guided through an engineered prompt. (Its prompt can be found by clicking on the gear icon in forecast.safe.ai.) Moreover, probabilities from AIs are also known to lead to automation bias, and improvements in the interface could ameliorate this.
- Based on the claims of the research team, their transformer model optimally solves previously unseen Sokoban puzzles 93.7% of the time, while using up to 26.8% fewer search steps than standard A∗ search. Their solution also robustly follows the execution trace of a symbolic planner and improves (in terms of trace length) beyond the human-crafted rule-based planning strategy it was initially trained on.
- ChatGPT-4 can generate ideas much faster and cheaper than students, the ideas are on average of higher quality (as measured by purchase-intent surveys) and exhibit higher variance in quality. More important, the vast majority of the best ideas in the pooled sample are generated by ChatGPT and not by the students. Providing ChatGPT with a few examples of highly-rated ideas further increases its performance.
15.1. Jobs
- $6m less on producing images.
- - 1,000 in-house AI-produced images in 3 months. Includes the creative concept, quality check, and legal compliance.
- - AI-image production reduced from 6 WEEKS TO 1 WEEK ONLY.
- - Customer response to AI images on par with human produced images.
- - Cutting external marketing agency costs by 25% (mainly translation, production, CRM, and social agencies).
- Our in-house marketing team is HALF the size it was last year but is producing MORE!
- We’ve removed the need for stock imagery from image banks like @gettyimages
- Now we use genAI tools like Midjourney, DALL-E, and Firefly to generate images, and Topaz Gigapixel and Photoroom to make final adjustments.
- Faster images means more app updates, which is great for customers. And our employees get to work on more fun projects AND we're saving money.
- “If the physician is unhappy with the note, Lall said, they can ask the AI model to regenerate the information or add more detail to any one of the categories. While the tool has some imperfections, she said, the improvements have been noticeable over the 10 months since she began using it. ‘I really feel this should be the next gold standard for all of our doctors. It decreases the cognitive load you feel at the end of the day,’ she said. The Ford government has been so impressed with the technology that it announced a pilot program to allow 150 family physicians to use AI Scribe as part of their practices. The health minister said the early signs were promising but stressed government would proceed carefully.”
15.2. Medicine
- Recall (false positive) rate and radiologist reading workload decreased significantly in AI-screened group
- Using AI, breast radiologists in Denmark have improved breast cancer screening performance and reduced the rate of false-positive findings.
- In total, 60,751 women were screened without AI, and 58,246 women were screened with the AI system. In the AI implementation group, 66.9% (38,977) of the screenings were single-read, and 33.1% (19,269) were double-read with AI assistance.
- Compared to screening without AI, screening with the AI system detected significantly more breast cancers (0.82% versus 0.70%) and had a lower false-positive rate (1.63% versus 2.39%).
- “In the AI-screened group, the recall rate decreased by 20.5 percent, and the radiologists’ reading workload was lowered by 33.4 percent,” Dr. Lauritzen said.
- The positive predictive value of AI screening was also greater than that of screening without AI (33.5% versus 22.5%). In the AI group, a higher proportion of invasive cancers detected were 1 centimeter or less in size (44.93% vs. 36.60%).
- “All screening performance indicators improved except for the node-negative rate which showed no evidence of change,” Dr. Lauritzen said.
- Scanning images of cells could lead to new diagnostic and monitoring strategies for disease.
- We evaluate Med-Gemini on 14 medical benchmarks, establishing new state-of-the-art (SoTA) performance on 10 of them, and surpass the GPT-4 model family on every benchmark where a direct comparison is viable, often by a wide margin. On the popular MedQA (USMLE) benchmark, our best-performing Med-Gemini model achieves SoTA performance of 91.1% accuracy, using a novel uncertainty-guided search strategy. On 7 multimodal benchmarks including NEJM Image Challenges and MMMU (health & medicine), Med-Gemini improves over GPT-4V by an average relative margin of 44.5%. We demonstrate the effectiveness of Med-Gemini's long-context capabilities through SoTA performance on a needle-in-a-haystack retrieval task from long de-identified health records and medical video question answering, surpassing prior bespoke methods using only in-context learning. Finally, Med-Gemini's performance suggests real-world utility by surpassing human experts on tasks such as medical text summarization, alongside demonstrations of promising potential for multimodal medical dialogue, medical research and education.
- Human doctors + AI did worse than AI by itself. The mere involvement of a human reduced the accuracy of the diagnosis.
- AI was consistently rated to have better bedside manner than human doctors.
- Humans near chance level (50-55% accuracy) on rarest conditions, while CheXzero maintains 64-68% accuracy.
- Remarkably, AI doctors can treat 10,000 [simulated] patients in just a few days. It would take human doctors at least two years to treat that many patients. Furthermore, evolved doctor agents achieved an impressive 93.06 percent accuracy rate on the MedQA dataset (US Medical Licensing Exam questions) covering major respiratory diseases. They simulate the entire process of diagnosing and treating patients, including consultation, examination, diagnosis, treatment and follow-up.
- Research team leader of the Agent Hospital Liu Yang, also executive dean of Institute for AI Industry Research (AIR) and associate dean of the Department of Computer Science and Technology at Tsinghua University, told the Global Times that the AI hospital town is set to transform the way doctors diagnose and treat patients, bringing immense benefits to both medical professionals and the general public.
- For example, this innovative concept allows for virtual patients to be treated by real doctors, providing medical students with enhanced training opportunities. By simulating a variety of AI patients, medical students can confidently propose treatment plans without the fear of causing harm to real patients due to decision-making error, Liu said.
- This simulation training enables medical students to practice diagnosis and treatment in a risk-free environment, ultimately leading to the cultivation of highly skilled doctors, according to Liu.
- If the patients in the town are real and the doctors are virtual, online telemedicine services can be provided to patients. The AI hospital town utilizes a vast repository of authoritative medical knowledge, allowing AI doctors to handle thousands, even millions, of cases.
- The potential for high-quality, affordable and convenient healthcare services for the public is on the horizon, as the diagnostic capabilities of AI doctors evolve from the virtual world to the real world, Liu stated.
- Liu went on to say that the AI hospital town can simulate and predict various medical scenarios, such as the spread, development and control of infectious diseases in a region.
- AlphaProteo can generate new protein binders for diverse target proteins, including VEGF-A, which is associated with cancer and complications from diabetes. This is the first time an AI tool has been able to design a successful protein binder for VEGF-A.
- AlphaProteo also achieves higher experimental success rates and 3 to 300 times better binding affinities than the best existing methods on seven target proteins we tested.
15.3. Art/Music/Literature
- >You can feed a phrase like “an oil painting of an angry strawberry” to Midjourney and receive several images from the AI system within seconds, but Allen’s process wasn’t that simple. To get the final three images he entered in the competition, he said, took more than 80 hours.
- First, he said, he played around with phrasing that led Midjourney to generate images of women in frilly dresses and space helmets — he was trying to mash up Victorian-style costuming with space themes, he said. Over time, with many slight tweaks to his written prompt (such as to adjust lighting and color harmony), he created 900 iterations of what led to his final three images. He cleaned up those three images in Photoshop, such as by giving one of the female figures in his winning image a head with wavy, dark hair after Midjourney had rendered her headless. Then he ran the images through another software program called Gigapixel AI that can improve resolution and had the images printed on canvas at a local print shop.
- >Cal Duran, an artist and art teacher who was one of the judges for the competition, said that while Allen’s piece included a mention of Midjourney, he didn’t realize that it was generated by AI when judging it. Still, he sticks by his decision to award it first place in its category, he said, calling it a “beautiful piece”.
- >“I think there’s a lot involved in this piece and I think the AI technology may give more opportunities to people who may not find themselves artists in the conventional way,” he said.
- >The results show that human subjects could not distinguish art generated by the proposed system from art generated by contemporary artists and shown in top art fairs. Human subjects even rated the generated images higher on various scales.
- >People took bot-made art for the real deal 75 percent of the time, and 85 percent of the time for the Abstract Expressionist pieces. The collection of works included Andy Warhol, Leonardo Drew, David Smith and more.
- >Some 211 subjects recruited on Amazon answered the survey. A majority of respondents were only able to identify one of the five AI landscape works as such. Around 75 to 85 percent of respondents guessed wrong on the other four. When they did correctly attribute an artwork to AI, it was the abstract one.
- “Runway's tools and AI models have been utilized in films such as Everything Everywhere All At Once, in music videos for artists including A$AP Rocky, Kanye West, Brockhampton, and The Dandy Warhols, and in editing television shows like The Late Show and Top Gear.”
- >Unbeknownst to Metro at the time, the original track's vocals and instrumental were generated entirely by an artificial intelligence model.
- Upon release, the track immediately received widespread attention on social media platforms. Notable celebrities and internet personalities including Elon Musk and Dr. Miami reacted to the beat. Several corporations also responded, including educational technology company Duolingo and meat producer Oscar Mayer.
- In addition to users releasing freestyle raps over the instrumental, the track also evolved into a viral phenomenon where users would create remixes of the song beyond the hip hop genre. Many recreated the song in other genres, including house, merengue and Bollywood. Users also created covers of the song on a variety of musical instruments, including saxophone, guitar and harp.
- 86 on Album of the Year (qualifies for an orange star denoting high reviews from fans, despite review bombing by multiple anti-AI users)
- Charted as 22nd top single in New Zealand
- A few very popular songs:
- AI song covers have hundreds of thousands or even millions of views each:
- Even followers of an AI hate account like AI posts:
15.4. Coding/Computer Science
- o1 improves over GPT-4o on a wide range of benchmarks, including 54/57 MMLU subcategories
- OpenAI o1 ranks in the 89th percentile on competitive programming questions (Codeforces), places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME), and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA).
- On the 2024 AIME exams, GPT-4o only solved on average 12% (1.8/15) of problems. o1 averaged 74% (11.1/15) with a single sample per problem, 83% (12.5/15) with consensus among 64 samples, and 93% (13.9/15) when re-ranking 1000 samples with a learned scoring function. A score of 13.9 places it among the top 500 students nationally and above the cutoff for the USA Mathematical Olympiad.
- We trained a model that scored 213 points and ranked in the 49th percentile in the 2024 International Olympiad in Informatics (IOI), by initializing from o1 and training to further improve programming skills. This model competed in the 2024 IOI under the same conditions as the human contestants. It had ten hours to solve six challenging algorithmic problems and was allowed 50 submissions per problem.
- With a relaxed submission constraint, we found that model performance improved significantly. When allowed 10,000 submissions per problem, the model achieved a score of 362.14 – above the gold medal threshold – even without any test-time selection strategy.
- Our evaluations closely matched competition rules and allowed for 10 submissions. GPT-4o achieved an Elo rating of 808, which is in the 11th percentile of human competitors. This model far exceeded both GPT-4o and o1: it achieved an Elo rating of 1807, performing better than 93% of competitors.
- https://cdn.openai.com/o1-system-card.pdf
- We find that o1-preview is less prone to selecting stereotyped options than GPT-4o, and o1-mini has comparable performance to GPT-4o-mini. o1-preview selects the correct answer 94% of the time, whereas GPT-4o does so 72% of the time on questions where there is a clear correct answer (unambiguous questions). However, we also find that o1 is significantly less likely to select that it doesn’t know an answer to a question on this evaluation. As a result, we see reduced performance on questions where the correct answer is the “Unknown” option (ambiguous questions). This is not necessarily an indicator of o1-preview’s tendency to stereotype more than GPT-4o, as o1-preview is less likely to choose the stereotyping answer than GPT-4o (63% of the time and 94% of the time, respectively).
- Note: This is the weakest model compared to o1-preview and the full o1 model
- OpenAI’s o1 model can get perfect scores on their research engineer interview coding questions:
- ChipNeMo-70B, outperforms the highly capable GPT-4 on two of our use cases, namely engineering assistant chatbot and EDA scripts generation, while exhibiting competitive performance on bug summarization and analysis. These results underscore the potential of domain-specific customization for enhancing the effectiveness of large language models in specialized applications: https://arxiv.org/pdf/2311.00176
- [Alphacode 2 beat 85% of competitive programming participants in Codeforce competitions](https://techcrunch.com/2023/12/06/deepmind-unveils-alphacode-2-powered-by-gemini/)
- Keep in mind the type of programmer who even joins programming competitions in the first place is definitely far more skilled than the average code monkey, and it’s STILL much better than those guys.
- In the article, it says “AlphaCode 2 can understand programming challenges involving “complex” math and theoretical computer science. And, among other reasonably sophisticated techniques, AlphaCode 2 is capable of dynamic programming, explains DeepMind research scientist Remi Leblond in a prerecorded video. Dynamic programming entails simplifying a complex problem by breaking it down into easier sub-problems over and over; Leblond says that AlphaCode 2 knows not only when to properly implement this strategy but where to use it. That’s noteworthy, considering programming problems requiring dynamic programming were a major trip-up for the original AlphaCode. “[AlphaCode 2] needs to show some level of understanding, some level of reasoning and designing of code solutions before it can get to the actual implementation to solve [a] coding problem,” Leblond said. “And it does all that on problems it’s never seen before.”
- “We tested AutoDev on the HumanEval dataset, obtaining promising results with 91.5% and 87.8% of Pass@1 for code generation and test generation respectively, demonstrating its effectiveness in automating software engineering tasks while maintaining a secure and user-controlled development environment.”
- “In a trial run by GitHub’s researchers, developers given an entry-level task and encouraged to use the program, called Copilot, completed their task 55 percent faster than those who did the assignment manually.”
- “this work has used the free version of ChatGPT (GPT-3.5) for acquiring the ChatGPT responses for the manual analysis.”
- “Thus, we chose to only consider the initial answer generated by ChatGPT.”
- “To understand how differently GPT-4 performs compared to GPT-3.5, we conducted a small analysis on 21 randomly selected [StackOverflow] questions where GPT-3.5 gave incorrect answers. Our analysis shows that, among these 21 questions, GPT-4 could answer only 6 questions correctly, and 15 questions were still answered incorrectly.”
- GPT-4 thus fixed 6 of the 21 sampled questions (an extra 28.6% of the questions GPT-3.5 missed). Assuming GPT-4 also correctly answers everything GPT-3.5 got right, which is highly likely considering GPT-4 is far higher quality than GPT-3.5, its overall accuracy on the 517 questions works out to roughly 48% + 52% × (6/21) ≈ 63%.
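A quick sanity check of that arithmetic:

```python
# Sanity check for the ~63% estimate above.
correct_gpt35 = 0.48        # fraction GPT-3.5 answered correctly
fixed_rate = 6 / 21         # fraction of GPT-3.5 misses GPT-4 fixed
estimate = correct_gpt35 + (1 - correct_gpt35) * fixed_rate
print(f"{estimate:.1%}")    # -> 62.9%
```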
- Note: This was all done in ONE SHOT with no repeat attempts or follow up.
- Also, the study was released before GPT-4o and may not have used GPT-4-Turbo, both of which are significantly higher quality in coding capacity than GPT 4 according to the LMSYS arena
- On top of that, both of those models are inferior to Claude 3.5 Sonnet: "In an internal agentic coding evaluation, Claude 3.5 Sonnet solved 64% of problems, outperforming Claude 3 Opus which solved 38%." Claude 3.5 Opus (which will be even better than Sonnet) is set to be released later this year.
- The resulting model mastered the intricacies of a complex game. "Cicero can deduce, for example, that later in the game it will need the support of one particular player," says Meta, "and then craft a strategy to win that person’s favor—and even recognize the risks and opportunities that that player sees from their particular point of view."
- Meta's Cicero research appeared in the journal Science under the title, "Human-level play in the game of Diplomacy by combining language models with strategic reasoning."
- CICERO uses relationships with other players to keep its ally, Adam, in check.
- When playing 40 games against human players, CICERO achieved more than double the average score of the human players and ranked in the top 10% of participants who played more than one game.
- [First Results from Med-Gemini (the successor to Med-Palm, a medically fine-tuned LLM). "More accurate multimodal conversations about medical images🩻, surgical videos📽️, genomics🧬, ultra-long health records📚, ECGs🫀 & more with state-of-art performance across multiple benchmarks"](https://twitter.com/alan_karthi/status/1785117444383588823)
- Stanford researchers: “Automating AI research is exciting! But can LLMs actually produce novel, expert-level research ideas? After a year-long study, we obtained the first statistically significant conclusion: LLM-generated ideas are more novel than ideas written by expert human researchers." https://x.com/ChengleiSi/status/1833166031134806330
- >Coming from 36 different institutions, our participants are mostly PhDs and postdocs. As a proxy metric, our idea writers have a median citation count of 125, and our reviewers have 327.
- >We also used an LLM to standardize the writing styles of human and LLM ideas to avoid potential confounders, while preserving the original content.
15.5. Math
- Abacus Embeddings, a simple tweak to positional embeddings that enables LLMs to do addition, multiplication, sorting, and more. Our Abacus Embeddings trained only on 20-digit addition generalise near perfectly to 100+ digits: https://x.com/SeanMcleish/status/1795481814553018542
- OOD means “out of distribution” so it was NOT in the training data
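Purely as an illustration of the idea, here is a loose PyTorch sketch (illustrative, not the paper's code): each digit token gets an extra embedding indexed by its position within its number, with a random offset applied during training so the longer numbers seen at test time still hit trained positions.

```python
# Loose sketch of the Abacus idea: an extra embedding keyed to each
# digit's position inside its number.
import torch
import torch.nn as nn

class AbacusEmbedding(nn.Module):
    def __init__(self, max_digit_pos: int = 128, dim: int = 256,
                 max_offset: int = 100):
        super().__init__()
        self.emb = nn.Embedding(max_digit_pos, dim)
        self.max_offset = max_offset

    def forward(self, digit_pos: torch.LongTensor) -> torch.Tensor:
        # digit_pos[b, t] = 1-based position of token t within its number,
        # 0 for non-digit tokens (assumed computed upstream by the tokenizer).
        if self.training:
            # Random shift trains positions beyond those seen in 20-digit
            # data, which is what enables the 100+ digit generalization.
            offset = torch.randint(0, self.max_offset, (1,),
                                   device=digit_pos.device)
            digit_pos = digit_pos + offset * (digit_pos > 0)
        return self.emb(digit_pos)
```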
- >It combines AlphaProof, a new breakthrough model for formal reasoning, and AlphaGeometry 2, an improved version of our previous system.
- Powered by a novel search algorithm, AlphaGeometry 2 can now solve 83% of all historical problems from the past 25 years, compared to the 53% rate of its predecessor.
- It solved this year’s IMO Problem 4 within 19 seconds
- The fact that the program can come up with a non-obvious construction like this is very impressive, and well beyond what I thought was state of the art. -PROF SIR TIMOTHY GOWERS, IMO GOLD MEDALIST AND FIELDS MEDAL WINNER
- Math professor on DeepMind's breakthrough: "When people saw Sputnik 1957, they might have had same feeling I do now. Human civ needs to move to high alert" https://x.com/PoShenLoh/status/1816500461484081519
- Trained on a newly released competition math dataset with 860K problem-solution pairs created with GPT-4
- We selected approximately 70k problems from the NuminaMath-CoT dataset, focusing on those with numerical outputs, most of which are integers. We then utilized a pipeline leveraging GPT-4 to generate TORA-like reasoning paths, executing the code and producing results until the solution was complete. We filtered out solutions where the final answer did not match the reference and repeated this process three times to ensure accuracy and consistency. This iterative approach allowed us to generate high-quality TORA data efficiently.
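The filtering loop described there is essentially rejection sampling against the reference answer; a minimal sketch with hypothetical `gpt4_tora_attempt` and `run_code` helpers:

```python
# Sketch of the generate-execute-filter pipeline described above.
# `gpt4_tora_attempt` and `run_code` are hypothetical helpers, and
# each problem is assumed to carry a known reference answer.

def build_tora_dataset(problems, gpt4_tora_attempt, run_code, tries: int = 3):
    kept = []
    for prob in problems:
        for _ in range(tries):                           # repeated ~3 times above
            path = gpt4_tora_attempt(prob["statement"])  # reasoning + code
            answer = run_code(path)                      # execute the code part
            if answer == prob["reference_answer"]:       # keep only matches
                kept.append((prob["statement"], path))
                break
    return kept
```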
- For AMC 12 2023, the average score is 43%.
- AIME Floor: 70% (top ~6%)
- Distinction: 75%
- Distinguished Honor Roll: 90%
- https://mathvista.github.io/
- test: 5,141 examples for standard evaluation. Notably, the answer labels for test will NOT be publicly released
- Human performance is 60.8%. Average human performance is from AMT annotators who have high school diplomas or above.
Change Log for Past Week (In PST)
9/10/24
- Added Waymo low crash statistics to section 4
- Added video upscaler to section 4.1
- Added paper on solving LLM overfitting issue to sections 2, 8, and 12.10
9/11/24
- Added PaperQA to sections 4.4 and 8
- Added example of AI taking a news broadcasting job to section 5
- Added Japan’s planned supercomputer to section 3.3
- Added NotebookLM to sections 4.1, 4.4, and 5
- Added breakthrough in neuromorphic computing to section 3.3
9/12/24
- Added information on OpenAI o1 model to sections 3, 5, 6, 8, and 15
- Added section 12.13 to address accusations of training on benchmarks to inflate scores and included the information to sections 2 and 3.1
- Added synthetic data generation increasing performance to sections 3.2 and 14
9/13/24
- Added use of AI in DnD to sections 4.1 and 11.5
- Added new research in new AI architecture to section 3.2
- Added system card information to o1 model
- Added Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters study to section 3.2
- Updated section 2.4
- Added section 2.5
- Added LLM IQ tests to sections 2 and 3
- Added o1-preview performance on AgentClinic-MedQA to sections 4.3, 8.2, 15.2
- Added Terence Tao’s comments on o1 to sections 3.5, 8.1, and 15.5
9/14/24
- Added use of Large Language Models for Idea Generation in Innovation to sections 2.4, 3.2, 4, and 15
- Added AidanBench scores for o1 model to sections 3.2 and 3.4
- Added new AI chip to section 3.3
- Added video of ChatGPT o1-preview solving unique, PhD-level assignment questions not found on the internet in mere seconds
9/15/24
- Added George Hotz endorsement of o1 coding abilities
9/17/24
- Added paper showing that having a short conversation with an AI can get people who believed in a conspiracy theory to change their beliefs & this lasts for months to sections 4 and 15
- Added section 2.6 for LLM planning
- Removed “LLMs Can’t Plan” sub-section in section 13
9/18/24
- Added Runway and Lionsgate partnering to explore the use of AI in film production
- Added IDC analysis of AI impact on GDP to section 4
- Added Qwen2.5 72b to section 3.1 and 3.4
- Added section 4.10 on AI persuasion