Smarter summaries with finetuning GPT-3.5 and chain of density (jxnl.github.io)
234 points by ivanleomk on Nov 13, 2023 | hide | past | favorite | 39 comments


One of the fun parts of AI is finding out that abstractive summarization is "easy", but extractive summarization (which is what humans do far more often in practice) is still very hard. Partly because most datasets assume sentence level extractive summarization, which is often not how humans summarize documents.

There's still tons of very low-hanging fruit in summarization work. I'm not aware of significant follow-up work to pointer networks besides pointer-generator networks, which these days are considered old news. Pointer-based architectures are the ideal system for word-level extractive summarizers, yet the very best extractive summarization systems today are usually nothing more than sentence selectors using some kind of embedding and cosine similarity.
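As a concrete (if toy) sketch of that sentence-selector approach, here's a minimal extractive summarizer that uses bag-of-words count vectors in place of learned embeddings and scores each sentence by cosine similarity to the document centroid:

```python
import math
import re
from collections import Counter

def sentence_vectors(sentences):
    """Bag-of-words count vectors, one per sentence."""
    return [Counter(re.findall(r"\w+", s.lower())) for s in sentences]

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def extractive_summary(text, k=2):
    """Select the k sentences most similar to the document centroid."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    vecs = sentence_vectors(sentences)
    centroid = Counter()
    for v in vecs:
        centroid.update(v)
    ranked = sorted(range(len(sentences)),
                    key=lambda i: cosine(vecs[i], centroid),
                    reverse=True)
    chosen = sorted(ranked[:k])  # restore original sentence order
    return [sentences[i] for i in chosen]
```

A real system would swap the count vectors for proper sentence embeddings, but the selection logic is the same.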

Happy to see such success with abstractive summaries, but the kind that I and most other humans are interested in is still far from solved.


Could you point me to more reading on extractive summarisation? A lot of what I see feels out of date compared to what should be possible now with LLMs.


You don’t need an LLM for extractive summarization. It’s pulling out the most meaningful sentences from the article. Not sure what the parent meant.


Yes, and within that there are variations: a large text with chapters, without chapters, conversational/meeting records from Whisper, etc., and each needs a different approach to the problem.


Am I reading it right that they fine-tune a model using 20 examples and 5 epochs? That seems really odd to me.


LLMs are few-shot learners; that's why many people put examples into the prompt. This is the next step.


I don’t believe few shot performance dictates how quickly you can fine-tune.

Most fine-tunes will have much larger datasets (I am under the impression you want tens of thousands of examples for most runs).

So I’m similarly impressed 20 examples would make such a big difference.

But also note entity density decreases as example count increases. This is counterintuitive — maybe something else is going on here?
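For reference, a fine-tuning run like the one discussed would feed OpenAI's endpoint a chat-format JSONL file, one record per example; the article/summary pairs below are placeholders, not the article's actual data:

```python
import json

# Hypothetical (article, summary) pairs; the run in question used ~20 of these.
examples = [
    ("Article text one ...", "Dense summary one ..."),
    ("Article text two ...", "Dense summary two ..."),
]

def to_chat_record(article, summary):
    """One JSONL record in OpenAI's chat fine-tuning format."""
    return {
        "messages": [
            {"role": "system",
             "content": "Write a concise, entity-dense summary of the article."},
            {"role": "user", "content": article},
            {"role": "assistant", "content": summary},
        ]
    }

# Write one JSON object per line, as the fine-tuning API expects.
with open("train.jsonl", "w") as f:
    for article, summary in examples:
        f.write(json.dumps(to_chat_record(article, summary)) + "\n")
```

The epoch count and other hyperparameters are set when the job is created, not in the file itself.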


Usually higher-parameter models do better with less training data. That's separate from being few-shot learners, but related in other ways.


https://github.com/huggingface/setfit gets good fine-tuned scores on some downstream tasks with just 8 labeled examples.


Can't overfit when your learning rate is zero! insert smart thinking meme


Here's a good way to identify how entity-dense your text is: https://demo.nl.diffbot.com/
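If you just want a rough offline number, a crude proxy (nowhere near a real NER pipeline like the linked demo) is to count capitalized tokens that don't start a sentence and divide by the total token count:

```python
import re

def entity_density(text):
    """Crude entity-density proxy: capitalized tokens that do not start a
    sentence, divided by total word count. A real NER model is far better."""
    tokens = re.findall(r"[A-Za-z]+|[.!?]", text)
    entities = 0
    total = 0
    sentence_start = True
    for tok in tokens:
        if tok in ".!?":
            sentence_start = True
            continue
        total += 1
        if tok[0].isupper() and not sentence_start:
            entities += 1
        sentence_start = False
    return entities / total if total else 0.0
```

This misses lowercase entities and miscounts mid-sentence capitalization like "I", so treat it strictly as a ballpark figure.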


Nice work! Generating good example data is the most important part of fine-tuning.

IMO summarization is also a fairly simple task -- I wouldn't be surprised if a fine-tuned open source model (e.g. Llama 13B / Mistral 7B) would get to similar performance.


I find that bart-large (410M parameters) [0] does a fine job at summarizing. In Summer AI I alternate between a copy of that bart-large getting hyper-trained on feedback and GPT-3.5, and honestly I don't have a preference between the results.

However, thanks to this article I might revisit the summarization techniques used and try a fine-tuned 3.5.

It would be great to see these techniques compared to GPT-4 Turbo.

[0]: https://huggingface.co/facebook/bart-large-cnn


For sure! The one thing I was surprised by was how little data GPT-3.5 needed. Would love for a company to test how the scaling laws work for those smaller models.


Summarization is a fundamental capability these models develop early on. Remember that one of the things that impressed people about GPT-2 was the discovery that you could prompt it for summaries just by appending "tl;dr" to some text?

The issue was always quality, length of summarized text, and control of exactly what kind of summary.


Those repeated calls sound like a good way to rack up a bill and incur a high latency.


Right, which is why fine-tuning on the last iteration is a great save: it drops the repeated calls but preserves quality.


Has anyone fine-tuned GPT-3.5 or Llama, etc., using their private data? What is the best practice for generating training data?

One way I have heard of is to send a chunk of data to GPT-4 and ask for questions to be generated. Unsure of other ways. What has worked well?


If it's a small amount of data, it seems RAG pipelines are better. That's all I think I know.


Here is an example of how to generate synthetic data that you can adapt to your case: https://dzlab.github.io/2023/09/22/palm-synthetic-data/
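The chunk-and-ask approach mentioned above might be sketched like this; `ask_llm` is a hypothetical stub standing in for whatever GPT-4 (or other) API call you actually use:

```python
def chunk_text(text, size=1000, overlap=100):
    """Split a document into overlapping character chunks."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

def ask_llm(prompt):
    # Hypothetical stub: replace with a real GPT-4 (or other model) API call.
    return "1. What does the chunk say?\n2. Why does it matter?"

def generate_qa_pairs(document):
    """For each chunk, ask a stronger model to propose training questions."""
    pairs = []
    for chunk in chunk_text(document):
        questions = ask_llm(
            f"Generate 2 questions answerable from this text:\n\n{chunk}")
        pairs.append({"context": chunk, "questions": questions})
    return pairs
```

The answers to the generated questions (from the same strong model, or from humans) then become the assistant turns in your fine-tuning set.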


Is this proven to work? ML models are usually trained to learn a model of the environment by giving them environment data. I would have expected feeding it model outputs just trains it to learn a model of the model creating the data.

Without seeing some kind of demonstration otherwise, my feeling is that it would be like regressing stock price on inflation, then trying to generate more data using the regression model and random inflation numbers. All you'd learn is the model that you put in to generate the data.


I'd think of it less like teaching the model something new, and more like enforcing a behavior the model can already output. Any decent raw model can output function names and parameters with prompt engineering. To do function calling, you need the model to output function names reliably for a wide variety of prompts. That's where the fine-tuning comes in.


I could very easily believe that if I saw proof, but it just feels a bit wrong to train a model on model outputs.

Even in the main article here, the model did better with fewer fine tuned examples. To us, the auto-generated examples might look different enough and might look good enough, but they were all generated algorithmically. Feeding more examples in might easily be leading it to focus on some artifact of the embeddings or generating model that we just don't perceive.


> it just feels a bit wrong to train a model on model outputs

If you have a small student model and a large teacher, it makes sense; the student is better off after this distillation.

If you have a way to filter out low quality synthetic examples then it would be useful to generate a bunch more and take the best.

If your LLM is an agent, then it can generate feedback signals from the environment. Even a human-AI chat is a form of environment for the model. Every human response can be evaluated as positive or negative reward.

More fundamentally, organic datasets are very unbalanced: LLMs need more complex reasoning chains than what is usually available. There are some exceptions - in scientific papers, manuals, and code you get very complex reasoning chains - but not in general. This issue can be fixed with synthetic data.

And even in principle, if you have a model at level N and want to make a dataset at level N+1, then you need to boost your model. You can give it more tokens, more attempts or more tools.
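The "generate a bunch more and take the best" idea is essentially best-of-n sampling with a quality filter; in this sketch, `generate` and `score` are hypothetical stand-ins for the model and the reward model / heuristic filter:

```python
import random

def generate(prompt):
    # Hypothetical stand-in for sampling one completion from the model.
    return f"candidate-{random.randint(0, 999)}"

def score(candidate):
    # Hypothetical quality filter: a reward model, a heuristic, or a human.
    return random.random()

def best_of_n(prompt, n=8, threshold=0.5):
    """Sample n candidates, drop low-quality ones, keep the best survivor."""
    candidates = [generate(prompt) for _ in range(n)]
    survivors = [(score(c), c) for c in candidates]
    survivors = [(s, c) for s, c in survivors if s >= threshold]
    return max(survivors)[1] if survivors else None
```

Only the surviving, highest-scoring outputs would make it into the synthetic training set.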


There's a whole literature on distillation and student-teacher networks.


As far as I can tell, the original chain of density paper doesn’t iteratively prompt. The steps of the chain are done in the generated text of a single prompt.


The prompt:

  Article: {{ARTICLE}}

  You will generate increasingly concise, entity-dense summaries of the above Article.

  Repeat the following 2 steps 5 times.

  Step 1. Identify 1-3 informative Entities (";" delimited) from the Article which are missing from the previously generated summary. 

  Step 2. Write a new, denser summary of identical length which covers every entity and detail from the previous summary plus the Missing Entities.
Although this appears to refer to a "previous" summary from a previous run, which would suggest it is run anew for each iteration, in fact it generates an "Initial Summary:" on the first run, then a Step 1 and Step 2.

For me though, it stops there. I can, however, simply say:

  repeat again
And it does another Step 1 and Step 2, and stops.

I will note that, on an article such as ...

https://www.reuters.com/technology/cybersecurity/payments-ap...

... the summary is indeed increasingly superior with each iteration. By the third iteration (the fourth summary), it did seem ideal. The fourth iteration and fifth summary added entities that feel extraneous.


The prompt in the original article says:

> Answer in JSON. The JSON should be a list (length 5) of dictionaries whose keys are “Missing_Entities” and “Denser_Summary”

Also, I think it doesn't make sense to write in the prompt for GPT to iterate if it is not doing the iteration. There is no templating of the step number or recursive summary injection in the sample prompt either.


Absolutely correct, I glossed over the continuation. So, for the record:

  Article: {{ARTICLE}}

  You will generate increasingly concise, entity-dense summaries of the above Article.

  Repeat the following 2 steps 5 times.

  Step 1. Identify 1-3 informative Entities (";" delimited) from the Article which are missing from the previously generated summary.

  Step 2. Write a new, denser summary of identical length which covers every entity and detail from the previous summary plus the Missing Entities.

  A Missing Entity is:
  - Relevant: to the main story.
  - Specific: descriptive yet concise (5 words or fewer).
  - Novel: not in the previous summary.
  - Faithful: present in the Article.
  - Anywhere: located anywhere in the Article.

  Guidelines:

  - The first summary should be long (4-5 sentences, ~80 words) yet highly non-specific, containing little information beyond the entities marked as missing. Use overly verbose language and fillers (e.g., "this article discusses") to reach ~80 words.

  - Make every word count: re-write the previous summary to improve flow and make space for additional entities.
  - Make space with fusion, compression, and removal of uninformative phrases like "the article discusses".
  - The summaries should become highly dense and concise yet self-contained, e.g., easily understood without the Article.
  - Missing entities can appear anywhere in the new summary.
  - Never drop entities from the previous summary. If space cannot be made, add fewer new entities.

  Remember, use the exact same number of words for each summary.

  Answer in JSON. The JSON should be a list (length 5) of dictionaries whose keys are "Missing_Entities" and "Denser_Summary".

And yes, the answer as a JSON list length 5 causes 5 summaries to get spit out!

However, it's not fully clear to me that it's considering the prior summaries in a good/useful way when producing the later ones. The expressly iterated results I get are superior to the inline list of results.

"More research is needed." -- https://www.explainxkcd.com/wiki/index.php/2268:_Further_Res...


With causal masking and autoregressive token generation, it’s not clear to me that it is inherently different.

My original expectation was the same as the way the instructor software implemented it. But I found the prompt in the article confusing from that perspective. I'm sure it can work either way, but it should be a lot more performant (and less expensive) as a single pass.
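For comparison, the iterated variant amounts to one model call per densification step, feeding the previous summary back in each time; `call_llm` here is a hypothetical stub, not the article's actual code:

```python
def call_llm(prompt):
    # Hypothetical stub for a chat-completion API call.
    return "A denser summary mentioning more entities."

def chain_of_density(article, steps=5):
    """One model call per step, injecting the previous summary each time."""
    summary = call_llm(f"Write an initial ~80-word summary:\n\n{article}")
    history = [summary]
    for _ in range(steps - 1):
        summary = call_llm(
            "Rewrite this summary at identical length, adding 1-3 entities "
            f"from the article that it is missing.\n\nArticle:\n{article}\n\n"
            f"Previous summary:\n{summary}")
        history.append(summary)
    return history
```

This makes 5 calls instead of 1, which is exactly the latency/cost trade-off being debated here.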


I can't get it to respect this instruction in single pass mode:

  - Never drop entities from the previous summary. If space cannot be made, add fewer new entities.
Specifically, for me it randomly drops entities.

> a lot more performant

Faster? Absolutely. But I'm not having luck getting it smarter.
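One way to catch those drops automatically: since the JSON answer lists `Missing_Entities` per step, you can check that every entity introduced so far still appears in each later summary. A rough sketch (exact string matching, so paraphrased entities will show up as false positives):

```python
def dropped_entities(steps):
    """steps: list of {"Missing_Entities": str, "Denser_Summary": str} dicts,
    as in the chain-of-density JSON answer. Returns (step_index, entity)
    pairs where a summary failed to carry an earlier entity forward."""
    violations = []
    seen = []
    for i, step in enumerate(steps):
        seen += [e.strip() for e in step["Missing_Entities"].split(";")
                 if e.strip()]
        summary = step["Denser_Summary"].lower()
        for entity in seen:
            if entity.lower() not in summary:
                violations.append((i, entity))
    return violations
```

Running this over single-pass outputs would quantify how often the "never drop entities" instruction is actually violated.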


Gotta admit I spent some time thinking this was a new technique called 'chain of destiny' and was reading through the article trying to work out what kind of fate-based prompt engineering was happening.



It's a forgotten Wolfenstein sequel!


Did the exact same thing :)


Minor correction: the article describes Chain of Density as "First introduced by Salesforce's AI Research wing" -- however the 1st author (who is a PhD student) and senior author are both at Columbia; only one of the 5 authors is at Salesforce.


Prepared to see all these companies "invent" these techniques. FWIW, people believe OpenAI "invented" ChatGPT, whereas the inventors of the transformer model worked at a competing company (Google Brain) during the research and have since founded competing companies of their own.


The novelty of ChatGPT was instruction tuning of transformers using reinforcement learning from human feedback, plus finding the right dataset and annotations for it. Before this, transformers were good for some tasks but not so good at generating text. Even though OpenAI didn't invent transformers, they did invent the technique needed to make ChatGPT possible.


I'll fix this now!



