It's not clear to me that verbatim would be the only issue. It might produce lines that are similar, but not identical.
The underlying question is whether the output is a derivative work of the training set. Sidestepping similar issues is why GCC and LLVM carry compiler/runtime exemptions alongside their respective licenses.
If simple snippet similarity is enough to trigger a GPL copyright claim, I think that goes too far. It seems the GPL has become an obstacle to invention. I learned to run away when I see it.
It's not limited to similar or identical code. The issue applies to anything 'derived' from copyrighted code. The issue is simply most visible with similar or identical code.
If you have code from an independent origin, this issue doesn't apply. That's how clean room designs bypass copyright. Similarly if the upstream code waives its copyright in certain types of derived works (compiler/runtime exemptions), it doesn't apply.
So if you work on an open source project and learn some techniques from it, and then in your day job you use a similar technique, is that a copyright violation?
Basically does reading GPL code pollute your brain and make it impossible to work for pay later?
If so you should only ever read BSD code, not GPL.
> Basically does reading GPL code pollute your brain and make it impossible to work for pay later?
It seems to me that some people believe it does. Some "clean room" projects specifically instructed developers not to even look at GPL code. I don't have specific examples at hand.
Microsoft appears to believe this (or maybe just the MacBU), because I've met employees who tell me they're not allowed to read any public code, including Stack Overflow answers.
If that's the case, then GPL code should not have been used in the training set. OpenAI should have learned to run away when they saw it. The GPL is purposely designed to protect user freedom (it does not care about any special developer freedom), which is its biggest advantage.