Hacker News | ramgorur's comments

I always enjoyed his dribbling.

https://www.youtube.com/watch?v=JLVmBJEzXhk


There are many interesting ideas about how to come up with a "fair" online dating market.

But I guess everyone is completely disregarding a basic fact of human nature: attraction is mostly based on "looks/facial aesthetics":

https://www.reddit.com/r/BlackPillScience/

In a free/unregulated sexual market, women are the selectors, not men. Women are biologically hardwired to select good-looking males (irrespective of their money, personality, or social status). Therefore ugly people don't get matches. Very simple.

No matter how many ways you try to engineer/fix the system, it's not going to work. Because the existing system is already doing what it is supposed to do.


I don't think that's universally true. Being attractive certainly can make good impressions easier (and I've known insufferable people who coast by on good looks too), but I don't think it's the be-all and end-all of romance.

I've known women who have broken up with or turned down attractive men because they didn't enjoy talking with them, or had different political views, or wanted to have (or not have) children. Conversely, I've known men who didn't exactly look like movie stars end up in relationships with nice-looking women. That's not even considering same-sex attraction. Heck, I've turned people down to date less physically attractive people in the past too.


In the online dating market of 2019, looks are all that matter, for these reasons:

1. Online dating mostly depends on looks; it's based on your pictures. Almost no woman reads your bio, simply because people don't have time (this is also why newer apps like Tinder and Bumble don't even give you enough space to write about yourself; in the end it doesn't matter).

2. If a woman does not find an attractive man "interesting" after the first couple of dates or so, she can easily find another attractive man with an "interesting" personality. She does not have to settle for an ugly male with an interesting personality, because women (even average/non-attractive ones) have an extremely large dating pool, while men don't. For example:

https://forum.bodybuilding.com/showthread.php?t=168948903&pa...

3. Almost no woman picks a mate based on political views. I have seen die-hard leftist women dating neo-Nazi guys.

https://mtonews.com/.image/t_share/MTUzODEzNzk4OTYyMDc5NDg2/...

Same-sex attraction (especially between men) is a completely different ballgame. The mate-choice strategy of women is completely different from that of men.


I am going to hit 37 this year, and I am still a student doing a PhD (as an international student). This kind of story scares me.


For those who are wondering why Indian languages are a subgroup of the Indo-European languages, this chart might be helpful. It shows some of the many words that Sanskrit and the Slavic languages share.

https://www.rbth.com/blogs/2014/11/01/sanskrit_and_russian_a...


Not all Indian languages are in that subgroup. There is another family, the Dravidian languages.


Why do all those ThinkPad BIOS logos feature anime girls? Also, I am seeing one with that "Hitler and gas" meme. LMAO.


I am actually quite surprised to see this kind of article on HN. I am even more surprised that it's getting upvotes.

Please get a copy of Russell and Norvig's AI textbook and read at least some of it before clicking that link.


Interestingly, the style of art and writing is very similar to that of the Hokusai Manga, which is considered one of the predecessors of today's manga and anime.

https://en.wikipedia.org/wiki/History_of_manga


Why are these sightings never accompanied by sonic booms? If something is moving that fast through the atmosphere, there must be a sonic boom. Or is there some way to suppress it, at least theoretically?


I do remember reading about an airframe designed to produce two sonic booms that cancelled each other out.


NASA is actually working on this, if I remember correctly. They want it to be a "low boom"; apparently it's still a work in progress.

https://www.washingtontimes.com/news/2018/nov/6/nasa-quiet-s...


Yeah, looks like it's possible to some extent.

https://physics.stackexchange.com/questions/270969/is-it-possible-to-noise-cancel-a-sonic-boom


Separating a pocket of space-time from itself, then moving that, which would also affect inertia. I suspect that moving space-time around itself would introduce totally different shearing effects, but maybe those could be mitigated using special non-physical (or physical) geometry, like aerodynamics but for space-time.


I did not understand the paper very well.

1. It's theoretically impossible to guarantee a convergence to global optima using gradient descent if the function is non-convex.

2. The only way to guarantee it is to start the gradient descent from different points in the search space, or to try different step sizes if the algorithm always starts from the same point.

3. Also, does "achieving zero training loss" mean the network has converged to the global optimum? I used to think you could get zero training loss at a local minimum as well.

Please correct me if I am wrong.
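To illustrate the multi-start heuristic mentioned in point 2, here is a minimal Python sketch on a toy non-convex function. The function, step size, and grid of start points are all my own choices, not anything from the paper:

```python
# Illustrative sketch of the multi-start idea on a toy non-convex function.
# The function, step size, and grid of start points are all arbitrary choices.
def f(x):
    return x**4 - 3 * x**2 + x  # two basins; the global minimum is near x = -1.30

def grad(x):
    return 4 * x**3 - 6 * x + 1

def gd(x0, step=0.01, iters=500):
    x = x0
    for _ in range(iters):
        x -= step * grad(x)
    return x

starts = [-2.0 + 0.5 * i for i in range(9)]  # evenly spaced starts over [-2, 2]
best_x = min((gd(x0) for x0 in starts), key=f)
best_f = f(best_x)
```

A single start in the wrong basin converges to the shallower minimum near x ≈ 1.11; the best result over all restarts reaches the global minimum f ≈ -3.51 near x ≈ -1.30.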


1. > It's theoretically impossible to guarantee a convergence to global optima using gradient descent if the function is non-convex.

This is false. See, e.g., [0][1].

2. I'm not really sure what the question is here.

3. If your loss is bounded from below (it is a square norm) by 0 and you achieve 0 loss, this means that 0 is a global optimum, since, by definition, no other objective value can be smaller than this number.

---

[0] Theorem A.2 in Udell's Generalized Low-Rank models paper https://arxiv.org/pdf/1410.0342.pdf

[1] B&V Convex Optimization (https://web.stanford.edu/~boyd/cvxbook/), Appendix B.1. In fact, I can't find the reference right now, but you can easily prove that GD with an appropriate step-size converges to a global optimum on this problem when initialized at (0,0), even though the problem is non-convex.


that reference you can't find right now seems rather pertinent?

I think the OP intended and should have written:

"It's theoretically impossible to guarantee a convergence to global optima using gradient descent for an arbitrary non-convex function."

For example, consider the function f(x) = sin^2(pi * x) + sin^2(pi * N/x). This function has multiple global minima at the divisors of N, where f(x) == 0; if x or N/x is non-integer, f is guaranteed to be positive...

I am not taking a stance on whether gradient descent does or does not guarantee finding global minima (and is thus able to factorize cryptographic-grade RSA products of primes), but the claim does appear to imply it.

Edit: the multiplication symbols turned some of the text into italics.
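Here is a small numerical sketch of the function above with N = 15 (divisors 3 and 5). The start point, step size, and iteration count are arbitrary choices of mine; this only illustrates the landscape, not a claim about what GD does in general:

```python
import math

# The landscape of the function above with N = 15 (divisors 3 and 5).
# The start point, step size, and iteration count are arbitrary choices.
N = 15

def f(x):
    return math.sin(math.pi * x) ** 2 + math.sin(math.pi * N / x) ** 2

def grad(x):
    # d/dx sin^2(pi*x) = pi*sin(2*pi*x); the second term uses the chain rule.
    return (math.pi * math.sin(2 * math.pi * x)
            - (math.pi * N / x**2) * math.sin(2 * math.pi * N / x))

loss_at_divisor = f(3.0)  # essentially 0: a global minimum
loss_elsewhere = f(4.5)   # strictly positive

# Plain gradient descent from a non-divisor start point; whether it reaches
# a divisor depends entirely on which basin the start falls in.
x = 3.4
for _ in range(2000):
    x -= 0.01 * grad(x)
final_loss = f(x)
```

The zeros sit exactly at the divisors, and a plain GD run from a generic start point may settle in one of the many positive-valued local minima instead.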


> that reference you can't find right now seems rather pertinent?

Here it is: https://arxiv.org/pdf/1707.08706.pdf (This isn't quite the one I was thinking of, so I'll dig a little deeper, but it covers the idea).

Some slightly more technical conditions have to hold in order to have vanilla GD work (since the function is non-differentiable at points), but a (very!) slightly generalized variant for convex functions—sub-gradient descent—works.

> for an arbitrary non-convex function.

Sure, but this is also obvious since it is NP-hard to reach global optima in arbitrary non-convex problems. Additionally, specifically on the case of GD, I can give simple examples that always fail (consider f(x) = 1 everywhere except f(0) = 0. GD always fails whenever the initial point, x_0 ≠ 0, since the gradient is zero everywhere, except at one point. Additionally, picking initializations randomly, we reach the global minimum with probability 0 whenever we have support with non-empty interior).
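The zero-gradient construction described above can be checked in a few lines of Python (the start point and step size are arbitrary):

```python
# f(x) = 1 everywhere except f(0) = 0; the gradient is 0 wherever it exists,
# so gradient descent never moves from any start x0 != 0.
def f(x):
    return 0.0 if x == 0 else 1.0

def grad(x):
    return 0.0  # derivative of the flat region (undefined only at x = 0)

x = 0.37  # arbitrary nonzero start
for _ in range(1000):
    x -= 0.1 * grad(x)  # every update is a no-op

stuck_value = f(x)  # still 1.0, never reaches the global minimum f(0) = 0
```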

I'm afraid I disagree that this is what the OP intended, though, and I also disagree that the paper's claim implies what you've said, since they only study a very specific subproblem (e.g. minimization of empirical loss on ResNets).

The relative "ease" of this task vs solving arbitrary NP-hard problems is not difficult to believe, since, given a bunch of training examples, I can always generate a resnet that fits those examples perfectly (i.e. with zero loss) in poly-space in a very dumb way: first, generate a circuit that matches the look-up table of the training samples (which is poly-space in the number of samples and can be done in poly-time), then map that circuit to an NN.
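As an illustration of that construction, here is a hedged sketch in pure Python (my own toy version, not the paper's): a one-hidden-layer ReLU network that memorizes a 1-D training set exactly, i.e. reaches zero loss on the samples, by placing one narrow "tent" bump per training point. The bump width `eps` is my choice; it just has to be smaller than the minimum gap between distinct training inputs:

```python
def relu(z):
    return max(0.0, z)

def make_memorizer(xs, ys, eps=0.25):
    # One narrow "tent" bump per sample: tent_i(x) = relu(1 - |x - x_i| / eps),
    # which equals 1 at x_i and 0 at every other sample (if all gaps exceed eps).
    # |x - x_i| is built from two ReLUs, so everything stays one hidden ReLU layer.
    def net(x):
        return sum(y * relu(1.0 - (relu(x - xi) + relu(xi - x)) / eps)
                   for xi, y in zip(xs, ys))
    return net

xs = [0.0, 1.0, 2.0, 3.5]
ys = [2.0, -1.0, 0.5, 4.0]
net = make_memorizer(xs, ys)  # net(xi) == yi exactly for every training pair
```

The network is useless away from the samples, which is exactly the point: zero training loss by construction says nothing about having learned anything.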


>Some slightly more technical conditions have to hold in order to have vanilla GD work (since the function is non-differentiable at points)

Which function is non-differentiable at points? If you are referring to my example, it is only non-differentiable at x=0 and x=inf, which are both uninteresting points since they aren't divisors of N; everywhere else the f(x) I gave is differentiable and Lipschitz continuous of order infinity.

This is in contrast to your pathological example of f(x) = { 1 (x != 0); 0 (x == 0) }. Of course GD cannot work there, and I wouldn't fault the paper for it...

Don't misunderstand me: the paper is interesting, but the title and certain phrasings are very misleading, IMHO.

Still, I think the approach taken by others is more interesting: fix an underlying NN as "ground truth," then observe the error of a training NN (of the same architecture as the ground-truth NN) trained to match it.


> which function is non-differentiable at points?

Sorry, this was referring to the construction provided in the paper referenced.

I do agree that the title is somewhat misleading, since, when I first read it (and thought, "this is probably wrong"), I imagined that it proved that, given any resnet, you can show convergence to the global optimum via GD, not just "a resnet of a given size converges to a global optimum, via GD, for a specific training set."

That being said, the paper does not prove (nor claim to prove) general, globally-optimal convergence of GD, which is what I think you're saying (given, for example, what you mentioned about finding the factorization of a semiprime in the GGP and your specific function construction)—which is what I was pushing back against a bit. In particular, even in the title, they only claim to prove this for a specific class of problems (i.e. NNs).

> Still I think the approach by others is more interesting: by looking at the absolute error between a fixed underlying NN as "ground truth" and observing the error of the training NN (of same architecture as ground truth NN) trained to match the underlying NN

I'm afraid I haven't seen this approach, but it would be interesting. Do you have references?


Thanks for your added comments; it's really helpful to see a more candid breakdown of others' views of a paper.

>I'm afraid I haven't seen this approach, but it would be interesting. Do you have references?

They are referenced in the paper in the section:

>Another way to attack this problem is to study the dynamics of a specific algorithm for a specific neural network architecture. Our paper also belongs to this category. Many previous works put assumptions on the input distribution and assume the label is generated according to a planted neural network. Based on these assumptions, one can obtain global convergence of gradient descent for some shallow neural networks [Tian, 2017, Soltanolkotabi, 2017, Brutzkus and Globerson, 2017, Du et al., 2018a, Li and Yuan, 2017, Du et al., 2017b]. Some local convergence results have also been proved [Zhong et al., 2017a,b, Zhang et al., 2018]. In comparison, our paper does not try to recover the underlying neural network. Instead, we focus the empirical loss minimization problem and rigorously prove that randomly initialized gradient descent can achieve zero training loss.

I had this idea independently but never pursued it due to lack of time. It's another reason I like this paper: it references this approach. I still haven't had time to read those references, but from the description in this section it appears they investigate nearly, if not exactly, what I wanted to investigate "some day" :)


The title of the paper is really misleading. The comments here are even more misleading. The key is their theorem, where they say "with high probability over random initialization". They initialize many times and sometimes it converges. A single initialization can get stuck in a local minimum, of course.


But then there is very little of interest: (assuming enough smoothness and Lipschitz continuity) one expects every global minimum to have a convex neighbourhood such that gradient descent starting within the neighbourhood reaches the global minimum. "Initialize many times and sometimes it converges" is just saying the obvious: "there exist initial positions for which GD succeeds in finding a global minimum". Is my interpretation correct?


The contribution of the paper is the estimation of the convergence speed and of the number of parameters in the neural network, which seems a valuable point.


The paper states for the main result:

>In this section, we show gradient descent with a constant positive step size converges to the global minimum with a linear rate.

This is rather ambiguous: it sounds like it guarantees "it WILL converge to the global minimum, and at a linear rate," but I suspect they are really saying "IF it converges to the global minimum, THEN it will converge at a linear rate," phrased to purposefully sound like the first statement.

Could you comment on whether GD does or does not find the global minimum of the integer-factorization cost function in the following comment?

https://news.ycombinator.com/item?id=18439287


Another sin is that they write "with high probability" in the theorem but do not strictly define it in the main section of the paper. If you look at the appendix, you'll see that they guarantee probability 1 - \delta, and delta affects the convergence. It means that if you add many, many parameters, then you just need a couple of attempts to converge well. So "IF it converges" becomes the much better "VERY OFTEN it converges". Sorry, no real passion to read the paper in detail, so this is just an intuition from a quick look.


1) Just because you can prove convergence for convex functions does not mean you can't prove it for non-convex functions.

2) This is "a way" not "The only way".

(If A then B) does not imply (if not A then not B)


>Just because you can prove convergence for convex functions does not mean you can't prove it for non-convex functions.

I don't understand. How do you prove that gradient descent is guaranteed to escape local minima?


Convex functions aren't the only functions with a single local (and global) minimum - consider sqrt(|x|) for a simple 1d example.
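A quick numeric check of that example against the midpoint definition of convexity (the test pair 0 and 1 is my arbitrary choice):

```python
import math

# f(x) = sqrt(|x|) has a unique global minimum at x = 0, yet it is not convex:
# at the midpoint of [0, 1] the function lies above the chord.
def f(x):
    return math.sqrt(abs(x))

x, y = 0.0, 1.0
midpoint_value = f((x + y) / 2)   # sqrt(0.5), about 0.707
chord_value = (f(x) + f(y)) / 2   # (0 + 1) / 2 = 0.5
violates_convexity = midpoint_value > chord_value  # convexity requires <=
```

So a single global minimum does not require convexity, which is the point of the example.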


Yes, that's true. But in the optimization domain, the concept of "convexity" is understood in terms of sets, not always via the 2nd derivative of a function, because you might have search spaces where you cannot differentiate the objective function at all. In those cases "convex" means a "convex set".


Sure: you can define convexity for a function that is equivalent to the 2nd derivative definition in the case that the function is twice-differentiable, using only the definition of the convexity of a set.

Define the epigraph of a function to be the set given by {(x, t) | f(x) ≤ t}. Then, we say f is a convex function iff the epigraph is a convex set.

This is equivalent (exercise for the reader!) to the usual definition that a function f is convex iff f((1-t)x + ty) ≤ (1-t)f(x) + tf(y), for all 0 ≤ t ≤ 1, with x, y in the domain of f.

Note that neither of these two definitions require differentiability (or twice-differentiability), but the definitions are equivalent in this case.[0]

---

[0] For proofs of all of these statements see B&V's Convex Optimization.


When talking about "convex optimization" one nearly always means that both the function and the domain are convex.


No need to define 'convex optimization' here; that follows from the definition of a convex function: F(ax + (1-a)y) <= aF(x) + (1-a)F(y) for 0 <= a <= 1 and x, y in the domain of F. For the inequality to be satisfied, ax + (1-a)y has to be in the domain. That means the domain is convex.


FF NNs with even one hidden layer are universal approximators. That is, they do find the global min. What this doesn't tell you is that it's likely a huge graph and will take a looong time to optimize for even trivial data sets. There are lots of proofs around. That's why SGD is used, and on only a small subset of training points at a time.

Re 2: No.

Re 3: Yes.


They are universal approximators, although the term "approximator" means there's a bound on how well they can emulate a given function (based on the size of the network). However, just because they are universal approximators doesn't mean you can automatically infer the optimal number of connections and the weight for each of those connections in order to minimize the loss over some dataset (derived from some function). Being a universal approximator does not imply you can automatically learn the best approximation of a given function. It's the difference between being capable of learning something and having learned it. If that makes sense.


I don't think you understand what universal approximation means. It means there are parameter settings that would reduce the approximation error as much as you want. It's an existential property. It does not mean that those parameters can be found.

Anyway, this universal property of neural networks gets a lot of airtime and people go gaga over it. It's a complete red herring. It's not the first example of universal approximation and not the last. There is no scarcity of universal approximators; there was no such scarcity even hundreds of years ago. The explanation of the success of DNNs lies elsewhere.


Just to add, J.C. Bose also authored the first science fiction in the Bengali language.

