Having worked on commercially-resold Apache projects, can't say I argue with Wikipedia a whole lot on this. It seems to me it should be let in, but it's a bit silly to go and call them out like this on a corporate blog, IMO.
Dremio does benefit from Apache Arrow publicity and notoriety, even if they don't profit directly. Having a de-facto standard data format and open-source engines is a selling point for some. That's why Dremio explicitly calls it out on their own website. It also never hurts in the recruiting department. (edit: there's a reason the article was submitted by someone working in marketing & strategy)
>> I’m wondering if Wikipedia can continue to be considered a reliable source of information for technical folks who want to learn more about the vast system of Apache open source software projects.
Sign up for the Olympics, because that's a hell of a leap. You didn't get your page in; it's really not much of a reflection on the rest of Wikipedia. It's an open-source project. It should have its own freely available documentation that serves much the same purpose anyway. If I want to learn about Apache X, I go straight to x.apache.org. They concede that it's not an end-user product anyway, so I'd think their key audience knows how to find an open-source project website. Lower the bar too far the other way, and there are plenty of semi-open-source projects whose marketing departments would be all over using Wikipedia to their own ends - I've seen my own former employer do this for their Apache projects.
This was posted by the director of marketing, on their marketing blog... And the Wikipedia article mentions "efficient, effective, optimized" multiple times in the introduction paragraph. Compare that with column-store [1], which the OP's article links to; it only mentions it once, at the end, and in weaker language.
As it stands, the Apache Arrow entry reads like a press release. I would recommend that Justin has a non-marketing copy editor clean it up before pressing the case further.
> I’m wondering if Wikipedia can continue to be considered a reliable source of information for technical folks who want to learn more about the vast system of Apache open source software projects.
I'm confused why the writer thinks it should be!
The Apache Foundation is a big tent. There's some clearly notable projects in there (like Apache httpd), but there's also a lot of really obscure crap that basically nobody outside ASF cares about (like Apache Creadur or Apache Pony Mail). Expecting Wikipedia to document every Apache project is ridiculous.
Is this particular project notable enough for a Wikipedia article? I don't know a lot about it, so I can't say for sure. But the article drafts that I've seen don't convince me that it is.
I would have said Wikipedia is a questionable source on "open source software projects", period.
The idea that Apache software projects are too obscure to include, while every single individual episode of Buffy the Vampire Slayer has a detailed article, is pretty typical of the site.
On some occasions I would love to have articles about open source projects that are written in the high quality and non-point-of-view tone that Wikipedia encourages. A project's own API docs are obviously going to be a better source for, well, the API, but a blunt description of what a project is is something Wikipedia can definitely provide, and I don't see why they don't.
I certainly use Wikipedia to understand what various companies are. There are companies whose own websites seem deliberately designed to obfuscate what the company is. I don't see why Wikipedia couldn't provide the same benefit for open source projects.
> Personally, I don't think I've ever used Wikipedia to learn about an OSS project.
I think I might be part of the silent majority that actually does -- I often use Wikipedia to learn about the origin story of an OSS project. (not random tiny OSS projects, but more established ones)
Project websites don't tell you stuff like the original author, key people, context, adjacent categories of software, the history, the original problem that it was trying to solve, the drama (fights, competitors, disagreements between folks involved), and evolution of the project over time. The Wikipedia article often does.
This type of intelligence is invaluable when evaluating projects/products. If you're not wiki'ing your OSS project, you'd have to google and wade through mailing lists and piece together the story from blogposts, tweets, etc.
There is a set of policies that Wikipedia is supposed to follow when it comes to deciding if a page should be in or not. Nothing in this set of policies disqualifies a page if it benefits a company. Or even if it was written by employees of that company.
Thus, Wikipedia is violating its own policies. It follows that decisions on whether a page should be created become arbitrary, which opens the door to corruption. Some companies pay Wikipedians and get their page(s) created; others don't and don't get any page(s).
I don't disagree with anything you said, but I mentioned the benefit because the blog explicitly denies any benefit. But the reviewers do call out a conflict of interest. They criticized it because it read like an ad, and I agree. I've seen other Apache projects (looking at Drill) that read like an ad, and it's annoying.
It was a pain to get GitLab in 5 years ago after a "controversial" deletion, so it wasn't available for simple undeletion. Domain-specific knowledge has it notoriously hard with wikilawyers who, at large, seemingly stopped adding new things to their world view 15 years ago.
Then it becomes a game of jumping through hoops and hoping you end up with a kind wiki-landlord or knowing a friendly wikipedia admin.
Doing the latter by announcing your concern on social media and hoping a sympathetic admin picks it up might be the easiest on human time and resources: just let them copy your reasonably well-sourced article draft from your personal space and see what happens.
Even a clumsy self-promotion, or over-ambitious fans with no clue about Wikipedia's inner mechanics, shouldn't set back a viable interest in information about a given company or other entity by a multitude of years.
After a deletion it's just magnitudes harder for anyone to get an article restored, compared to an entity which didn't have the "luck" to get added to wikipedia too early.
Deletion history shouldn't have that much of a say on actively developing entities as it has now.
The exact opposite thing is true. If the article is bad, it needs to not be on the site. What's important are reliable articles, not how many articles there are. It's perfectly fine for a topic we know will be more obviously notable in the coming years to stall for an article until a decent one can be written.
This has been the ethos of the project practically since its inception. It's always startling to see people questioning Wikipedia's premises, since it seems pretty clearly to be one of the most successful volunteer projects in the entire history of the Internet.
Wikipedia can actually be pretty schizophrenic on the issue. Depending on timing and the interest groups involved, it can go either way.
I've personally given up on editing Wikipedia (too many fanatics with infinite time), but IMHO it needs to be much more deletionist than it is now. There is value to its current wide scope, but its maintenance model has trouble with long tail articles. It shouldn't have an article unless it can consistently gather a medium-sized quorum of active editors to watch over it.
That is not what Wikipedia's policies say. They say that if a topic fulfills the notability criteria there should be an article for it. They do not say that if an article is bad it should be deleted. Rather the contrary: if an article is bad, improve it!
This was the ethos of the project in the beginning but is not the ethos anymore. People have realized how valuable it is for companies and other actors to have their own article on Wikipedia. Therefore Wikipedians have created a very bureaucratic system for deciding which articles should be created. And people like to wield power. For example, by rejecting perfectly good articles.
This article was struck for not meeting the notability criteria, which involves citing reliable sources that make a straightforward claim of notability. It's not a perfectly good article.
If the problem is rejection of "perfectly good articles", why start by arguing there's no grounds for deleting bad articles? Seems like dancing around the point.
> Doing the latter by announcing your concern on social media
Be careful about doing this. It's harmless if you're simply a concerned user, but once you're actually in a dispute with someone on wiki it can easily be in breach of their guidelines.
Open source projects are particularly tricky for Wikipedia. There are tens of thousands of them. Their owners are often passionate. They compete with each other, so there's incentive to write hard-to-adjudicate competing claims. Many have commercial backing, which further warps incentives. The projects themselves are highly technical; many, like Arrow, are software development tools and components. There are few authoritative sources that reliably track open source projects. Keeping up involves directly following bug trackers and message boards and then synthesizing a narrative, which is the definition of "original research", forbidden in the encyclopedia.
It's likely that Arrow does deserve a WP article. But Arrow's sponsors misunderstand more about Wikipedia than Wikipedia does about Arrow. Writing a defensible article about their project will require work; in particular, they're going to need to spend the time tracking down authoritative sources for why Arrow is notable, and those claims will probably need to be something more persuasive than "hundreds of companies use it"; hundreds of companies use all sorts of things that aren't, and shouldn't be, featured in their own encyclopedia articles.
I understand the impulse behind "this project is important; it should have a Wikipedia article". But when you take a step back and accept what Wikipedia actually is, rather than what you think it should be, you're left with the question: do we really need to feature this particular piece of software in its own encyclopedia article? 20 years from now, will people still be getting value from it? Whatever value that might be, will it outweigh the 20 years of other people's volunteer efforts to maintain the article, keeping it free of vandalism and ensuring that it doesn't surreptitiously turn into a promotion piece for some company or another?
The answers might be "yes". But I don't see much evidence that this piece considered the questions.
Lots of things that don't seem deserving have in-depth Wikipedia coverage. Many of those things probably really don't belong in an encyclopedia! But there are two sides to this problem: the merit of the topic, and the cost, in volunteer time, of including them. A marginal topic can be defensible if it's easy to reliably cover it. A seemingly important technical topic might not be if the only way to say anything interesting about it is to write original research directly into its article.
Late edit
A useful tip for getting your open source project covered in its own Wikipedia article: don't have the Chief Marketing Officer of the company that owns the project write the article.
This is a great comment; I'll just add one other thing, which is something I've mentioned before in arguments about Wikipedia: Wikipedia's goal is verifiability, NOT truth. "Truth" is explicitly a non-goal of the Wikipedia project. For any given subject, Wikipedia is not meant to provide the truth about that subject, it's meant to be a summary and distillation of the existing reliable sources about it. If there are none, that's neither Wikipedia's fault nor its problem.
You can take issue with this goal, but that's how it works, and it's also how encyclopedias have always worked.
>and it's also how encyclopedias have always worked.
Well... Hopefully verifiability and truth have some correlation. Otherwise I'd argue that verifiability isn't worth much. What is different from traditional encyclopedias is that they did make determinations about what was important (which is at least akin to notability) and would allocate articles and pages as appropriate. From today's perspective we might dispute the judgments of importance but they were there.
> Hopefully verifiability and truth have some correlation.
Not as much as you would hope.
I have two sisters with Wikipedia articles. Let's pick https://en.wikipedia.org/wiki/Jennifer_Tilly for one of them. It claims that her mother was Irish and Finnish, and goes on to list how many siblings she has. Those statements are verifiable but false. You can find an article written by reporters that said those things.
She isn't Irish, her step-father (my father) was. She also has 2 more brothers than are listed in that article. That is true, but not verifiable. Nor will they ever be verifiable. And therefore Wikipedia will never be corrected.
The problem here is that the Gell-Mann Amnesia Effect (see https://www.goodreads.com/quotes/65213-briefly-stated-the-ge... for an explanation) guarantees that there will be lots of verifiable statements that aren't so. Wikipedia builds a coherent view of a subject on that sand, and it is very hard to find what it is mistaken about. But it is riddled with errors that will never get fixed because they were wrong in a verifiable primary source.
And information not captured in a verifiable primary source will never make it in. For example her grandfather was the T in https://www.cmtengr.com/. Good luck verifying that one!
>And information not captured in a verifiable primary source will never make it in
In theory. In general? I was just looking at an article where I have a lot of personal knowledge.
Is mostly True, as far as much of my first-hand knowledge can tell. And leave aside a couple of the random personal insertions that are definitely True if outside of all proportion to the rest of the article.
But there's one section in particular that goes into even more detail than I knew even as someone fairly in the depths of this particular thing. (But it's very plausible and consistent with what I do know.) It's certainly not something that's ever been written about publicly AFAIK and the actual references in the article are minimal.
Which comes back to the point that notability/verifiability/etc. are nice theories--and may even make sense in the abstract--but there's a huge amount of inconsistency depending upon whether someone has taken notice of an article or not. (And, in at least some cases, I'm often happy with people not looking too hard.)
Sure. I'm also not sure that the fact that Wikipedia's rules often fall through the cracks is entirely a bad thing. You end up with some unverified information. You also end up with maybe somewhat unreliable information that would never have been verifiable. Even if I can't fully endorse this sort of informal breaking of the rules, I'm not really opposed to it either.
Are you seriously confused by my carelessness with pronouns?
Jennifer and I are siblings. Our mother's mother was Finnish. Our mother's father (the Tilly in CMT) was a complicated mix. Jennifer's father was Chinese. My father was Irish.
She was born Chan, I was born Ward, our names were changed to our mother's maiden name after her divorce from my father.
The "Gell-Mann Amnesia Effect" being the banal fact that reporters are sometimes wrong about things?
Have you tried leaving a comment on the Talk page of the article saying that you're Jennifer Tilly's sister, linking to something about you (you're obviously bona fide), and asking for a correction? WP has special reliability rules (WP:BLP) for "Biographies Of Living Persons".
It doesn't look like CMT has a Wikipedia article at all. Should it?
> The "Gell-Mann Amnesia Effect" being the banal fact that reporters are sometimes wrong about things?
Sometimes?
I've yet to read a feature article written by a reporter on a subject that I know well which didn't have multiple mistakes.
> Have you tried leaving a comment on the Talk page of the article saying that you're Jennifer Tilly's sister, linking to something about you (you're obviously bona fide), and asking for a correction? WP has special reliability rules (WP:BLP) for "Biographies Of Living Persons".
Actually I am one of the brothers that Wikipedia does not know about.
Back in the 2007-2008 period I decided to make some obvious corrections. They got rejected. I left some comments in talk. A couple of my comments are still there on Jennifer's talk page.
As for CMT, you tell me. It is a civil engineering company that has existed for decades and has a significant presence in multiple states. But there isn't much about them online other than the company website. Which, by definition, is not considered reliable.
I have it in for the "Gell-Mann Amnesia effect" (is there even evidence that Gell-Mann believed in it?), but your point is well taken: Wikipedia's rules do heavily privilege journalism, and journalism is merely the first draft of history, not the camera-ready final.
It's possible that Wikipedia has carefully balanced this; if they didn't privilege reporting, a lot fewer articles would get written, about a lot of things people actually do want to look up in the encyclopedia. Reliance on journalism means they'll routinely get some bad facts, but there's a bound on how bad things will be that there wouldn't be if they just got rid of WP:RS altogether.
It's much more likely that nobody has carefully thought about this, and it's just a shambolic volunteer project taking advantage of what they have to work with.
My basic take about Wikipedia is that it's hard to argue with the results. However obnoxious their policies are to nerds like us (and I commented upthread about obnoxious experiences I've had working on it --- I no longer contribute!), it's a tremendously successful project, perhaps one of the most successful in the history of the Internet.
It's bad when they have bad facts, more so when those facts pertain to living people, even more so when someone has the correct facts and can't get them accepted, and especially so when that person is a family member of the subject.
It's less bad, to me at least, that an encyclopedia happens to lack a page, for now, on Apache Arrow.
We're basically into the deletionist vs. inclusionist debate that is at least somewhat orthogonal to what laypeople think of as notability. Is a Pokemon character notable? Not really. But because of the enthusiastic fan base, tons have been written about them.
On the other hand, whether you're talking open source projects beyond the big names, corporate executives, or just people who are reasonably well known within fairly large communities, there just isn't a lot of independently sourced published material about them, especially in mainstream pubs--which (somewhat both understandably and ironically) Wikipedia tends to prefer. You even have people with tons of hits on Google but there isn't a ton of info about them online.
What "debate"? This isn't a live debate. There is a faction of people, some of whom are involved with Wikipedia, that want it to be something other than a tertiary-source encyclopedia, just like there are people who want to be able to write blog posts as Stack Overflow comments. It's true that they will never stop advocating for these changes, but there's no evidence that the projects themselves are going to cave.
Maybe it's not a debate so much as a tension--and it's a real one. Personally, I haven't contributed anything to Wikipedia in years. It's useful, I see its flaws, but I certainly don't care enough to push on it for the most part.
I'm exactly the same way. For instance: I did some writing about macOS security in the macOS articles, way back when, and most of it got struck because I couldn't cite it properly. It was frustrating to write a straightforward statement, like "the macOS Seatbelt sandboxing mechanism uses s-expressions", and have it get struck.
But I came quickly to realize the project was right. Without a reliable secondary source, I was effectively conducting research in the pages of the encyclopedia. What I learned from that was: I shouldn't be writing encyclopedia articles; the technical writing I do tends not to be tertiary.
It's fine – good, in fact – if most people don't write much in Wikipedia. It's its own special thing. You can't argue with its success: it might be the most successful project in the history of the Internet, and a long-term contender for one of the most successful volunteer knowledge projects ever.
This seems like an argument that says that Apache Arrow is as important as the paper clip, which would be an extraordinary claim.
That paper clip article is itself extraordinary. Go look at it again. It delves into the history of the paper clip, covers different designs, has excerpts from paper-clip-making-machine patents, and describes an actual controversy(!) over its invention, all carefully illustrated (illustrating things on Wikipedia is a bitch, by the way, because of IPR rules). People went through a lot of effort to make a good paper clip article.
And Wikipedia considers the paper clip article to be a "C-class article" (C here means approximately what it means in school), and the topic of "low" importance. Just so we're clear on what the bar is here.
Compare that with the author's attempt at an Arrow article:
It's a paragraph of promotional material, a brief comparison to other systems, and a citation to a blog post saying "I do not see any reason not to embrace the Arrow standard".
Come on.
I think there probably should be an Arrow article. The authors have found a bunch of reliable sources covering it; they just haven't distilled from them a defensible claim to Arrow's notability. I think it's a matter of putting the work in.
I highly suspect that with some actual effort I could find an even less deserving office item.
And you may be right that Arrow needs to do more to be notable and ready for its own page. But ignoring some objective standard and instead looking at the relative standard of other articles, it does feel like there are some unequal requirements in this regard.
The binder clip article has many of the same merits as the paper clip article. The bulldog clip article is more interesting: it's a "stub" article (its authors are explicit about the fact that it's not a complete article), and still it manages to track down some of its history and cite interesting uses from books – someone had to read those books and fish the bulldog clip cites out of them.
I think it's pretty clear to anyone why bulldog clips are in the encyclopedia, and it is only clear to subject matter experts with strong opinions why Arrow would be.
If your topic requires subject matter expertise in order to recognize its importance, the standards are unequal: you are going to need to do more work to establish its notability, because you cannot reasonably expect the layperson volunteers in the Wikipedia project to do that work for you.
An item which almost every office worker has seen or used is definitely notable enough to get an article. Yet another data format among hundreds which has yet to reach a wider audience could be, but it is not obvious.
> I understand the impulse behind "this project is important; it should have a Wikipedia article". But when you take a step back and accept what Wikipedia actually is, rather than what you think it should be, you're left with the question: do we really need to feature this particular piece of software in its own encyclopedia article? 20 years from now, will people still be getting value from it? Whatever value that might be, will it outweigh the 20 years of other people's volunteer efforts to maintain the article, keeping it free of vandalism and ensuring that it doesn't surreptitiously turn into a promotion piece for some company or another?
I really don't think a 20-year view is a good measure of whether or not an article should exist. Even if something is forgotten in the future, if it has relevance and importance today then that alone makes the article worth existing.
For-profit businesses are particularly tricky for Wikipedia. There are tens of thousands of them. Their owners are often passionate. They compete with each other, so there's incentive to write hard-to-adjudicate competing claims. Many have commercial backing, which further warps incentives.
They are! Spend some time patrolling AfD. They're a huge problem; companies are constantly trying to get themselves into Wikipedia, because Wikipedia is heavily privileged in Google search results. But for-profit companies tend to present clearer cases for WP volunteers: they're either well-covered in reliable sources, in which case they're easy accepts, or they're not, in which case they're easy rejects.
The problem with OSS is that lots of projects probably do merit pages, but it's hard to see which ones.
>Arrow is designed to serve as a shared foundation for SQL execution engines, data analysis systems, storage systems, and more – think Pandas, Spark, Parquet, etc. Engineers across the community are working together to establish Arrow as a standard for columnar in-memory processing.
I like to think I'm fairly techy for a non-programmer but I have no idea what that means. That might be part of their problem if that is the description in their wikipedia entry.
So: find conference papers/talks by people not affiliated with Apache or the Apache Arrow project and that discuss Apache Arrow. Figure out how to incorporate the tidbits about Arrow from those papers into the article text. Add sources in footnotes. Done.
The 1st source is a blog post on a consulting company website.
The 2nd mentions Arrow only in passing, after several pages of coverage of Spark; Arrow is covered only in relation to Spark. It's a reliable source but doesn't clearly establish notability.
The 3rd mentions Arrow hardly at all; it's an implementation detail, mentioned just once, in a paper about something else.
I can't fetch the 4th.
The 5th, a story in The Register, is reliable and probably does go towards notability, though it seems to sort of argue against it (the gist of the article is that it's surprising that Arrow has been made a top-level project at all).
The 6th, in CIO, is a recap of a press release. Trade press PR recaps shouldn't be WP:RS, but WP will often accept them, or would when I was patrolling AfD; it's luck-of-the-draw. The admins who shot down Arrow's page were smart enough not to accept it.
The 7th, in InfoWorld, is promotional as well, but it's at least written in some depth. It's a straightforward notability claim. The Arrow article should draw more clearly from it, in the opening paragraph.
The 8th, in SDTimes, is written by someone affiliated with the project itself; it's citable, but WP probably won't accept it independently as grounds for notability.
Same, in effect, for the 9th, which is just a recap of an interview with the project author.
The 10th and 11th are just blog posts. They're citable if they're not contentious, but they usually won't be acceptable as WP:RS for notability.
Blog posts are prima-facie evidence of notability. Same thing with mentions in published articles. From the book (second link):
"Recognizing that Value Vectors meet the needs of other data processing engines, in February 2016, the Apache Software Foundation announced Apache Arrow as a top-level project, bypassing the standard Incubator process. Committers to the project include developers from other Apache projects such as Calcite, Cassandra, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark and Storm.
Apache Arrow enables execution engines like Spark to take advantage of the latest operations included in modern processors, for fast analytical data processing. Columnar layout of data allows for better use of CPU caches by placing all data relevant to a column operation in as compact of a format as possible. ...
Apache Arrow software is available under the Apache License v2.0.
Dremio, a startup led by Jacques Nadeau, chair of the Apache Drill and Apache Arrow Project Management Committees, leads the development."
In the past, this and the other sources would have been more than enough to establish notability. I know that because I have created Wikipedia articles on subjects much less notable than that. The problem for Apache Arrow isn't that it isn't notable enough, it is that people have already tried four times to get it included in Wikipedia so the Wikipedians voting on new page inclusions are getting suspicious about it.
If you want to sum up something like 10 years of debate and consideration of the role of blogs as sources (it’s much more complicated than that they’re not allowed) by saying, in effect, “you’re all wrong”, well you do you.
I'm merely saying that you are wrong. Blogs are not always reliable sources in the Wikipedia world, but they can absolutely be used as evidence for notability.
Yes, routinely. You can find plenty of articles which had much less support in sources when they were created here: https://en.wikipedia.org/wiki/Category:AfC_submissions_by_da... That Wikipedians rejected the article is a moot point because the argument is that the rules are not applied consistently.
Blogs are not a consistently reliable source, particularly for notability claims. It depends on the subject and on the blog. I'm not making this up; I spent a year doing AfD patrol, and this was probably the most frequently debated point in AfD arguments.
Obviously, they can't always be WP:RS, because then literally everything would be "notable", since anyone can stand up a blog about anything. You can't even logically assemble the argument you're trying to make.
I didn't claim that blogs were consistently reliable sources. I claimed that they were routinely used as evidence of notability. Evidence of notability != Reputable source.
I'm not making anything up either; I have penned several articles on Wikipedia and gotten them through the AfC process with much less notability evidence than the Apache Arrow draft had. The difference was that I used to be an established contributor, so the rules were not as harsh against me as they are against newbies and unknown contributors.
Also, you can look at the link I gave you and see that the notability rules are not uniformly applied.
Of the 10 links you list (the dbsmusings link appears twice), 5 are used to back up the claim that Arrow was “donated to the Apache Software Foundation[7] in 2016, where it has been maintained and extended since.[7][8][9][10][11]”, which doesn’t really seem like it needs that many sources.
Of the other half, one appears to be some sort of marketing blogspam, one is a paper that briefly mentions that they used Arrow, and two I can't access for various reasons. That leaves one blog post that actually discusses Arrow, and the sentence it's used as a reference for in the draft article isn't about Arrow specifically, but the tradeoffs of in-memory vs on-disk storage.
Yes, these links may be independent of the Arrow project, but I'm not convinced that they add anything of substance to the actual content of the article. Mostly it looks like they were added in an attempt to game the number of references.
The blog post should have included these citations because I was left wondering what they did to support their claims. It sounds like they probably should have an article but that they also have misunderstandings of Wikipedia.
Of course. Similarly, when you are submitting a PR to an open source project or a manuscript to an academic journal, it's your job as the author to take note of contributor guidelines.
Yes; Wikipedia's expectation is that authors have researched the topic about which they are writing, and therefore they are in the best position to provide the sources from which they got their information. The editors' job is to ensure that Wikipedia's standards are met, not to re-do the research that the author should already have performed.
If the case is that no research was performed because the author is already an expert in the area, they are still expected to provide citations so that the same standard can be applied to all authors.
As a strategy for getting Dremio on the front page of HN and thus on the radar of a large group of tech people (i.e. Dremio's prospects), this article is very clever.
The first article is a paid promotion piece, which WP won't accept as an RS.
The second is a press release by Arrow's sponsoring company, which, obviously, WP won't accept as an RS.
I have no idea what "The Silicon Review" is; this is the first time I've ever seen it. To the extent it's not a pay-to-play trade publication, it might qualify as a notability-establishing source. The fact that the "Review" does not itself have a WP page might make it harder to claim it's reliable, since it suggests nobody else knows what it is, either.
Looks like my lateral reading was sub-par (actually I didn't even try, just a quick Google/post).
The "Silicon Review" one looks like pay-to-play as well after further review. It's used as a citation on a few other Wikipedia articles, but as far as I can tell, and going by some anecdotal stories, it doesn't look good.
Good catch, thanks for spending the time to review my links. Reading your comments above, I largely agree. It's a high bar (mostly) to get an article on Wikipedia, and that's a good thing. It allows us to read the majority of content on Wikipedia without too much suspicion.
I mostly agree. It is distinctly marketing-flavored, although not to a degree that I think should disqualify it alone.
What I think should disqualify it is that it's missing a lot of detail that would make the entry genuinely useful. As it is, it's as useful as a press release. Also, it does appear to have a problem with appropriate references.
Generally speaking, I have a hard time disagreeing with the reasons listed on that page for the rejections.
Once I witnessed awesome articles [others added and I used with delight] on open source frameworks as well as some minor facts [I added] on other subjects deleted for being "insignificant" I decided I'm not donating to Wikipedia until this bullshit ends.
Wiki articles are not videos; they take little disk space to host, so I can't see any reason for dismissing "insignificant" information other than a stupid rule.
IMHO whatever can be considered a piece of knowledge should be there.
BTW nearly the same applies to StackOverflow - thanks to the high reputation points I earnt during the early days I can see deleted questions and answers, and I often see really interesting questions (with three-figure upvote scores and dozens of stars) and very informative (also heavily upvoted) answers deleted.
It's strongly discouraged and looked down upon. Editors with a conflict of interest take up a disproportionate amount of other editors' time and are practically never able to write neutrally about themselves or their company.
1. Conflict of interest has nothing to do with whether it’s open source. If I submit a Wikipedia article about my completely uncommercial personal project it’s still a huge conflict of interest. Rule of thumb: don’t submit Wikipedia articles about yourself, your project, your employer or your employer’s project. If it’s notable enough eventually someone without a conflict of interest will do it.
2. Tons of companies use open source for marketing. This one is no different as far as I can tell. Even had the chief marketing officer submit a Wikipedia article for their project.
But it's not _their_ project, it's a project they contribute to. Google contributing an article about k8s would be a completely different thing from Google contributing an article on say Hadoop.
The problem with these rules is that they are so selectively enforced that it is a farce. They selectively assume bad faith where, by any objective measure, there is none, and brush over other cases as long as they are backed up by random, arguably non-neutral publications.
It's near impossible to put articles up about prolific female journalists, for example, because all of this can be enforced against them: the publications are all from the same source (the publisher they work for), or are interviews, or some sort of talk or award. It's almost never enforced for men, who get away* with linking to a podcast.
Sadly, I can jump on the "Wikipedia fails" train here, too. In about five attempts to really change an article (different ones) in about five years, every single change was rejected, as far as I know. The changes were different: one was the writing style and order of facts on a public historical event in this century; one was adding a lot of detail to the description of a popular fantasy fiction series; one was removing a controversial and provocative one-liner at the top of a page about people at the edge of (western) society; and another... hmm, I forget now, because I just gave up!
My aging colleague tells me to just keep making the changes; they can't stop everything. However, my direct (and limited) experience is that they do stop everything (that I try). I was logged in twice and anonymous three times, and added a few citations, too.
To the point of the article: FOSS projects in Wikipedia? Hmm, maybe there could be a clear category for that? Software projects are proliferating rapidly... dunno.
The way around mindless reverts is to first detail the proposed change on the talk page, then wait for anyone interested to object. If no one does, you can make the change live, pointing out that no objections were raised. Even if someone does object, such objections should ultimately be made actionable, i.e. it should be made clear how to address them to the other party's satisfaction.
Yes, I did, and I feel that this revert behavior was more hazing/article control than substantive in all cases but one, and with that one I don't personally agree.
It looks like there aren't enough independent, non-commercial articles to use as references. This is somewhat common for many newish technical projects. Add some academic papers, some usage numbers, some summary blog posts that aren't related to the project. Wiki editors are very suspicious of people from companies editing articles related to their work.
Why do you care about having a Wikipedia page for Arrow? Why is it important enough to whinge about on HN?
Wikipedia is much like Stack Overflow these days: the community has become hostile to newcomers who fail to meet its somewhat arbitrary but very exacting standards for what is allowed on the site.
Fortunately, you can just publish your own web site. No need to be bothered about not being on WP.
For those who think that edit wars, content disagreements, and inaccuracies are any special realm of Wikipedia, they're not.
One of the best examples I've encountered demonstrating this is a 19th-century revision war between the British and American publishers of Chambers's Encyclopaedia, on the topics of Free Trade, Protection Duties, Slavery, and certain salacious particulars concerning His Royal Highness, the Prince of Wales.
What's novel concerning Wikipedia is that these disputes (as with those of free software vs. proprietary software) tend to occur, or at least leave significant evidence, in the open public record.
The hard-line Wikipedia deletionists should be deleted themselves. The argument is always brought up, as with StackOverflow, that they have to be ruthless or it turns into an Eternal September dumping ground of garbage, but the quality is already very uneven, and gatekeeping like Cerberus doesn't help further that goal. There's already a toxic Dead Sea effect where the pedantry and politicking have chased out a lot of people who would contribute; who the hell wants to bother putting in some hours writing something up if it is just going to be summarily deleted?
Bandwidth and hard drives are cheap.
Just spitballing, but it'd be nice if Wikipedia worked a little more like Linux distro repositories. Keep the tightly curated articles in a "core", but leave room for "community" or "nonfree" collections if you want to turn them on.
> Just spitballing, but it'd be nice if Wikipedia worked a little more like Linux distro repositories. Keep the tightly curated articles in a "core", but leave room for "community" or "nonfree" collections if you want to turn them on.
I think that's a fantastic idea, especially if it would lead to a drastic reduction in the number of articles served from the main Wikipedia domain (to a number that can meet some reasonable quality and maintenance standard, maybe 10 times the size of the most comprehensive print encyclopedia, or 1/6 of Wikipedia's current size) [1].
Most communities seem to go that way. In the beginning, most people spend their time contributing first-order content. Then, as the community grows, it attracts more meta-users who are more interested in moderating the content creators. They create ever more rules and policies requiring content creators to jump through more and more hoops. Eventually the experience becomes so frustrating that people give up.
Wikipedia seems to me to be in that situation. StackOverflow is on its way there. It has exactly the same kind of problem with "deletionists" that Wikipedia has. Perfectly good questions are often closed for very arbitrary reasons.
The whole concept of "notability" in Wikipedia-land is subjective as hell. Whether your article makes it in is simply a matter of rolling the dice the first time you submit the article.
I created an article for NodeBB, a piece of forum software used worldwide by companies small and large (including several triple A gaming companies). We got AfD'd, and now every time someone creates an article for NodeBB, the AfD is brought up and the entire discussion ends as soon as it has begun.
We even created an article the _suggested_ way, by submitting a draft for review. It got reviewed alright... instant rejection because they felt it looked like an ad. We made changes, but nobody ever took a second look at the article.
Of course, a number of defunct open-source (and some proprietary) forum softwares with zero sources are still allowed on Wikipedia, simply due to the fact that they made it through when nobody was looking :)
One could argue that we shouldn't be writing our own articles (and they'd be right), so we just quietly accepted our judgement and market NodeBB based on the merits of the software, instead of whether it appears in some arbitrary ranking of forum software.
That said, it'd still be nice if we were listed in the Wikipedia list of forum softwares.... _sigh_, a guy can dream.
We're having the same issue getting the GraphBLAS API article accepted: https://en.wikipedia.org/wiki/Draft:GraphBLAS. At first it was summarily deleted overnight; now we're stuck in Draft for who knows how long.
The original argument was that if you can find a citation in print you can have whatever it is on Wikipedia, but that ceased to be true years ago and it has become a popularity contest and power struggle with obnoxious Wikipedia editors.
For context, some other companies contributing to it are in the GPU space, so orthogonal to CPU-centric Dremio: Nvidia, Blazing SQL, and Graphistry (us). Likewise, the pydata big guns intersect a bit here: conda, pandas, ... . This effort got a BOSSIE award for GPU dataframes this year and is taking off now that it is becoming usable for more than just framework devs. The reason we all rely on it is that a standardized columnar IO streaming format is an awesome idea for compositional HPC.
It does sound like maybe Dremio's CMO wrote the original articles and they came off as centered on Dremio? (Did not have a chance to read.)
And the same guy submitted it here 3 times. Also overtly promotional.
riboflavin, genuine suggestion: ask one of the PMC members or committers to rewrite the article from scratch from an engineer's perspective, source everything, demonstrate notability, and resubmit. If they still don't take it, move on with your life. ... but you might have generated a lot of ill-will with the Wikipedia elites here already.
I've literally never heard of this piece of software, and it's fair to say I'm much more interested in FLOSS than the average person on the internet. Why should this thing have its own article and not just appear in a list of Apache foundation projects?
Someone decided all the technical information on the subject was irrelevant and deleted the entire Data Rate and Technical Improvement sections. Another reason given was that those details were not finalised.
It was a little frustrating that this useful information was gone, though one could always find it in other sources and media. But they also deleted the whole section on DensiFi [1], where all the major companies (Apple, Broadcom, Cisco, Intel, Qualcomm, Huawei, Samsung and others) behind 802.11ax decided to do the work behind closed doors. TL;DR: they were trying to push 802.11ax to market earlier despite all the unresolved issues.
So I decided to add only the DensiFi section, and it was constantly being deleted within 24 hours. After a few weeks of fun the page simply went back to the original, where Data Rate and Improvement are back but the DensiFi section is totally gone. So it turns out it wasn't the technical sections they were trying to get rid of.
P.S. We should be glad someone in the working group discovered this and called it out. The current WiFi 6 / 802.11ax situation and UX is much better than what we had when 802.11ac shipped, although this came at the expense of a delay of roughly 2 years to the standard.
There are many interesting and good points in the discussion here, thank you!
To add my 2 cents:
- Apache Arrow is notable and deserves a Wikipedia page. It might not have been when someone first tried to create a Wikipedia page for it in 2017 (see https://en.wikipedia.org/w/index.php?title=Draft:Apache_Arro...), but in the three years since it has become a major project; see e.g. https://blogs.apache.org/foundation/entry/the-apache-softwar... Notability is clearly subjective and depends on what the author and reviewer find interesting. In the variant I submitted yesterday I tried to make it clear why it's notable: Apache Arrow is a standard format that connects different languages, runtimes, data systems, and communities, e.g. the Python and Java data communities. See e.g. https://wesmckinney.com/blog/apache-arrow-pandas-internals/ - Apache Arrow is to my knowledge partly the brainchild of Wes McKinney, creator of pandas; it's his attempt (looking strongly like success) to resolve a major issue in data science.
- I think it's a good point Justin made at https://www.dremio.com/why-apache-arrow-wikipedia/ that it's bad that Wikipedia editors reject articles on stuff they know nothing about - if you look at their profiles, they don't seem to have any knowledge of or interest in technology or software. That's not a good system.
- I haven't really contributed to Wikipedia before, and I don't understand the rules, I admit that. Probably what I did yesterday was just not following their process, and that's the reason my edit was reverted. I guess it's also true that Justin at first didn't do a great job at submitting an impartial, non-PR article. However, my understanding from looking at some drafts and the talk page is that he then took the editor comments into account, and the last variant of the page he tried to submit in July 2019 was OK.
- So overall I think the answer to the question "Why isn't there a Wikipedia page on Apache Arrow?" is that it's an unfortunate case of authors and editors not doing a great job. At least I'm pretty sure I didn't do a good job yesterday: I wanted to help, but only had an hour, not a day, to learn how Wikipedia ticks and to do more research to find better references. I hope someone with more experience in Wikipedia and Arrow will try to re-write and re-submit the Wikipedia article in the future.
- The rule to discourage (or forbid?) people involved with Apache Arrow from contributing to its Wikipedia page is unfortunate. I recently started to use it and learn about it, but I don't know much about it at this point. E.g. Wes McKinney has written at this point 8 high-quality blog posts about it (https://wesmckinney.com/archives.html) - those don't count as references? Even if he or the Apache Arrow team wrote a paper about it, it wouldn't count because it's a primary source, and Wikipedia only wants secondary sources to establish notability? There are ~ 100 videos on YouTube, and many blog posts and a few podcasts (e.g. https://softwareengineeringdaily.com/2016/07/17/apache-arrow...) that mention Apache Arrow. Naturally almost all of them are from Apache Arrow contributors, or from companies using Apache Arrow.
- Apache Arrow has an interesting story, and it has evolved over the past years and will keep evolving, so I think exactly for that reason a Wikipedia page would be good to have, since the current project page and old blog posts don't capture that well.
I dunno, this kind of thing seems like exactly the canonical argument for deletionism. Maybe there's no cost to a page sitting on Wikipedia describing, like, some guy's special attack from Naruto. There are reasonable arguments that allowing things like that would set a bad precedent and encourage behavior that doesn't help the project, but I admit it's pretty tenuous.
There are obvious and important costs if Wikipedia articles start being perceived as promotional material rather than encyclopedia entries.
> Maybe there's no cost to a page sitting on Wikipedia describing, like, some guy's special attack from Naruto.
There is a cost, but it's measured in hours of maintenance labor not bytes of storage.
If Wikipedia wants to maintain a semblance of accuracy [1] in the face of declining participation, it needs to concentrate its labor resources rather than spread them out.
[1] which IMHO is vital given its unwise prestige as arbiter of truth
Since Wikipedia's concentration of labor itself is a source of declining participation[1], it's doubtful that continuing this behavior will result in anything other than a death spiral with even fewer people ready to do the work, more concentration, even fewer... and so on.
> Since Wikipedia's concentration of labor itself is a source of declining participation[1], it's doubtful that continuing this behavior will result in anything other than a death spiral
I'm not as interested in the viability of Wikipedia's culture as in the reliability of Wikipedia as a resource given its prominence. I'd take a dead Wikipedia over one that's lively and fun but full of crap and poorly-checked influence attempts.
It's never going to recapture its halcyon days, so it's going to have to evolve with the times in more ways than one.