Wikipedia:Request a query

From Wikipedia, the free encyclopedia
(Redirected from Wikipedia:SQL requests)
Jump to navigation Jump to search

This is a page for requesting one-off database queries for certain criteria. Users who are interested and able to perform SQL queries on the projects can provide results from the Quarry website.

You may also be interested in the following:

  • If you are interested in writing SQL queries or helping out here, visit our tips page.
  • If you need to obtain a list of article titles that meet certain criteria, consider using PetScan (user manual) or the default search. Petscan can generate list of articles in subcategories, articles which transclude some template, etc.
  • If you need to make changes to a number of articles based on a particular query, you can post to the bot requests page, depending on how many changes are needed.
  • For long-term review and checking, database reports are available.

Proportion of biographies[edit]

I'd like to know, if possible, the proportion of biographies where the subject is:

  • American
  • a sportsperson (if there isn't an existing definition of this, then their article is in a subcategory under Category:People in sports works for me)
  • An American sportsperson

Thanks. Thryduulf (talk) 10:28, 12 October 2022 (UTC)Reply[reply]

The category tree on enwiki is useless for this sort of question - go deep enough into, say, Category:Births to get to all the individual years' articles, and you start finding ones like Category:Sulaiman Mountains. Category:People in sports and Category:American people are just as bad. You might have better luck at d:WD:RAQ. —Cryptic 11:03, 12 October 2022 (UTC)Reply[reply]
On the other hand, if this is for WT:ITN#Is there an editorial policy on Main Page Recent Deaths entries?, then you're probably interested in the number of biographies of living people. That's easy, since CAT:LP is non-diffusing. And while counting the number of biographies of living Americans and of living sportspeople is still hard, there's a special case of the latter that's also feasible: essentially all footballers transclude {{Infobox football biography}}. quarry:query/68028: there's 1048809 blps, and 168886 footballer blps (1 out of 6.2). This is actually slightly worse than when I first ran these queries in early 2019 (quarry:query/33441, quarry:query/33442) and got 1 out of 6.4. —Cryptic 20:32, 12 October 2022 (UTC)Reply[reply]
If Wikidata results will do, then have a look at https://w.wiki/5pBZ I don't know how to limit it to entries associated with an English Wikipedia article, but I believe that is possible. WhatamIdoing (talk) 02:23, 14 October 2022 (UTC)Reply[reply]
@Thryduulf@WhatamIdoing This is two weeks late, I'm afraid, but in case it is still of interest - per wikidata, English WP has -
The WHGI tool has pre-crunched the data and reports, as of mid-October, 1,908,734 enwiki biographies; this would make our pool of biographies 20.4% Americans, 33.1% athletes (!), and 5.1% American athletes, give or take. Strikingly, those numbers imply that American bios are noticeably less likely to be athletes (~1/4) than the overall average on enwiki (~1/3), which surprised me a bit.
Both WD and WP approaches are quite generous in their counting - for example George W. Bush is in a rugby players category and has the occupation listed on WD, so will be in the count by either approach. Granted, he did play for Yale, but it is not quite what we might expect to see! I don't think there is a practical way to work out just 'professional athletes' from either source.
Lastly, baselining this against @Cryptic's figures for footballers, I think there are currently 198,214 mainspace pages transcluding {{Infobox football biography}}, and WD reports 204,265 "occupation: subclass of association football player + on enwiki". So the two approaches don't line up precisely, but are easily within a couple of percent of each other, which is reassuring. Andrew Gray (talk) 12:21, 26 October 2022 (UTC)Reply[reply]
On second thoughts, we can sort of proxy for "professional athlete": of our 97,178 American athletes, a total of 11,308 have a second occupation listed which is not a subclass of athlete, and across all athletes it is 56,340, ~11% in both cases.
This is imperfect in both directions (a lot of "primarily athletes" will also have some other secondary occupation tag, especially for post-retirement careers) but I think it can give us some confidence that we're not getting an excessively high rate of false positives from the generous counting. Andrew Gray (talk) 12:41, 26 October 2022 (UTC)Reply[reply]
Blast from the past: User talk:BrownHairedGirl/Archive/Archive 065#Systemic bias * Pppery * it has begun... 04:16, 14 October 2022 (UTC)Reply[reply]

Frequency of certain phrases in revision edit histories by page type[edit]

Looking to find out how often some profanity, curse words, are said in revision history. I have done the same thing manually for article, article talk, user, user talk. I suspect it is much higher on revision history

Wakelamp d[@-@]b (talk) 15:15, 1 November 2022 (UTC)Reply[reply]

Got any additional details? What word(s) should be searched for? What namespaces? etc. –Novem Linguae (talk) 15:51, 1 November 2022 (UTC)Reply[reply]
Excuse the language, but what about fuck off, cunt, and fuckwit for just one day on the revision history for article? Wakelamp d[@-@]b (talk) 13:51, 4 November 2022 (UTC)Reply[reply]
One day? Yesterday had only this one edit. The whole of October had 28, half "cunts" (with plenty of false positives), half "fuck offs", and no "fuckwits": quarry:query/68602. —Cryptic 14:23, 4 November 2022 (UTC)Reply[reply]
Thank-you. Scunthorpe is an awesome false positive :-) But overall it's far less than I expected.... Maybe people don't attack each other on summary as the other editor won't see it unless they look, or are pinged? Or maybe other words are being used? Also The WMF/ Google- JIGSAW paper, found that 80 % of the abuse was from Ip editors, or very new editors,...
  • So may I have another query that doesn't have a filter on the words, has editor name, and their edit count
  • I will then download the file, import it into Excel to analyze (Excel has a limit of 1.2 m rows, and 2 GB size)
  • The file should be within the limits - en WP has 43K revisions per day? (24 hr/day* 60 min/hr *30 edits per day) which I think is 17 MBish (your 28 records saved as 12 K in Excel so 400 bytes
Wakelamp d[@-@]b (talk) 07:15, 5 November 2022 (UTC)Reply[reply]
On November 4 UTC, there's currently 177248 non-deleted revisions, of which 99916 are currently in mainspace. So your estimate for rowcount is off by about a factor of either 2 or 4; I don't care to guess what the true average size for edit summaries will be.
You want, what? Username/ip, editcount, edit summary? (Edit count for ips isn't easily available the way it is for registered users.) Do you need the page name, a timestamp, and/or a rev_id (for Special:Diff/)? And just in mainspace again? —Cryptic 09:30, 5 November 2022 (UTC)Reply[reply]
See also Scunthorpe problem. Certes (talk) 11:44, 5 November 2022 (UTC)Reply[reply]
I am always happy within a factor of 10 - do not sit on my IKEA furniture :-). - and i expect the distribution of size to strongly skewed left so I agree on not guessing) My plan is to use Excel and keep on introducing filters until I can find 80 %ish. The Scunthorpe problem is a big issue.
  • Edit histories for Article/amin, article talk, user talk as separate files. I am a bit concerned how my laptop will cope as it ha16 b of ram, an the excel files take up more in ram than in storage.
  • Page name (so I can remove comments fuckwit on page fuckwit)
  • Username Edit summary
  • Edit count (which you have advised will not include Ips)
  • minor edit ( just interested to see)
Hypothesis I am trying to check -
1 . Was the WMF research correct that IP editors, and editors with less than 5 comments (I am using edits instead) are responsible for 80 % of abuse?
2. Was the WMF research correct that User talk has 3 times more abusive than artic
3. Was the WMF research correct that a few users create 10 % of the abuse? Wakelamp d[@-@]b (talk) 14:58, 6 November 2022 (UTC)Reply[reply]
Mainspace, Talk:, User talk:. All are for November 5; the first is large, so may take a while to open and sort. —Cryptic 20:04, 6 November 2022 (UTC)Reply[reply]
@Cryptic Thank-you for your help. I am finding some dupes ie,g, , https://en.wikipedia.org/w/index.php?title=Talk:1885_in_Norway&action=history , Wakelamp d[@-@]b (talk) 13:47, 8 November 2022 (UTC)Reply[reply]
I don't understand what you mean? There's only one result for Talk:1885 in Norway (ascending sort by pagename, and it's on page 8), and there's exactly one edit from November 5 in its history. Yes, Nythar made many edits to similarly-named pages that day with the same edit summary, but you said you wanted no filter. —Cryptic 03:08, 9 November 2022 (UTC)Reply[reply]

Living people over some age[edit]

How can I get a list of all the individuals currently alive in the English Wikipedia over the age of say 95?

Thank You Dleigthful (talk) 06:18, 21 November 2022 (UTC)Reply[reply]

@Dleigthful: This is easily done with PetScan instead of a database query. Here's a list of people in Category:1904 births through Category:1927 births (pagepile 46206) that are also in Category:Living people: pagepile 46207 (PetScan query). – Joe (talk) 06:42, 21 November 2022 (UTC)Reply[reply]
@Joe RoeI tried going to this link that you sent here and I can't seem to get it to load. I tried going to your PetScan query above but I can't get it to load for some reason. Am I doing something wrong by calling it by clicking on the link, or is there something else I have to do to get your output. Dleigthful (talk) 06:50, 21 November 2022 (UTC)Reply[reply]
It takes a while to run the query. Click this link to just see the results. – Joe (talk) 07:41, 21 November 2022 (UTC)Reply[reply]
@Dleightful @Joe Roe would you be able to go over how you got this answer? I'm new to PetScan. For instance, how would I change the parameter to say everyone older than 93 years instead. Dleigthful (talk) 01:38, 23 November 2022 (UTC)Reply[reply]
@Dleigthful: So you have to do it in two stages. First enter the list of birth-year categories under categories and select "union" to get a list of all the articles in any of those categories. Under output you can get this list as a pagepile, which you then use as the input of the second query (under other sources) along with Category:Living people. – Joe (talk) 07:01, 29 November 2022 (UTC)Reply[reply]
I'm not familiar with pagepile, but if your description of what it's querying is accurate, it's insufficient - the birth year categories only have articles where the exact year is known. Russ Witte, for example, is in only in Category:1910s births, since it's unknown whether he was born in 1916 or 1917. (It also omits Lucile Randon from Category:1904 births.) quarry:query/69125 has pages in Category:Living people and at least one of Category:1900s births, Category:1910s births, or Category:1900 birthsCategory:1927 births. —Cryptic 07:43, 21 November 2022 (UTC)Reply[reply]
Oh yes, I went from 1904 births because of Randon too, that was a typo. Good point with the uncertain year births, although comparing our results that apparently that only includes nine people :) – Joe (talk) 07:50, 21 November 2022 (UTC)Reply[reply]
A Wikidata query in SPARQL (tutorial) could also select people with a suitable DoB, no death date and an enwp article, though some of them may have died without bothering to inform Wikidata. Certes (talk) 13:14, 21 November 2022 (UTC)Reply[reply]

Displaying all categories in [Category:Container categories] which also have mainspace pages in them.[edit]

This would make finding and fixing these container categories much easier. DemonStalker (talk) 01:58, 23 November 2022 (UTC)Reply[reply]

quarry:query/69189. —Cryptic 02:49, 23 November 2022 (UTC)Reply[reply]