Good news: 44,583 usable responses!

Bad news: Responses alone use 4,592,255 cells on Google Sheets, which is 92% of the 5 million cell limit, before I even start writing formulae to count everything!

Good news: I've split it into one Sheet per question, and the data is now processed!

Bad news: This means I will definitely have to (*gulp*) learn new software and maybe even upgrade my computer in order to handle next year's responses, assuming they increase at a similar rate.

Good news: Maybe tomorrow I can start writing up the worldwide report for 2021??

Please note, this is very optimistic, I may not manage it because I have a LOT of fatigue today. On the other hand, it might be a pretty effective way to keep me resting on the sofa. 🤔

I wish I could somehow crowdfund "Cassian learns a new software" - no amount of money will make my brain do the thing, you know?

On that note! Statisticians, can you help out?

I'm going to need some software that can handle a LOT of calculations on a fairly average computer, involving probably 50,000 survey responses. The most important factor is user-friendliness.

Any recommendations?

Money is a secondary consideration to user-friendliness - it doesn't have to be free. If I can't learn how to use it without going to uni then the Gender Census is kaput, y'know? So if it has to cost money to be solo-learnable (a real phrase) then I may have to crowdfund.

Another option might be... to limit the survey to 40,000 responses per year? 🤨 (Very much not what I want to do.)

Please use PostgreSQL @gendercensus !
It'll do 50,000 calculations instantly on an average computer, or in 10 seconds max on a toaster, and save you lots of headaches in the future!

@foxxy Sounds good so far. Can it be used without any coding experience at all?

@gendercensus @foxxy Not particularly. Though depending on how you wanna do it @foxxy and I might be able to assist.

@doxxy @foxxy That's promising! Thank you, I will make a note to look into it when it's not late o'clock!

@gendercensus @doxxy @foxxy I too would be willing to extend assistance there!

One thing I'd point out here, philosophically, is that Excel is a program designed to trick you into thinking you're not "coding" when you use it. SQL is much more nakedly 'code', but it's very much the same type of code as Excel formulas; more of your current skills will transfer than you might think.

@Nentuaby @gendercensus @doxxy @foxxy SQL for a statistics problem is very much a case of "when all you have is a hammer, everything looks like a nail."

Vanilla SQL doesn't do much in terms of visualization, summary statistics, or hypothesis testing. And since SmartSurvey delivers results in CSV and SPSS formats, there's little reason to go into SQL for this sort of thing.

@cbrachyrhynchos @Nentuaby @gendercensus @foxxy This is probably why R and Julia were mentioned. Lots of tools there for easy data visualization.

@cbrachyrhynchos @Nentuaby @gendercensus @doxxy @foxxy This. I spend a fair bit of time at work banging on SQL to make it do stuff that everyone wishes we could just do in R. It's bad; you should not do data analysis with SQL unless a whole-ass extremely stubborn corporate IT bureaucracy has tied your hands to the keyboard to make you.

It will also be much easier to download RStudio than to set up a postgres server.

@cbrachyrhynchos @Nentuaby @gendercensus @doxxy @foxxy That said, even though R is a fine way to start learning to code, with a lot of tutorials available geared very specifically towards people with no coding experience who want to analyze some data... it is still coding.

MS Excel will support the amount of data you're talking about.

@mcmoots @cbrachyrhynchos @Nentuaby @doxxy @foxxy

"MS Excel will support the amount of data you're talking about."

It sounds like I probably just need to get a computer that can cope with it then, rather than it being a software issue. Thanks for all the info!

@gendercensus @cbrachyrhynchos @Nentuaby @doxxy @foxxy I mean, R would be absolutely fine with that much data on a crappy old machine. So if you aren't able to source a new computer that's still an option. But yeah, sometimes the cheapest way to pay for something is with money.

@mcmoots @cbrachyrhynchos @Nentuaby @doxxy @foxxy The most important factor is user-friendliness, I have zero experience with code and I'm not able to learn it on demand solo, so going with a program that's more familiar is more likely to succeed I think! :)

@gendercensus Echoing a lot of the replies here. 50,000 unique identifiers and 5M data points is really trivial for most decent computers if managed correctly. I recommend not using Excel or LibreOffice as it'll be frustrating. You will want to use PostgreSQL, R, or Julia as recommended. If you literally have zero experience with these languages I recommend taking up people's offers to help.

My experience is in R, and it literally is just a few lines of code to get the data loaded in a format that will allow exploration through many different statistical lenses. R was designed specifically for statistical analysis, and slicing and dicing large data sets (orders of magnitude larger than what you are working with). There are lots of packages that enable pretty sophisticated analyses with a single line of code.
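(For a rough sense of what "a few lines to load and count" looks like - sketched here in Python rather than R, with a made-up file layout and column name, since the real Gender Census export isn't shown in this thread:)

```python
import csv
import io
from collections import Counter

# Hypothetical three-row survey export; a real file would be opened
# with open("responses.csv") instead of io.StringIO
csv_text = """respondent_id,identity_term
1,nonbinary
2,genderqueer
3,nonbinary
"""

reader = csv.DictReader(io.StringIO(csv_text))

# Frequency count for one question -- the COUNTIF step from the spreadsheet
counts = Counter(row["identity_term"] for row in reader)
print(counts)  # Counter({'nonbinary': 2, 'genderqueer': 1})
```

The equivalent in R (`read.csv` plus `table`) is about the same length; the point is that the loading-and-counting step is short in any of the suggested tools.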

I am also willing to assist. You've done the hard part of creating the survey, marketing it, and getting responses. Let others help <3

@jqiriazi When you say Excel would be comparatively frustrating - why is that, in your experience?

@gendercensus It'll be fine when doing simple distributions for each individual question as 50,000 rows per sheet is manageable. But, when you want to create statistics across multiple questions to tease out potentially significant and actionable nuances (e.g., x% of age group y represent the majority of answer a in question 3), the spreadsheet format and interface will start to be a barrier, and increase the likelihood of errors.

I have experience working with large complex spreadsheets, and I find them very challenging to debug (I'm currently doing this for a spreadsheet with 96 sheets, 127,141 cells, 56,917 formulas, and 864 charts so I have experience here).

R allows analysis of statistically significant correlations across any number of questions with literally 1 - 5 lines of code. It's much easier to debug. I will also claim that, for statistics, you'll have many more options and much more confidence in the results using R.
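(As an illustration of that kind of cross-question breakdown - a Python/pandas sketch rather than R, with invented column names and data, since the actual survey columns aren't shown here:)

```python
import pandas as pd

# Made-up responses: one age-group answer and one question-3 answer per person
df = pd.DataFrame({
    "age_group": ["11-15", "11-15", "16-20", "16-20", "16-20"],
    "q3_answer": ["a", "b", "a", "a", "b"],
})

# Percentage of each answer within each age group, in one line --
# the cross-tabulation that gets error-prone in spreadsheet formulae
table = pd.crosstab(df["age_group"], df["q3_answer"], normalize="index") * 100
print(table)
```

The R equivalent (`prop.table(table(age_group, q3_answer), 1)`) is similarly one line, which is the "1-5 lines of code" claim above.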

@jqiriazi Thanks for the detailed explanation, much appreciated! :)

@gendercensus LibreOffice claims that it can handle over a million rows and a billion cells per worksheet. It's probably the easiest free software coming from an MS Excel/Google Sheets background.

R will crunch the data easily on a small computer, but isn't remotely user-friendly.

@gendercensus There is R Commander which might help out if you go the R route.

@gendercensus And if money is negotiable compared to UI, both Google Sheets and LibreOffice Calc use the same formula language as MS Excel (which probably originated in a completely different software package).

@naga @jessmahler @gendercensus
The typical recommendation would be SPSS or PSPP, but I honestly don't know from the above if it matches the requirements. On SPSS I've worked on tables with about 1k variables and 130k respondents.

@danielscardoso @jessmahler @gendercensus Yeah, if SPSS is affordable. I went with Jamovi to avoid recommending R itself, plus the spreadsheet interface.
