I'm using C++ in Visual Studio to create a Windows console application that computes the means of X, 1/X and ln X for the positive values of X stored in the first n A-column cells of a CSV. My strategy is to push_back the cells' contents into a vector, then sum values derived from the vector's entries (e.g. reciprocals for 1/X) and divide the result by the vector's length. The mean of X comes out slightly smaller than it should, indicating that the vector length is 1 more than the sample size and that the extra entry is zero. This would also explain why the means of 1/X and ln X are inf and -inf respectively. So in theory one solution is to pop_back the vector before computing the statistics. Unfortunately, I've tried this and countless other methods, and nothing works.
I'll conclude by copy-pasting a minimal example of the code, and listing things I've tried. (If you think one of those methods "should" work and I probably flubbed their execution, please check it does before posting, because this program has been surprisingly stubborn for the past 3 days.) The CSV I used is here.
#include "stdafx.h"
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <cstdlib>   // atof
#include <cmath>     // log
#include <limits>    // numeric_limits, used by the cin.ignore call below

int main()
{
    std::ifstream file("Example.csv");
    std::string valuetmp;
    std::vector<double> dataset;

    while (file.good())
    {
        getline(file, valuetmp);
        dataset.push_back(::atof(valuetmp.c_str()));
    }

    int n = dataset.size();

    double sigmaxi = 0;
    for (int i = 0; i < n; i++)
        sigmaxi += dataset[i];
    double meanxi = sigmaxi / n;

    double sigma1overxi = 0;
    for (int i = 0; i < n; i++)
        sigma1overxi += 1.0 / dataset[i];
    double mean1overxi = sigma1overxi / n;

    double sigmalnxi = 0;
    for (int i = 0; i < n; i++)
        sigmalnxi += log(dataset[i]);
    double meanlnxi = sigmalnxi / n;

    std::cout << "The mean of X is " << meanxi
              << ", whereas the mean of 1/X is " << mean1overxi
              << ", and the mean of ln X is " << meanlnxi << ". \n";
    std::cout << "Press ENTER to close.";
    std::cin.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
    return 0;
}
I've tried:
The erase-remove idiom;
Changing the push_back rule so that only non-zero entries are added to dataset;
Renaming the original vector baddataset, then defining dataset as the first baddataset.size()-1 entries of baddataset;
Writing int n = dataset.size()-1; (or various syntactic variants thereof, but nothing seems to convey the "1 less than that" instruction);
Writing int n = dataset.size(); and then using n = n-1; or n--; to reduce n by 1;
Writing int badn = dataset.size(); int n = badn-1; (you can see how desperate I'm getting);
Replacing for (i = 0; i < n; i++) in the summations with for (i = 0; i < n-1; i++), then dividing by n-1 at the end instead of n;
Guarding each loop body so that only nonzero entries add anything (e.g. their logarithm) to the sum being computed;
Defining oldsigma1overxi etc., making sure these "old" sums store the other sums' previous values, and resetting a sum to its "old" value whenever it becomes inf or nan (this doesn't fix the division-by-the-wrong-n problem, but it'd be something);
Changing the functions to approximations that don't diverge at 0 (I'll eventually work with a data set of large numbers, from 40,000 to 6,000,000,000); I haven't a clue why the program doesn't calculate anything when I do that.