I'm using C++ in Visual Studio to create a Windows console application that computes the means of X, 1/X and ln X for the positive values of X stored in the first n A-column cells of a CSV. My strategy is to push_back the cells' contents into a vector, then sum values derived from the vector's entries (e.g. reciprocals for 1/X) and divide the result by the vector's length. The mean of X comes out slightly smaller than it should, indicating that the vector length is 1 more than the sample size and that the extra entry is zero. This would also explain why the means of 1/X and ln X are inf and -inf respectively. So in theory one solution is to pop_back the vector before computing the statistics. Unfortunately, I've tried this and countless other methods, and nothing works.
I'll conclude by copy-pasting a minimal example of the code, and listing things I've tried. (If you think one of those methods "should" work and I probably flubbed their execution, please check it does before posting, because this program has been surprisingly stubborn for the past 3 days.) The CSV I used is here.
#include "stdafx.h"
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <cstdlib>   // atof
#include <cmath>     // log
#include <limits>    // numeric_limits, used by the cin.ignore call below

int main()
{
    std::ifstream file("Example.csv");
    std::string valuetmp;
    std::vector<double> dataset;

    while (file.good())
    {
        getline(file, valuetmp);
        dataset.push_back(::atof(valuetmp.c_str()));
    }

    int n = dataset.size();

    double sigmaxi = 0;
    for (int i = 0; i < n; i++)
        sigmaxi += dataset[i];
    double meanxi = sigmaxi / n;

    double sigma1overxi = 0;
    for (int i = 0; i < n; i++)
        sigma1overxi += 1.0 / dataset[i];
    double mean1overxi = sigma1overxi / n;

    double sigmalnxi = 0;
    for (int i = 0; i < n; i++)
        sigmalnxi += log(dataset[i]);
    double meanlnxi = sigmalnxi / n;

    std::cout << "The mean of X is " << meanxi
              << ", whereas the mean of 1/X is " << mean1overxi
              << ", and the mean of ln X is " << meanlnxi << ". \n";
    std::cout << "Press ENTER to close.";
    std::cin.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
    return 0;
}
I've tried:
The erase-remove idiom;
Changing the push_back rule so that only non-zero entries are added to dataset;
Renaming the original vector baddataset, then defining dataset as the first baddataset.size()-1 entries of baddataset;
Writing int n = dataset.size()-1; (or various syntactic variants thereof, but nothing seems to convey the "1 less than that" instruction);
Writing int n = dataset.size(); and then using n = n-1; or n--; to reduce n by 1;
Writing int badn = dataset.size(); int n = badn-1; (you can see how desperate I'm getting);
Replacing for (i = 0; i < n; i++) in the summations with for (i = 0; i < n-1; i++), then dividing by n-1 at the end instead of n;
Guarding each loop body so that only nonzero entries add anything (e.g. their logarithm) to the sum being computed;
Defining oldsigma1overxi etc., making sure these "old" sums store the other sums' previous values, and resetting a sum to its "old" value whenever it becomes inf or nan (this doesn't fix the division-by-the-wrong-n problem, but it'd be something);
Changing the functions to approximations that don't diverge at 0 (I'll eventually work with a data set of large numbers, from 40,000 to 6,000,000,000); I haven't a clue why the program doesn't calculate anything when I do that.