I am working with a CSV data set of around 1 million records. I need to perform two operations on it:

  1. Prepare a data set that excludes every row containing a missing (blank) value.
  2. Prepare another data set in which empty values are replaced with "unknown".

I tried doing this in Excel, but it takes too much time. Can someone explain how to do it in R?

Subham Tripathi
  • A good question would include a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with some minimal data that represents your real data and gives the desired output. – MrFlick Jun 15 '15 at 05:13
  • Please post what you already tried in R – lanenok Jun 15 '15 at 05:13
  • @MrFlick, I am afraid I can't post the data here as it's against company policy, though I can share details about the nature of the data. What details are you looking for? – Subham Tripathi Jun 15 '15 at 05:21
  • @MichaelVE unfortunately I don't think this dupe target works because the OP has asked two questions -- how to remove rows with missing values (which your dupe covers) and how to replace them with some other value (which your dupe doesn't cover). – josliber Jun 15 '15 at 05:55
  • @SubhamTripathi The trick is to make a reproducible sample, which is possible without showing your own data. You can just make a "fake" dataset that contains some rows with empty cells so that we can mimic your problem. – MichaelVE Jun 15 '15 at 23:53

1 Answer

To get complete cases, use this:

complete_df <- df[complete.cases(df),]

complete.cases returns a logical vector that tells you which rows of dataframe df are complete, and you can use that to subset the data.
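As a minimal sketch of how this behaves, here is a small made-up data frame (the column names and values are hypothetical, not the asker's data) with one `NA` in a character column and one in a numeric column:

```r
# Hypothetical toy data standing in for the real CSV.
df <- data.frame(
  id    = c(1, 2, 3, 4),
  city  = c("Pune", NA, "Delhi", "Agra"),
  score = c(10.5, 7.2, NA, 3.1),
  stringsAsFactors = FALSE
)

keep <- complete.cases(df)   # logical vector, one entry per row
complete_df <- df[keep, ]    # keep only the rows without any NA

print(keep)
print(complete_df)
```

Rows 2 and 3 each contain an `NA`, so only rows 1 and 4 remain in `complete_df`.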

To replace the NAs, you can use this:

new_df <- df
# is.na() needs the data frame as its argument
new_df[is.na(new_df)] <- 'Unknown'

But this can change the data types of the columns that have missing data. For example, if you have a column of numeric data and you replace its missing values with 'Unknown', the whole column becomes a character variable, so be aware of this.
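The type change can be seen on a small made-up data frame. The sketch below also shows one possible way around it, which is an assumption on my part rather than something the asker requested: replace `NA` only in character columns and leave numeric columns as they are.

```r
# Hypothetical toy data; the column names are made up.
df <- data.frame(
  city  = c("Pune", NA, "Delhi"),
  score = c(10.5, NA, 3.1),
  stringsAsFactors = FALSE
)

# Replacing everywhere: the numeric column is coerced to character.
all_df <- df
all_df[is.na(all_df)] <- "Unknown"
class(all_df$score)   # "character"

# Replacing only in character columns: numeric columns keep their type
# (and their NAs).
chr_df <- df
is_chr <- vapply(chr_df, is.character, logical(1))
chr_df[is_chr] <- lapply(chr_df[is_chr], function(x) {
  x[is.na(x)] <- "Unknown"
  x
})
class(chr_df$score)   # "numeric"
```

If the text columns were read in as factors, convert them with `as.character` first; assigning a value that is not an existing factor level produces a warning and an `NA` instead.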

goodtimeslim
  • But I don't have NA in my data set; I have missing cells that contain nothing. Will these cells be treated as NA automatically? – Subham Tripathi Jun 15 '15 at 05:26
  • It depends on how you're loading the data. If you're using something like `read.csv`, there is an option to turn those into `NA`. `read.csv(my_data, na.strings = ' ')` treats any field that is only a single space as a missing value. If several things represent missing values, you can do something like this: `read.csv(my_data, na.strings = c(' ', 'NA'))`. Since your data is in CSV, you can open the raw file in Notepad to see what a blank cell contains, or you can print a missing entry in R and it will show you (e.g. `df$column_name[4]` will show " ") – goodtimeslim Jun 15 '15 at 05:32
  • @SubhamTripathi unfortunately without providing a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) in your question we are only guessing about the structure of your data and what code will work for you. – josliber Jun 15 '15 at 05:33
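
The `na.strings` suggestion from the comments can be tried without the real file. This is only a sketch: the CSV text below is made up, and including the empty string `""` alongside a single space is an assumption about what the blank cells actually contain.

```r
# Hypothetical CSV content standing in for the real file.
csv_text <- "id,city,score
1,Pune,10.5
2,,7.2
3, ,
4,Agra,3.1"

df <- read.csv(
  text = csv_text,
  na.strings = c("", " ", "NA"),   # empty, single space, or literal NA
  stringsAsFactors = FALSE
)

colSums(is.na(df))   # number of missing values per column
```

For the real data, pass the file path as the first argument instead of `text = csv_text`. Once the blanks are read as `NA`, the `complete.cases` and `is.na` approaches from the answer apply directly.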