
Normally, when we read a CSV file in R, spaces in the column names are automatically converted to '.':

> df <- read.csv("report.csv")
> str(df)
'data.frame':   598 obs. of  61 variables:
 $ LR.Number   
 $ Vehicle.Number   

However, when we read the same CSV file in SparkR, the spaces remain intact and are not handled implicitly by Spark:

#To read a csv file
df <- read.df(sqlContext, path = "report.csv", source = "com.databricks.spark.csv", inferSchema = "true", header="true")
printSchema(df)

root
 |-- LR Number: string (nullable = true)
 |-- Vehicle Number: string (nullable = true)

Because of this, performing any operation on such a column causes a lot of trouble, and the column has to be referenced with backticks, like this:

head(select(df, df$`LR Number`))

How can I handle this explicitly? Or can SparkR handle it implicitly?

I am using SparkR version 1.5.0.

Hardik Gupta
  • not sure if this will help, as the delimiter determines how the columns are separated. Here the columns are "," separated and each column name contains a space, so I have col1,col2 where col1 is 'name1 name2' and col2 is 'name1 name2' – Hardik Gupta Dec 16 '16 at 11:26
  • in normal R this gets handled automatically; try reading a csv file which has column names with spaces, and R automatically inserts a dot (.) – Hardik Gupta Dec 16 '16 at 11:27
  • Set `header = "true"` and `inferSchema = "false"` to skip the names and have it use built-in ones, or use `selectExpr()`, which supports `col_name AS new_col_name`, as seen in this Python example: http://stackoverflow.com/a/34077809/1457051 (which should be straightforward to extrapolate from; see the sketch after these comments). I'm running `sparklyr` with Spark 2.x when I use Spark, so I'm not keen to test `sparkR` with an older Spark version. You can also use `sql()` to import the CSV with SQL as shown in the "SQL" section of https://github.com/databricks/spark-csv (you can change col names that way in the `CREATE TABLE` call). – hrbrmstr Dec 16 '16 at 12:01
  • `header = "true"` and `inferSchema = "false"` doesn't work – Hardik Gupta Dec 16 '16 at 12:54
  • then you're stuck at the SQL level or potentially upgrading spark & sparkR to see if newer builds pass on the parameters. If you aren't using sparkR for the ML components but just for the "big data" capabilities take a look at Apache Drill. If all else fails, you can load the CSV in via spark SQL outside of R. – hrbrmstr Dec 16 '16 at 13:57
  • upgrading Spark is not an option; I have to live with 1.5.0 (since I am using Cloudera CDH 1.5.0 as well) – Hardik Gupta Dec 16 '16 at 14:15
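
For reference, here is a minimal sketch of the selectExpr() route suggested in the comments, assuming the same df loaded via spark-csv as in the question; the underscored names are illustrative, and note that selectExpr() returns only the columns you list, so all 61 would need to be spelled out:

# Rename on the Spark side with SQL expressions; backticks are required
# because the original column names contain spaces
df2 <- selectExpr(df,
                  "`LR Number` AS LR_Number",
                  "`Vehicle Number` AS Vehicle_Number")
printSchema(df2)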

2 Answers


As a workaround, you could rename the columns with the following piece of pseudo-code:

# Replace spaces in the column names with underscores
colnames_df <- colnames(df)
colnames_df <- gsub(" ", "_", colnames_df)
colnames(df) <- colnames_df
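
If colnames<- is not available for Spark DataFrames in your SparkR build, here is a sketch of the same renaming that stays on the Spark side, using columns() and withColumnRenamed() from the SparkR 1.5 API (no local collect needed):

# Rename every column whose name contains a space, one at a time
for (old_name in columns(df)) {
  new_name <- gsub(" ", "_", old_name)
  if (new_name != old_name) {
    df <- withColumnRenamed(df, old_name, new_name)
  }
}
printSchema(df)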

Another solution is to save the file somewhere with corrected column names and read it back using read.df().

Mohit Bansal

The following worked for me:

# Bring the Spark DataFrame down to a local R data frame
df <- collect(df)

# Replace spaces in the column names with underscores
colnames_df <- colnames(df)
colnames_df <- gsub(" ", "_", colnames_df)
colnames(df) <- colnames_df

# Convert back to a Spark DataFrame with the cleaned names
df <- createDataFrame(sqlContext, df)
printSchema(df)

Here we need to collect the data locally first, which converts the Spark DataFrame into a regular R data frame. I am sceptical whether this is a good solution, as I would rather not call collect(). However, I investigated and found that even to use the ggplot libraries we need to convert the data into a local data frame anyway.
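
For illustration, a minimal sketch of that ggplot round trip; the bar chart over Vehicle_Number is just an example, assuming ggplot2 is installed:

library(ggplot2)

# ggplot2 only understands local data frames, so collect first
local_df <- collect(df)

# Example: number of rows per Vehicle_Number
ggplot(local_df, aes(x = Vehicle_Number)) + geom_bar()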

Hardik Gupta