0

After reading in csv file which features a number within its filename, I would like to create a column in the dataframe called 'station_no' and populate each row with the number that forms part of the filename. The typical file format name can be seen below.

For example:

air_temp_df <- read_csv('data/2015.AT[AirTemperature].csv.zip', skip = 15) %>% mutate(station_no = XXXXXX)

where xxxx is somewhere I would like the station_no column to be populated with multiple rows of 2015.

Thanks.

Drew
  • 75
  • 6
  • This question will be helpful: https://stackoverflow.com/questions/52722846/how-can-i-remove-non-numeric-characters-from-strings-using-gsub-in-r/52729957 – bouncyball Feb 27 '20 at 15:22

1 Answers1

1

Solution using base R:

file_path <- 'data/2015.AT[AirTemperature].csv.zip'
air_temp_df <- read_csv(file_path, skip = 15)
air_temp_df$station_no <- gsub("^.*/([0-9]{4})\\..*", "\\1", file_path)

We can use regex to extract the four digits year from the file_path string.

The regex identifies : start of line (^) followed by 'something or nothing' (.*) followed by a "/" (/) followed by a group of four numbers that we want to use later (marked by round parentheses) ([0-9]{4}) followed by a dot (\\.) followed by something or nothing (.*).

Then we replace the whole string with only the first (an in this case only) group (\\1)

This solution should also work if there are other numbers in the file_path (Using "[^0-9]" would not work in these cases)

dario
  • 4,863
  • 1
  • 9
  • 23
  • Many thanks. How would one do this if the integers were to vary in the filename? Is there some code that could pull out a particular part of the filename. I'm thinking for ''data/2015.AT[AirTemperature].csv.zip' to extract entries between / and .AT – Drew Feb 27 '20 at 15:42
  • What do you mean by `vary`? That they are not always four digits long? Or that there may be other, non numeric characters? – dario Feb 27 '20 at 15:45
  • 1
    With regex it's always important to be as strict as possible, otherwise it's just another footgun. If we only want to extract digits (numbers) of different length at this position we should use: gsub("^.*/([0-9]+)\\..*", "\\1", file_path)If we want to extract *anything* (i.e. numbers, letters, punctuation) we can use ``extract entries between / and .AT` -> gsub("^.*/(.*)\\.AT.*", "\\1", 'data/2015.AT[AirTemperature].csv.zip')` – dario Feb 27 '20 at 15:48