-1

I download a CSV file from the internet which contains information about the file in the first row and headers begin on the second row. If I open this CSV file in Excel, cell I1 has Viewing=[12/15/19 - 12/21/19]. I would like to extract the first date in this cell 2019-12-15 and create a new column with this date that runs all the way down.

df before (z-r contain information about the file a-j are actual column headers):

z,y,x,w,v,u,t,s,Viewing=[12/15/19 - 12/21/19],r
a,b,c,d,e,f,g,h,i,j
1,2,3,4,5,6,7,8,9,10

df expected after (information rowgone):

date,a,b,c,d,e,f,g,h,i,j
2019-12-15,1,2,3,4,5,6,7,8,9,10

How can I extract that specific value (2019-15-19 in this case)? I know that once I get it I can do df['date'] = date_extracted. My guess is that this is easiest done with regex, but I have almost no knowledge of it.

kindofhungry
  • 129
  • 11
  • 1
    It looks like you are looking to create a regex, but do not know where to get started. Please check [Reference - What does this regex mean](https://stackoverflow.com/questions/22937618) resource, it has plenty of hints. Also, refer to [Learning Regular Expressions](https://stackoverflow.com/questions/4736) post for some basic regex info. Once you get some expression ready and still have issues with the solution, please edit the question with the latest details and we'll be glad to help you fix the problem. – Wiktor Stribiżew Dec 04 '20 at 10:07
  • 1
    Questions that ask ["Give me a regex that does X"](https://meta.stackoverflow.com/q/285733) with no attempt are off topic on Stack Overflow. Also, see [Why is “Can someone help me?” not an actual question?](https://meta.stackoverflow.com/questions/284236) – Wiktor Stribiżew Dec 04 '20 at 10:07

2 Answers2

1

If your formats are going to be as consistent as that, you could get away without using regex:

line1 = 'z,y,x,w,v,u,t,s,Viewing=[12/15/19 - 12/21/19],r'

extracted_date = line1.split('[')[1].split(' -')[0]
print(extracted_date)

Output:

12/15/19

If you want the regex option:

extracted_date = re.findall('([0-9/]{6,})', line1)[0]

From outside in: ( ... ) is the capture group, i,e, the part we want to extract. [0-9/] is to define which characters are acceptable. {6,} means at least 6 characters are required (in case your date looks like 1/1/19). The result is a list of all the matches, so pick the first one using [0].

Benedictanjw
  • 708
  • 1
  • 5
  • 14
0

Use base string split method to get you specific subscting

In [2]: df = pd.DataFrame(columns=['a','b','c','d','e','f','g','h','i','j'])
   ...: df.loc[0] = "z,y,x,w,v,u,t,s".split(",") + ["Viewing=[12/15/19 - 12/21/19]","r"]
   ...: df.loc[1] = list(range(10))
   ...: df
Out[2]: 
   a  b  c  d  e  f  g  h                              i  j
0  z  y  x  w  v  u  t  s  Viewing=[12/15/19 - 12/21/19]  r
1  0  1  2  3  4  5  6  7                              8  9

In [3]: df['New_col'] = df['i'].str.split('[').str[1].str.split().str[0].iloc[0]

In [4]: df
Out[4]: 
   a  b  c  d  e  f  g  h                              i  j   New_col
0  z  y  x  w  v  u  t  s  Viewing=[12/15/19 - 12/21/19]  r  12/15/19
1  0  1  2  3  4  5  6  7                              8  9  12/15/19
Amir saleem
  • 695
  • 5
  • 9