
I am new to Python and Jupyter Notebook and I am currently following this tutorial: https://www.dataquest.io/blog/jupyter-notebook-tutorial/. So far I've imported the pandas library and a couple other things, and I've made a data frame 'df' which is just a CSV file of company profit and revenue data. I'm having trouble understanding the following line of the tutorial:

non_numberic_profits = df.profit.str.contains('[^0-9.-]')

I understand the point of what the tutorial is doing: identifying all the companies whose profit variable contains a string instead of a number. But I don't understand the point of [^0-9.-] and how the above function actually works.

My full code is below. Thanks.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")

df = pd.read_csv('fortune500.csv')
df.columns = ['year', 'rank', 'company', 'revenue', 'profit']
non_numberic_profits = df.profit.str.contains('[^0-9.-]')
df.loc[non_numberic_profits].head()
Brad Solomon
Leonidas

1 Answer


The expression [^0-9.-] is a so-called regular expression, which is a special text string for describing a search pattern. With regular expressions (or in short 'RegEx') you can extract specific parts of a string. For example, you can extract foo from the string 123foo456.
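As a minimal sketch of that extraction, using Python's built-in re module (the pattern [a-z]+ is just one illustrative choice for picking out the letters, not the only way to do it):

```python
import re

text = "123foo456"

# [a-z]+ matches one or more consecutive lowercase letters.
match = re.search(r"[a-z]+", text)
extracted = match.group() if match else None

print(extracted)  # foo
```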

In RegEx, square brackets [] define a character class: a set of characters, any one of which can match at that position. For example, [bac] matches each of the characters a, b and c in the string abcdefg. [bac] could also be rewritten as the range [a-c].

Using [^...] you can negate a character class. Thus, the RegEx [^a-c] applied to the above example would match each of the characters d, e, f and g.
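To see how a character class and its negation behave, here is a small sketch with Python's re module. re.findall lists every match separately, which shows that a bracket expression matches one character at a time:

```python
import re

text = "abcdefg"

# [bac] matches any single character that is a, b or c.
in_class = re.findall(r"[bac]", text)

# [^a-c] matches any single character that is NOT a, b or c.
not_in_class = re.findall(r"[^a-c]", text)

print(in_class)      # ['a', 'b', 'c']
print(not_in_class)  # ['d', 'e', 'f', 'g']
```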

Now here is a catch:
Since ^ and - have a special meaning inside [], they have to be put in specific positions (or escaped with a backslash) in order to be matched literally. ^ negates the class only when it is the first character after [; anywhere else it is a literal ^. Likewise, - denotes a range only when it sits between two characters, so to match - literally you put it at the very start or the very end of [], for example [abc-].
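The positional rules for - and ^ can be checked with a short sketch. The test string a-b^c is made up for illustration; the example also shows escaping with a backslash as an alternative to careful placement:

```python
import re

text = "a-b^c"  # made-up sample string

# A dash at the end of the class is a literal dash, not a range.
dash_last = re.findall(r"[abc-]", text)

# An escaped dash is literal in any position.
dash_escaped = re.findall(r"[a\-c]", text)

# A caret that is not the first character is a literal caret.
caret_literal = re.findall(r"[a^]", text)

print(dash_last)      # ['a', '-', 'b', 'c']
print(dash_escaped)   # ['a', '-', 'c']
print(caret_literal)  # ['a', '^']
```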

Putting it all together
The RegEx '[^0-9.-]' means: 'Match any single character that is not a digit 0 through 9, a dot (.) or a dash (-)'. You can see your regular expression applied to some example strings here.

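The pandas method df.profit.str.contains('[^0-9.-]') checks, for each string in the profit column of your DataFrame, whether the RegEx matches anywhere in it, that is, whether the string contains at least one character other than a digit, a dot or a dash. It returns True for strings that do and False for strings that don't. The result is a pandas Series of True/False values, which df.loc[non_numberic_profits] then uses to select only the rows with non-numeric profits.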
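Here is a small self-contained sketch of the same call. The Series values below are hypothetical stand-ins for the profit column, not rows taken from fortune500.csv:

```python
import pandas as pd

# Hypothetical sample values standing in for df.profit
profit = pd.Series(["49.7", "-12.3", "N.A.", "1,200"])

# True wherever the string contains a character that is
# not a digit, a dot or a dash.
non_numeric = profit.str.contains("[^0-9.-]")

print(non_numeric.tolist())          # [False, False, True, True]
print(profit[non_numeric].tolist())  # ['N.A.', '1,200']
```

Note that "1,200" is flagged too, because the comma is not in the allowed set of characters.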


If you're ever stuck, the Pandas docs are your friend. Stack Overflow's What Does this Regex Mean? and Regex 101 are also good places to start.

Milo