0

I have list of column names of csv file like:[email, null, password, ip_address, user_name, phone_no] .Consider I have csv with data:

03-Sep-14,foo2@yahoo.co.jp,,
20-Jan-13,foo3@gmail.com,,
20-Feb-15,foo4@yahoo.co.jp,,
12-May-16,foo5@hotmail.co.jp,,
25-May-16,foo6@hotmail.co.jp,,

Now I want to identify the column names of this csv file on the basis of data, like col_1 is date and col_2 is mail. I tried to use pandas. like getting all values from col_1 and then identify either it is mail or something else but couldn't get much. i tried something like this:

df = pd.read_csv('demo.csv', header=None)
df[df[1].str.contains("@")]

but its not helping me. thank you.

dar
  • 29
  • 5

2 Answers2

0

Have you tried using Pandas dataframe.infer_objects()?

# importing pandas as pd 
import pandas as pd 

# Creating the dataframe 
df = pd.DataFrame({"A":["alpha", 15, 81, 1, 100], 
                "B":[2, 78, 7, 4, 12], 
                "C":["beta", 21, 14, 61, 5]}) 

# data frame info and data
df.info()
print(df)

# slice all rows except first into a new frame
df_temp = df[1:]

# print it
print(df_temp)
df_temp.info()

# infer the object types
df_inferred = df_temp.infer_objects()

# print inferred
print(df_inferred)
df_inferred.info()

Here's the output from the above py script.

Initially df is inferred as object, int64 and object for A, B and C respectively.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   A       5 non-null      object
 1   B       5 non-null      int64 
 2   C       5 non-null      object
dtypes: int64(1), object(2)
memory usage: 248.0+ bytes
       A   B     C
0  alpha   2  beta
1     15  78    21
2     81   7    14
3      1   4    61
4    100  12     5
     A   B   C
1   15  78  21
2   81   7  14
3    1   4  61
4  100  12   5

After removing the first exception row which has the strings, the data frame is still showing the same type.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 1 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   A       4 non-null      object
 1   B       4 non-null      int64 
 2   C       4 non-null      object
dtypes: int64(1), object(2)
memory usage: 228.0+ bytes
     A   B   C
1   15  78  21
2   81   7  14
3    1   4  61
4  100  12   5

After infer_objects(), the types have been correctly inferred as int64.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 1 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       4 non-null      int64
 1   B       4 non-null      int64
 2   C       4 non-null      int64
dtypes: int64(3)
memory usage: 228.0 bytes

Is this what you need?

vvg
  • 882
  • 2
  • 15
  • thank you for your ans but i need to identify the column name from provided list ie. ```[date, ip_address, mail, null,.....]``` not the data type. – dar Aug 27 '20 at 14:32
  • I don't understand. Your question is ill-defined. For eg: If you have a billing address and a shipping address -- both in the CSV file, how would you know which is billing and which is shipping, just by looking at data. You just can't. – vvg Aug 27 '20 at 14:37
  • i have only this columns```[email, null, password, ip_address, user_name, phone_no]``` in that i can identify at lest email, null, phone and ip_address. – dar Aug 27 '20 at 14:56
  • Does the CSV file always appear in the same column format or it comes from various sources and the columns get swapped around? If yes, you may not have to do column inference at all. Do you have control over how the CSV files are generated? If yes, you could generate a header and use it to identify columns? – vvg Aug 27 '20 at 15:11
0

The OP has clarified that s/he needs to determine if the column contains one of the following:

  • email
  • password
  • ip_address
  • user_name
  • phone_no
  • null

There are a couple of approaches we could use:

Approach #1: Take a random sample of rows and analyze their column contents using heuristics

We could use the following heuristic rules to identify column content type.

  1. email: Use a regex to check for presence of a valid email.

    [Stackoverflow - How to validate an email address]

    https://www.regular-expressions.info/email.html

    https://emailregex.com/

  2. ip_address: Use a regex to match an ip_address.

    Stackoverflow - Validating IPv4 addresses with Regex

    Stackoverflow - Regular expression that matches valid IPv6 addresses

  3. username: Use a table of common first names or lastnames and search for them within the username

  4. phone_no: Strip +, SPACE, -, (, ) -- alternatively, all special characters. If you are left with all digits, we have a potential phone number

  5. null: All column contents in sample are null

  6. password: If it doesn't satisfy rules 1 through 5, we identify it as password

We should do the analysis independently on each column and keep track of how many sample items in the column matched each heuristic. Then we could pick the classification with the maximum number of matches.

Approach #2: Train a classifier using training data (obtained from real system) and use it to determine column content type

This is a machine learning classification task. A naive approach would be to take each column's data mapped to the content type as the training input.

Using the OP's sample set:

03-Sep-14,foo2@yahoo.co.jp,,
20-Jan-13,foo3@gmail.com,,
20-Feb-15,foo4@yahoo.co.jp,,
12-May-16,foo5@hotmail.co.jp,,
25-May-16,foo6@hotmail.co.jp,,

We would have:

data_content, content_type
03-Sep-14, date
20-Jan-13, date
20-Feb-15, date
12-May-16, date
25-May-16, date
foo2@yahoo.co.jp, email
foo3@gmail.com, email
foo4@yahoo.co.jp, email
foo5@hotmail.co.jp, email
foo6@hotmail.co.jp, email

We can then use machine learning to build a text-to-class multi-class classifier. Some references are given below:

Multi-Class Text Classification from Start to Finish

Multi-Class Text Classification with Scikit-Learn

vvg
  • 882
  • 2
  • 15