
I'm trying to build a self-contained Jupyter notebook that parses a long address string into a pandas dataframe for demonstration purposes. Currently I'm having to highlight the entire string and use pd.read_clipboard:

data = pd.read_clipboard(f,
                  comment='#', 
                  header=None, 
                  names=['address']).values.reshape(-1, 2)

matched_address = pd.DataFrame(data, columns=['addr_zagat', 'addr_fodor'])

I'm wondering if there is an easier way to read the string in directly instead of relying on having something copied to the clipboard. Here are the first few lines of the string for reference:

f = """###################################################################################################
# 
#   There are 112 matches between the tuples.  The Zagat tuple is listed first, 
#   and then its Fodors pair.
#
###################################################################################################

Arnie Morton's of Chicago 435 S. La Cienega Blvd. Los Angeles 90048 310-246-1501 Steakhouses

Arnie Morton's of Chicago 435 S. La Cienega Blvd. Los Angeles 90048 310/246-1501 American
########################

Art's Deli 12224 Ventura Blvd. Studio City 91604 818-762-1221 Delis

Art's Delicatessen 12224 Ventura Blvd. Studio City 91604 818/762-1221 American
########################

Bel-Air Hotel 701 Stone Canyon Rd. Bel Air 90077 310-472-1211 Californian

Hotel Bel-Air 701 Stone Canyon Rd. Bel Air 90077 310/472-1211 Californian
########################

Cafe Bizou 14016 Ventura Blvd. Sherman Oaks 91423 818-788-3536 French Bistro

Cafe Bizou 14016 Ventura Blvd. Sherman Oaks 91423 818/788-3536 French
########################"""

Does anybody have any tips as to how to parse this string directly into a pandas dataframe?

I realise there is a related question, "Create Pandas DataFrame from a string", but the string there is semicolon-delimited and its format is quite different from the one in my example.


1 Answer

You should add an example of what your output should look like, but in general I would suggest something like this:

import pandas as pd

# read the file and split it into lines
with open("./your_file.txt", "r") as fh:
    lines = fh.read().split('\n')
accumulator = []
# loop through lines
for line in lines:
    # keep only record lines: blank lines and '#' lines do not start with an uppercase letter
    if len(line) > 1 and line[0].isupper():
        digits = [c for c in line if c.isdigit()]
        if not digits:
            continue  # no street number, so the line cannot be split this way
        # the name is everything before the first digit (the start of the street number)
        name, rest = line.split(digits[0], 1)
        rest = digits[0] + rest
        # restaurant type is the last whitespace-separated token
        # (a multi-word type such as "French Bistro" is not handled by this split)
        rest, rest_type = rest.rsplit(None, 1)
        # the phone number is now the last token; the remainder should be the address
        address, number = rest.rsplit(None, 1)
        accumulator.append([name.strip(), rest_type, number, address.strip()])
# pass the accumulated rows with column names to the DataFrame constructor
df = pd.DataFrame(accumulator, columns=['name', 'restaurant_type', 'phone_number', 'address'])
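Since the question is about a string that is already in memory rather than a file, here is a minimal sketch that works on the string directly. It assumes the records always come in Zagat/Fodor pairs and that every line starting with `#` is a comment or separator. The names `pair_addresses` and `RECORD_PATTERN` are made up for illustration, and the regex additionally assumes that restaurant names contain no digits and that phone numbers look like `818-762-1221` or `818/762-1221`, so treat it as a starting point rather than a complete parser.

```python
import pandas as pd

# Shortened sample in the question's format.
text = """####
# header comment
####

Art's Deli 12224 Ventura Blvd. Studio City 91604 818-762-1221 Delis

Art's Delicatessen 12224 Ventura Blvd. Studio City 91604 818/762-1221 American
########################

Cafe Bizou 14016 Ventura Blvd. Sherman Oaks 91423 818-788-3536 French Bistro

Cafe Bizou 14016 Ventura Blvd. Sherman Oaks 91423 818/788-3536 French
########################"""


def pair_addresses(raw):
    """Return one row per match: the Zagat line first, then its Fodor line."""
    # Drop blank lines and '#' comment/separator lines.
    records = [ln.strip() for ln in raw.splitlines()
               if ln.strip() and not ln.lstrip().startswith("#")]
    if len(records) % 2:
        raise ValueError("expected an even number of record lines")
    return pd.DataFrame({"addr_zagat": records[0::2],
                         "addr_fodor": records[1::2]})


# Hypothetical pattern: name (no digits), address starting at the first digit,
# a phone number, then everything left over as the restaurant type.
RECORD_PATTERN = (r"^(?P<name>\D+?)\s+(?P<address>\d.*?)\s+"
                  r"(?P<phone_number>\d{3}[-/]\d{3}-\d{4})\s+"
                  r"(?P<restaurant_type>.+)$")

matched_address = pair_addresses(text)
# Named groups become the column names of the extracted frame.
fields = matched_address["addr_zagat"].str.extract(RECORD_PATTERN)
```

Anchoring on the phone number instead of splitting on whitespace keeps multi-word types such as "French Bistro" in one piece. Lines that do not match the pattern come back as rows of NaN, which makes them easy to spot.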