Is there an R or Python function for separating information in non-delimited strings, where the information varies?

Question

I am currently cleaning up a messy data sheet in which information is given in one excel cell where the different characteristics are not delimited (no comma, spaces are random). Thus, my problem is to separate the different information without a delimitation I could use in my code (can't use a split command)

I assume that I need to include some characteristics of each part of information, such that the corresponding characteristic is recognized. However, I don't have a clue how to do that since I am quite new to Python and I only worked with R in the framework of regression models and other statistical analysis.

Short data example: INPUT:

"WMIN CBOND12/05/2022 23554132121"

or

"WalMaInCBND 12/05/2022-23554132121"

or

"WalmartI CorpBond12/05/2022|23554132121"

EXPECTED OUTPUT:

"Walmart Inc.", "Corporate Bond", "12/05/2022", "23554132121"

So each of the "x" should be classified in a new column with the corresponding header (Company, Security, Maturity, Account Number)

As you can see the input varies randomly but I want to have the same output for each of the three inputs given above (I have over 200k data points with different companies, securities etc.)

First Problem is how to separate the information effectively without being able to use a systematic pattern.

Second Problem (lower priority) is how to identify the company without setting up a dictionary with 50 different inputs for 50k companies.

Thanks for your help!

Could you add more examples? I feel like you could split by the space, and then separate the date from the text to achieve the 4 columns. After that you will have to provide some way of transforming those initials into the full brand. — Adirio, Aug 27 '19 at 07:38

boski · Answer 1 · 2019-08-27T09:06:00.033

0

edit

s = c("WMIN CBOND12/05/2022 23554132121",
      "WalMaInCBND 12/05/2022-23554132121",
      "WalmartI CorpBond12/05/2022|23554132121")


ID = gsub("([a-zA-Z]+).*","\\1",s)
ID2 = gsub(".* ([a-zA-Z]+).*","\\1",s)
date = gsub("[a-zA-Z ]+(\\d+\\/\\d+\\/\\d+).*","\\1",s)
num = gsub("^.*[^0-9](.*$)","\\1",s)

data.frame(ID=ID,ID2=ID2,date=date,num=num,stringsAsFactors=FALSE)

           ID                                ID2       date         num
1        WMIN                              CBOND 12/05/2022 23554132121
2 WalMaInCBND WalMaInCBND 12/05/2022-23554132121 12/05/2022 23554132121
3    WalmartI                           CorpBond 12/05/2022 23554132121

Works for cases 1 and 3 but I haven't figured out a logic for the second case, how can we know where to split the string containing the company and security if they are not separated?

edited Aug 27 '19 at 09:06

answered Aug 27 '19 at 07:53

boski

2,317
1
10
27

Thanks for your suggestions! I cannot separate the data by spaces. Maybe I should've given you more different examples, sorry for that. The data structure varies randomly (I have over 200k data points). Sometimes I get all the information without any spaces or different delimitation. So I can't use a systematic pattern of delimitation. E.g. "WMINCBOND 12/05/202223554132121" or "WMIN C BD-12/05/2022 23554132121" – nik_126 Aug 27 '19 at 08:01
@nik_126 you should update your question with a few more examples that give a clear picture of the different cases that can be encountered, it will make it easier to help you! – boski Aug 27 '19 at 08:20

Adirio · Answer 2 · 2019-08-27T08:05:54.050

Using python and regular expressions:

import re

def make_filter(pattern):
    pattern = re.compile(pattern)
    def filter(s):
        filtered = pattern.match(s)
        return filtered.group(1), filtered.group(2), filtered.group(3), filtered.group(4)
    return filter

filter = make_filter("^([a-zA-Z]+)\s([a-zA-Z]+)(\d+/\d+/\d+)\s(\d+)$")

filter("WMIN CBOND12/05/2022 23554132121")

The make_filter function is just an utility to allow you to modify the pattern. It returns a function that will filter the output according to that pattern. I use it with the "^([a-zA-Z]+)\s([a-zA-Z]+)(\d+/\d+/\d+)\s(\d+)$" pattern that considers some text, an space, some text, a date, an space, and a number. If you want to kodify this pattern provide more info about it. The output will be ("WMIN", "CBOND", "12/05/2022", "23554132121").

I think you're missing a `return filter` at the end of `make_filter`. — tfw, Aug 27 '19 at 08:02

score 0 · Answer 3 · answered Aug 27 '19 at 07:54

I recommend to first introduce useful seperators where possible and construct a dictionary of replacements for processing with regular expressions.

import re
s = 'WMIN CBOND12/05/2022 23554132121'

# CAREFUL this not a real date regex, this should just
# illustrate the principle of regex
# see https://stackoverflow.com/a/15504877/5665958 for
# a good US date regex
date_re = re.compile('([0-9]{2}/[0-9]{2}/[0-9]{4})')

# prepend a whitespace before the date
# this is achieved by searching the date within the string
# and replacing it with itself with a prepended whitespace
# /1 means "insert the first capture group", which in our
# case is the date
s = re.sub(date_re, r' \1', s)

# split by one or more whitespaces and insert
# a seperator (';') to make working with the string
# easier
s = ';'.join(s.split())

# build a dictionary of replacements
replacements = {
    'WMIN': 'Walmart Inc.',
    'CBOND': 'Corporate Bond',
}

# for each replacement apply subsitution
# a better, but more replicated solution for
# this is given here:
# https://stackoverflow.com/a/15175239/5665958
for pattern, r in replacements.items():
    s = re.sub(pattern, r, s)

# use our custom separator to split the parts
out = s.split(';')
print(out)

Why are you changing the space to a semi-colon to later remove it? — Adirio, Aug 27 '19 at 07:59
You are right this seems to make no sense. But i wanted to make the data consistent. nik_126 mentioned that the data is potentially dirty (whitespaces and so on), which is why I first wanted to clean up the string. Further, after replacing with re.sub, i have introduced whitespaces, thus I cannot split on whitespaces anymore. But you could also first split everything on whitespaces and work from there on, which is probably more efficient but also more complex. — Tankred, Aug 27 '19 at 08:13
Then what if a replacement contains a semi-colon? Try to keep different tasks in different functions. Splitting the 4 columns first. Formatting each column should be done later. — Adirio, Aug 27 '19 at 08:19

score 0 · Answer 4 · answered Aug 27 '19 at 08:33

welcome! Yeah, we would definitely need to see more examples and regex seems to be the way to go... but since there seems to be no structure, I think it's better to think of this as seperate steps.

We KNOW there's a date which is (X)X/(X)X/XXXX (ie. one or two digit day, one or two digit month, four digit year, maybe with or without the slashes, right?) and after that there's numbers. So solve that part first, leaving only the first two categories. That's actually the easy part :) but don't lose heart!
if these two categories might not have ANY delimiter (for example WMINCBOND 12/05/202223554132121, or delimiters are not always delimiters for example IMAGINARY COMPANY X CBOND, then you're in deep trouble. :) BUT this is what we can do:
1. Gather a list of all the codes (hopefully you have that).
2. use str_detect() on each code and see if you can recognize the exact string in any of the dataset (if you do have the codes lemme know I'll write the code to do this part).
What's left after identifying the code will be the CBOND, whatever that is... so do that part last... what's left of the string will be that. Alternatively, you can use the same str_detect() if you have a list of whatever CBOND stuff is.
ONLY AFTER YOU'VE IDENTIFIED EVERYTHING, you can then replace the codes for what they stand for. If you have the code-list let me know and I'll post the code.

Thank you very much! What do you mean with codes in 2.1 (Gather all the codes)? Regarding 2.: Yes the data does not have any delimiters and random spaces as you can see in my edited post now. That's the reason I'm struggling with the separation :( CBOND is just an acronym for corporate bonds but the acronym is changing randomly as well (but I will try to solve this problem after the successful separation) — nik_126, Aug 27 '19 at 13:33
2.1 means, have a list of all codes. And then you can check each one against your dataset using `str_detect()`. An efficient way to do that is to use `purrr::map` do you know how? — Amit Kohli, Aug 27 '19 at 15:54

Is there an R or Python function for separating information in non-delimited strings, where the information varies?

4 Answers4