1

I have a data frame from which i want to extract specific value from a string with a condition.

    DF1

structure(list(Sr. No. = c("1", "2"), String= c("ABCD, your Account XX1987 has been credited with EUR 22,500.00 on 30-
            Oct-17. Info: CAM*CASH DEPOSIT*ELISH SEC. The Available Balance is EUR 
            22,951.57.", 
    "WXYZ, Your Ac XXXXXXXX1987 is debited with USD 5,000.00 on 14 
            May. Info. MMT*125485645*99999999. Your Net Available Balance is 
            USD 20,531.38.)")), .Names = c("Sr. No.", "String"), row.names = 1:2, class = "data.frame")

DF

Sr. No. String

1.      ABCD, your Account XX1987 has been credited with EUR 22,500.00 on 30-
        Oct-17. Info: CAM*CASH DEPOSIT*ELISH SEC. The Available Balance is EUR 
        22,951.57.
2.      WXYZ, Your Ac XXXXXXXX1987 is debited with USD 5,000.00 on 14 
        May. Info. MMT*125485645*99999999. Your Net Available Balance is 
        USD 20,531.38.

From that Dataframe i want below mentioned dataframe with specific condition.

Conditions:
1. Take first coming word credited/debited/credit/debit as "Credit" or "Debit" in type.
2. Take last four digit after your Account/your Ac/your a/c or your acc (or the string lookes like XXXX1234) in Acc.
3. Take first value coming after credited/debited/credit/debit word in the sring as Fig.
4. Take date after word "on" or which lookes like date from string in Date column.
5. Take description in desc after word Info:
6. Take balance after word Available Balance/Net Balance/Balance or Last Numeric figure in the string.

DF2
    Sr.No.      Type      Acc      Fig     Date       Desc             Balance
    1           Credit    1987     22,500  30-10-2017 Info: CAM*CASH   22,951
                                                      DEPOSIT*ELISH SEC.
    2           Debit     1987     5,000   14-May     Info.            20,531.38
                                                      MMT*125485645*99999999.
  • Can you use `dput` on your data so it's easier to make this reproducible? E.g., `dput(DF[1:2,])` – Tyler Rinker Jan 27 '18 at 16:07
  • I haven't tried it yet but i want to understand the logic so that i can train the code on different strings. –  Jan 27 '18 at 16:12
  • It is difficult to help you because you have given us something that can't be run in R. My suggestion above makes it so we can run you code and make DF easy...then it's easier to help you and makes it more likely that you'll get a quick solution. Use dput then edit your question and paste the output so we can grab it and run it in R – Tyler Rinker Jan 27 '18 at 16:20
  • Sure Tyler Rinker, l am trying dput. I'll share the revised version. –  Jan 27 '18 at 16:21
  • Also, have you tried some code that gives you errors, or do you not know where to begin? – Scott Hunter Jan 27 '18 at 16:25
  • I am not sure on R from where to begin, but the same exercise i am doing on excel and till now able to extract type and Fig. (=TRIM(LEFT(SUBSTITUTE(MID(H16,FIND("INR",H16),LEN(H16))," ",REPT(" ",100)),100))) –  Jan 27 '18 at 16:29

2 Answers2

2

I tried to write as general expressions as I could come up with but if the data is not always structured the same way there might be a need to tune Regex

library(stringr)
input = structure(list(
  `Sr. No.`=c("1", "2"), 
  String=c(
    "ABCD, your Account XX1987 has been credited with EUR 22,500.00 on 30-Oct-17. Info: CAM*CASH DEPOSIT*ELISH SEC. The Available Balance is EUR 22,951.57.", 
    "WXYZ, Your Ac XXXXXXXX1987 is debited with USD 5,000.00 on 14 May. Info. MMT*125485645*99999999. Your Net Available Balance is USD 20,531.38.)")), 
  .Names=c("Sr. No.", "String"), row.names=1:2, class="data.frame")

rule_13 = str_match(input$String, "(credit|debit)ed[^0-9]*((?:EUR|USD|INR|Rs) [0-9,.]+)")
rule_2 = str_match(input$String, "(?:Account|your Ac|your a/c|your acc|XX)[^0-9]*([0-9]+)")
rule_4 = str_match(input$String, " on ([0-9]+[ -](?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)|[0-9]+)(?:[ -][0-9]+)?)")
rule_5 = str_match(input$String, "\\bInfo\\b[^\\w\\d]+(.+)(?=\\. )")
rule_6 = str_match(input$String, "(?:Available Balance|Net Balance|Balance)[^0-9]*([0-9,.]+[0-9])")

data.frame(
    Sr.No=input$`Sr. No.`,
    Type=rule_13[,2],
    Acc=rule_2[,2],
    Fig=rule_13[,3],
    Data=rule_4[,2],
    Desc=rule_5[,2],
    Balance=rule_6[,2])

Output

Sr.No   Type  Acc       Fig      Data                       Desc   Balance
    1 credit 1987 22,500.00 30-Oct-17 CAM*CASH DEPOSIT*ELISH SEC 22,951.57
    2  debit 1987  5,000.00    14 May     MMT*125485645*99999999 20,531.38
Sergej Andrejev
  • 8,543
  • 11
  • 65
  • 107
  • I was working on a very similar solution. The regular expressions probably will need to be tuned based on the different possibilities of the real dataset. Here's a good place to play around with regexes: https://regexr.com/ – zelite Jan 27 '18 at 16:45
  • Thanks Sergej Andrejev, I think i have missed one point i don't want Sr. No. in Dataframe instead i have unique id against many strings but this code convert that unique id into Sr. No. (1,2,3,....) i what that particular unique id against each output (i.e if for unique id 123 there are 10 columns that i want output as 123,123,123 etc with extracted field value.) –  Jan 27 '18 at 16:47
  • Sure, I modified the code. I hope I understood your comment correctly – Sergej Andrejev Jan 27 '18 at 16:59
  • Thanks Sergej Andrejev, can you please explain me how can i make such such regex for different kind of string by myself. –  Jan 27 '18 at 17:13
  • @SergeAndrejev, The rule 13 Capturing Type as "Credit" and Fig as"5676757" from this string "credit card statement, only on your email id ABC@GMAIL.COM please sms ESTMT YES to 5676757". Please make the rule13 to capture value when its written Credited/Debited/credited/debited to suppress this kind of error and please explain me bit to how to do modifications in regex so that i can do this by myself. –  Jan 27 '18 at 17:30
  • @SergejAndrejev can we built a logic here for Fig that if EUR,USD,INR or Rs. is written before any numeric value than and only than it consider it as Fig in Output dataframe. –  Jan 27 '18 at 17:43
  • @SergejAndrejev how to cater this string "INR 187,314.00 credited to your A/c No XXXXXXX1234 on 31/10/17 through NEFT with UTR " in this it's capturing 1234 as a Fig instead of 187,314.00 –  Jan 27 '18 at 17:46
  • As I said, the more unstructured data you have the more you will have to tune it and I'm not willing to do your work :), I'm just here to help. I added credit**ed**|debit**ed** and EUR|USD|INR|Rs corrections to the regex, but you will have to solve the third on your own. In your string "INR 187,314.00 credited" the amount comes after the action and that was not what was your initial requirement. – Sergej Andrejev Jan 27 '18 at 18:02
  • @SergejAndrejev Can you give me some hint to solve 3rd point. –  Jan 27 '18 at 18:23
  • I would write another rule that searches for amount before "credited|debited" and if use output from it in cases where the original rule didn't work – Sergej Andrejev Jan 27 '18 at 18:45
  • Will you please explain bit about regex so that I don't have to disturb you again and again... –  Jan 27 '18 at 18:53
  • 1
    Learn about regexes: https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean/ – zelite Jan 27 '18 at 19:25
  • @SergejAndrejev i have writted "(:?EUR|USD|INR|Rs.)[^0-9]*([0-9,.]+)" regex which gives me Fig which comes before credit|debit but don't to how to make two different rule in this case. along with the previously shared rules. –  Jan 28 '18 at 05:46
0

Following the pattern, you can combine all the regular expressionas in a single line and extract the information:

 pat=c(Account="(?<=X)\\d+",
 Type="(credit|debit)",
 Fig="(\\w{1,3}\\s\\d+.*\\.\\d+\\s)",
 Date="(\\d+\\s\\w+\\.)|(?<=on\\s)(\\d+\\W\\w+\\W\\d+)",
 Decs="(Info.*\\.\\s)",
 Balance="(?<=Balance\\s\\is\\s).*\\.")
 data.frame(mapply(str_extract,DF[2],pat))
  String    NA.           NA..1     NA..2                              NA..3          NA..4
1   1987 credit  EUR 22,500.00  30-Oct-17 Info: CAM*CASH DEPOSIT*ELISH SEC.  EUR 22,951.57.
2   1987  debit   USD 5,000.00    14 May.     Info. MMT*125485645*99999999.  USD 20,531.38.
3   1234 credit INR 187,314.00   31/10/17                    
Onyambu
  • 31,432
  • 2
  • 14
  • 36
  • Thanks but not working properly, though above mentioned regex by Sergej Andrejev working fine but i'm stuck where amount comes before credit/debit (i.e "INR 187,314.00 credited to your A/c No XXXXXXX1234 on 31/10/17 through NEFT with UTR "). –  Jan 28 '18 at 09:45
  • Wheris the issue? Where is it not working? Is there anything that is not extracted?I might give a longer pattern and see whether that works!! – Onyambu Jan 28 '18 at 14:41
  • Can you add an example of a line which doesn't work? – Onyambu Jan 28 '18 at 14:56
  • you can check with the answer given by Sergej Andrejev. In addition I want to extract amount where it comes before credit/debit (i.e "INR 187,314.00 credited to your A/c No XXXXXXX1234 on 31/10/17 through NEFT with UTR "). in the same rule. –  Jan 28 '18 at 15:02
  • have you find any solution for this...I think IF condition will work but don't know how to execute it with both these rule. –  Jan 28 '18 at 16:01
  • I can be able to extrac them but not yet rearranged them. Look at the edit: – Onyambu Jan 28 '18 at 16:02
  • I have finally solved the issue. Look at the edited code – Onyambu Jan 28 '18 at 16:37
  • The output is very unstructured. –  Jan 28 '18 at 16:49
  • What does that mean? In the tree examples you gave I did output what was intended. What do you mean by the output being unstructured? – Onyambu Jan 28 '18 at 16:51
  • I have uploaded the dataframe in Answer Tab for your reference. Let me tell you my question again I have a dataframe with Two column one is 1.MobileNumber 2.Sms and for that I want the desired output as you have done in the code, for out put the first column should be mobile number then acc , fig, type etc. –  Jan 28 '18 at 16:52
  • sorry this just deals with the bank information given above. Try it and you will see that it does work I tend to believe any questions should be asked anew. Thank you. – Onyambu Jan 28 '18 at 16:56