Questions tagged [data-wrangling]

482 questions
5
votes
4 answers

How to write an efficient wrapper for data wrangling, allowing to turn off any wrapped part when calling the wrapper

To streamline data wrangling, I write a wrapper function consisting of several "verb functions" that process the data. Each one performs one task on the data. However, not all tasks are applicable to all datasets that pass through this process, and…
Emman
  • 1,295
  • 8
  • 19
5
votes
1 answer

Data manipulation in Pandas: create a boolean column from values on column then fill with value from yet another column

ok, I've been trying this for too long, time to ask for help. I have a dataframe that looks a bit like this: person fruit quantity all_fruits 0 p1 grapes 2 [grapes, banana] 1 p1 banana 1 [grapes, banana] 2 p2…
4
votes
3 answers

How to get Pandas df.merge() mismatch column name

Given the following data: data_df = pd.DataFrame({ "Reference": ("A", "A", "A", "B", "C", "C", "D", "E"), "Value1": ("U", "U", "U--","V", "W", "W--", "X", "Y"), "Value2": ("u", "u--", "u","v", "w", "w", "x", "y") }, index=[1, 2, 3,…
Ricardo Sanchez
  • 3,945
  • 8
  • 44
  • 73
4
votes
4 answers

Top "n" rows of each group using dplyr -- with different number per group

I'll use the built-in chickwts data as an example. Here's the data, there are 5 feed types. > head(chickwts) weight feed 1 179 horsebean 2 160 horsebean 3 136 horsebean 4 227 horsebean 5 217 horsebean 6 168 horsebean >…
876868587
  • 2,802
  • 2
  • 16
  • 43
4
votes
3 answers

Check if values of one dataframe exist in another dataframe in exact order

I have 1 dataframe of data and multiple "reference" dataframes. I'm trying to automate checking if values of the dataframe match the values of the reference dataframes. Importantly, the values must also be in the same order as the values in the…
psychcoder
  • 447
  • 1
  • 7
4
votes
1 answer

R: Changing column names in pivot_wider() -- suffix to prefix

I'm trying to figure out how to alter the way in which tidyr's pivot_wider() function creates new variable names in resulting wide data sets. Specifically, I would like the "names_from" variable to be added to the prefix of the new variables rather…
mkpcr
  • 171
  • 10
3
votes
3 answers

Create a new column based on the values and heading of another dataset

Say I have an original dataset whose values in the first column are from a to d in the alphabet df1: a x1 b x2 c x3 d x4 e x5 and then I have another dataset which has multiple columns but whose entries reference the columns in the aforementioned…
user849541
  • 93
  • 5
3
votes
3 answers

Tidy data with variable in intermittent rows

I have a datalogger that inserts a row with a timestamp every time the logger is turned on. The timestamp string is always the same format, but there are an inconsistent number of readings per timestamp. How do I tidy the timestamp rows into a time…
JMDR
  • 109
  • 7
3
votes
1 answer

Pivot_longer() in R without separator?

I am trying to transform a table using pivot_longer() in R. But the separation is not by any common symbol such as "_" or "." but rather by how the column name ends (either "B" or "T"). I tried to use a regular expression, but without much success. Below…
Nick
  • 93
  • 5
3
votes
1 answer

Calculating the ratio between the average engine life expectancies

I have a small R dataframe below containing cars made in Japan and in Mexico from 2006 to 2008. I need to calculate the ratio between the average engine life for the cars built in Japan and Mexico for each year. I am using dplyr and so far I have…
Kintaro
  • 157
  • 1
  • 7
3
votes
1 answer

How to get a series from a pandas dataframe using a series of column names?

I have a pandas dataframe df with numeric data. I also have a series s with the same index as df and values consisting of df column labels, e.g. import pandas as pd df = pd.DataFrame( index=[0, 1, 2], columns=[0, 1, 2], data=[[1, 2, 3], [4,…
nijshar28
  • 93
  • 5
3
votes
4 answers

Is there a way to build a pairwise data frame based on shared values in another data frame in R?

For example DF1 is: Id1 Id2 1 10 2 10 3 7 4 7 5 10 And want DF2: Id1 Id2 1 2 1 5 2 5 3 4 The data frame DF2 is a pairwise set of values from Id1 column in DF1 that shared a common value in Id2 of…
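The pairing described in this excerpt can be sketched with a self-join. The question asks about R; the version below uses pandas instead, purely as an illustration, and the names `df1`, `pairs`, and `df2` and the `_a`/`_b` suffixes are assumptions chosen for this sketch. It joins DF1 to itself on the shared `Id2` value and keeps each unordered pair of `Id1` values once.

```python
import pandas as pd

# DF1 as given in the excerpt.
df1 = pd.DataFrame({"Id1": [1, 2, 3, 4, 5], "Id2": [10, 10, 7, 7, 10]})

# Self-join on the shared Id2 value: every Id1 is paired with every
# Id1 (including itself) that has the same Id2.
pairs = df1.merge(df1, on="Id2", suffixes=("_a", "_b"))

# Keep each unordered pair once and drop self-pairs.
pairs = pairs[pairs["Id1_a"] < pairs["Id1_b"]]

# Rename to the column names used for DF2 in the excerpt.
df2 = (
    pairs[["Id1_a", "Id1_b"]]
    .rename(columns={"Id1_a": "Id1", "Id1_b": "Id2"})
    .sort_values(["Id1", "Id2"])
    .reset_index(drop=True)
)
print(df2)
```

The self-join materialises all same-group combinations before filtering, so for very large groups a per-group `itertools.combinations` loop may be lighter on memory.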
3
votes
1 answer

Turn column levels inside-out

I have a pandas DataFrame which looks like this (code to create it is at the bottom of the question): col_1 col_2 foo_1 foo_2 col_3 col_4 col_3 col_4 0 1 4 2 8 5 7 1 3 1 6 3 8 …
ignoring_gravity
  • 3,911
  • 3
  • 16
  • 33
3
votes
3 answers

How do I find the two lowest values across selected columns in each row of a pandas dataframe?

In calculating grades, I drop each student's two lowest homework scores. A sample dataframe is shown here: df=pd.DataFrame([[10, 9, 10, 5, 7], [8, 7, 9, 9, 4], [10, 10, 7, 0, 8], [5, 9, 7, 6, 3], [10, 5, 0, 8, 10], [8, 9, 10, 10,…
3
votes
3 answers

How to create a percentage column based on the values present in every third row?

I have a data frame containing the values of weight. I have to create a new column, percentage change of weight, wherein the denominator takes the value of every third row. df <- data.frame(weight = c(30,30,109,30,309,10,20,20,14)) # expected…
Silent_bliss
  • 307
  • 6