Replace NAs with their respective column means from very large text file

I have a large text file with 400k rows and 10k columns; all data values are numeric (0, 1, or 2) and the file size is 5-10 GB. There are a few missing values (NAs) in the file. I want to replace each NA with the mean of its column, i.e. an NA in column 'x' must be replaced by the mean of column 'x'. These are the steps I want to perform:




  1. Compute the mean of each column of the text file (excluding the header row and starting from the 7th column); a minimal awk sketch of this step is shown right after this list

  2. Replace the NAs in each column with that column's mean

  3. Write the modified data back out as a text file
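
For illustration, a rough single-pass GNU awk sketch of step 1 alone, assuming that fields are whitespace-separated, that the literal string NA marks a missing cell, and that data.txt is a placeholder path:

# per-column means from the 7th column onwards, skipping the header row and NA cells
awk 'NR > 1 {
         if (NF > maxnf) maxnf = NF
         for (i = 7; i <= NF; i++)
             if ($i != "NA") { sum[i] += $i; cnt[i]++ }
     }
     END {
         for (i = 7; i <= maxnf; i++)
             if (cnt[i]) printf "column %d mean = %.4f\n", i, sum[i] / cnt[i]
     }' data.txt

A single pass like this already avoids re-reading the file once per column.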


Data subset:



IID  FID  PAT MAT SEX PHENOTYPE X1 X2 X3 X4......
1234 1234 0 0 1 -9 0 NA 0 1
2346 2346 0 0 2 -9 1 2 NA 1
1334 1334 0 0 2 -9 2 NA 0 2
4566 4566 0 0 2 -9 2 2 NA 0
4567 4567 0 0 1 -9 NA NA 1 1

# total 400k rows and 10k columns


Desired Output:



# Assuming only 5 rows as given in the above example.
# Mean of column X1 = (0 + 1 + 2 + 2)/4 = 1.25
# Mean of column X2 = (2 + 2)/2 = 2
# Mean of column X3 = (0 + 0 + 1)/3 = 0.33
# Mean of column X4 = No NAs, so no replacements

# Replacing NAs with respective means:

IID FID PAT MAT SEX PHENOTYPE X1 X2 X3 X4......
1234 1234 0 0 1 -9 0 2 0 1
2346 2346 0 0 2 -9 1 2 0.33 1
1334 1334 0 0 2 -9 2 2 0 2
4566 4566 0 0 2 -9 2 2 0.33 0
4567 4567 0 0 1 -9 1.25 2 1 1


I tried this:



file="path/to/data.txt"

# get the total number of columns from the first line
number_cols=$(awk '{print NF; exit}' "$file")

for ((i=7; i<=number_cols; i++))
do
    echo "$i"
    # mean of column i, skipping the header row and NA cells; the column index
    # must be passed to awk with -v, since $i inside single quotes is not
    # expanded by the shell
    mean+=("$(awk -v col="$i" 'NR > 1 && $col != "NA" { total += $col; n++ } END { if (n) print total / n }' "$file")")
done

# array of column means
echo "${mean[@]}"

# find and replace (newstr must still be replaced by the respective column mean)
find "$file" -type f -exec sed -i 's/NA/newstr/g' {} \;


However, this code is incomplete, and the loop is very slow because it re-reads the whole file once for every column (roughly 10k passes over a 5-10 GB file). Is there a faster way to do this? I also tried it in Python and R, but those were too slow as well. I am open to getting this done in any programming language as long as it is fast. Can someone please help me write the script?
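
For reference, here is a minimal two-pass GNU awk sketch of the whole task under the same kind of assumptions (whitespace-separated fields, the literal string NA as the only missing-value marker, and data.txt / data_imputed.txt as placeholder file names). It reads the file twice in total rather than once per column: the first pass accumulates per-column sums and counts, the second pass rewrites each row with the NAs replaced.

awk '
    NR == FNR {                      # pass 1: first read of the file
        if (FNR > 1)                 # skip the header row
            for (i = 7; i <= NF; i++)
                if ($i != "NA") { sum[i] += $i; cnt[i]++ }
        next
    }
    {                                # pass 2: rewrite every row
        for (i = 7; i <= NF; i++)
            if ($i == "NA" && cnt[i]) $i = sum[i] / cnt[i]
        print                        # fields are re-joined with single spaces
    }
' data.txt data.txt > data_imputed.txt

Note that assigning to a field makes awk rebuild the line with single-space separators, and replaced values are printed with awk's default numeric format (e.g. 0.333333 rather than 0.33); both can be adjusted with OFS and OFMT or printf if the exact layout matters.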



Thanks

Tags: linux bash shell
edited Jan 18 at 21:54 by Poshi
asked Jan 18 at 18:57 by user3720537

  • Please add your desired output for that sample input to your question.

    – Cyrus
    Jan 18 at 19:14











  • Added the desired output, thanks

    – user3720537
    Jan 18 at 20:02











  • I would recommend using a file format that is designed for huge files, like HDF5. This would definitely help you get and set the right values in the right places.

    – MaxPowers
    Jan 18 at 21:10






  • (1) First, your approach reads the entire file once per column, which is very inefficient. Column-wise operations require frequent seeks across rows anyway. Try transposing the data in advance so you can access all entries of a given column as a single line; then you can hold the column data in memory all at once and access it quickly. (2) Do not use bash, awk, or other scripting languages: their array-indexing mechanisms are not efficient in terms of access time. I'd use C or C++, which store array elements at contiguous addresses, making element access very fast.

    – tshiono
    Jan 19 at 2:50











  • (3) Please reconsider the purpose and the requirements. Imputing the NA values from other values means you are creating artificial data anyway. I don't think you need to use the whole column for this purpose; I'd just use the value in the previous line or a few lines around it. Depending on the purpose, it might be better to discard the rows that contain NA values as a data-cleansing step. That decision may require a statistics background and an engineering sense of compromise.

    – tshiono
    Jan 19 at 2:50